Les Mikesell
2009-Aug-28 17:20 UTC
[CentOS] OT: .doc,.xls,.pdf,.ppt (etc.) string parser/indexers
Does anyone have experience with Linux tools to parse the text from common non-text file formats for searching?

I'm trying to use the KinoSearch add-on for TWiki, which is fine as far as the search goes, but it takes forever to generate the index. It uses xpdf to extract strings from PDFs, antiword for .doc, and, since it is Perl, the Spreadsheet::ParseExcel module for .xls. Some documents parse and index quickly, some extremely slowly, and in the .xls case some seem to hang forever.

I suspect the real problem arises when a parser (correctly or incorrectly) detects a wide character set and the indexer then gets confused trying to re-encode it. What is the best approach to debugging something that might be in the Perl character set handlers?

--
Les Mikesell
lesmikesell at gmail.com
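One way to narrow down which documents stall the indexer is to run each extractor as a separate process with a time limit and to decode its output leniently, so that a single bad file or byte sequence cannot block the whole run. The following is only a minimal sketch in Python, not part of the KinoSearch add-on; the function name `extract_text`, the status strings, and the 30-second default are assumptions made for illustration.

```python
import subprocess
import sys


def extract_text(cmd, timeout_s=30.0):
    """Run an external extractor command and return (status, text).

    status is one of "ok", "timeout", "missing" or "error" (hypothetical
    labels chosen for this sketch). Output is decoded as UTF-8 with
    undecodable bytes replaced, so odd encodings cannot raise here.
    """
    try:
        proc = subprocess.run(
            cmd,
            stdout=subprocess.PIPE,
            stderr=subprocess.PIPE,
            timeout=timeout_s,  # the child is killed if it runs longer
        )
    except subprocess.TimeoutExpired:
        return ("timeout", "")
    except FileNotFoundError:
        return ("missing", "")  # extractor binary is not installed
    if proc.returncode != 0:
        return ("error", "")
    return ("ok", proc.stdout.decode("utf-8", errors="replace"))


if __name__ == "__main__":
    # Stand-in command so the sketch runs anywhere. With the real tools one
    # might pass something like ["pdftotext", "-enc", "UTF-8", path, "-"]
    # or ["antiword", "-m", "UTF-8.txt", path] instead.
    status, text = extract_text([sys.executable, "-c", "print('sample text')"])
    print(status, len(text))
```

Logging every file that comes back as "timeout" or "error" gives a short list of documents to feed to the parser by hand, which should help separate a slow extractor from a re-encoding problem in the indexer.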
On Fri, Aug 28, 2009 at 7:20 AM, Les Mikesell <lesmikesell at gmail.com> wrote:
> Does anyone have experience with linux tools to parse the text from
> common non-text file formats for searching?

http://en.wikipedia.org/wiki/Pdftotext ?
Rajagopal Swaminathan
2009-Aug-30 03:36 UTC
[CentOS] OT: .doc,.xls,.pdf,.ppt (etc.) string parser/indexers
Greetings,

On Fri, Aug 28, 2009 at 10:50 PM, Les Mikesell <lesmikesell at gmail.com> wrote:
> Does anyone have experience with linux tools to parse the text from
> common non-text file formats for searching? I'm trying to use the
> kinosearch add-on for twiki which is fine as far as the search goes, but
> it takes forever to generate the index.

I am not sure this answers your question directly, but I have seen the Lucene .net SDK (with extensions to scour .doc, .odt, .pdf, etc.) used to very good effect and with pretty decent performance.

HTH

Thanks and Regards,
Rajagopal