I was wondering if it is possible to search Word documents using Ferret. The actual text in a Word document isn't in a binary format - only the formatting. Surely it would be possible to parse that?

--
Posted via http://www.ruby-forum.com/.
Charlie Hubbard
2006-Nov-18 15:33 UTC
[Ferret-talk] acts_as_ferret and searching word docs
Alex MacCaw wrote:
> I was wondering if it is possible to search Word documents using Ferret.
> The actual text in a Word document isn't in a binary format - only the
> formatting. Surely it would be possible to parse that?

You might be able to use some of the Ruby extensions for the M$ platform to get at the data through COM. Or, if you don't want to run on the M$ platform, you could possibly use Java's POI from Jakarta to parse out the text and put it into something that Ruby could then put into Ferret.

Charlie
Charlie Hubbard wrote:
> Alex MacCaw wrote:
>> I was wondering if it is possible to search Word documents using Ferret.
>> The actual text in a Word document isn't in a binary format - only the
>> formatting. Surely it would be possible to parse that?
>
> You might be able to use some of the extensions for M$ platform and ruby
> to use COM to get the data. Or if you don't want to run on M$ platform
> you could possibly use Java's POI from Jakarta to parse out the text and
> put it into something that Ruby could then put into ferret.
>
> Charlie

Or there's AbiWord - it runs on all platforms and outputs nice text. If you don't want graphical dependencies, there's wvWare, too. I'm using it at the moment.

--
Alex
On Sat, Nov 18, 2006 at 04:33:26PM +0100, Charlie Hubbard wrote:
> Alex MacCaw wrote:
> > I was wondering if it is possible to search Word documents using Ferret.
> > The actual text in a Word document isn't in a binary format - only the
> > formatting. Surely it would be possible to parse that?
>
> You might be able to use some of the extensions for M$ platform and ruby
> to use COM to get the data. Or if you don't want to run on M$ platform
> you could possibly use Java's POI from Jakarta to parse out the text and
> put it into something that Ruby could then put into ferret.

I successfully used the wv utilities (wvText or wvHtml; on Debian, do 'apt-get install wv') to index Word documents with Ferret. You can have a look at RDig (http://rubyforge.org/projects/rdig) to see an example of how this could be done.

Jens

--
webit! Gesellschaft für neue Medien mbH          www.webit.de
Dipl.-Wirtschaftsingenieur Jens Krämer       kraemer at webit.de
Schnorrstraße 76                         Tel +49 351 46766  0
D-01069 Dresden                          Fax +49 351 46766 66
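As a rough illustration of the wvText approach described above (this is not RDig's actual code), the sketch below shells out to wvText to turn a .doc file into plain text and then adds the result to an index. It assumes the wv utilities are installed and that wvText is on the PATH. The helper names `extract_word_text` and `index_word_doc` and the field names `:file` and `:content` are made up for the example. The index can be any object that accepts a hash via `<<`, which a Ferret::Index::Index does.

```ruby
require 'tempfile'

# Extract plain text from a Word file by calling `wvText input.doc output.txt`.
# Assumption: the wv utilities are installed and wvText is on the PATH.
def extract_word_text(doc_path)
  Tempfile.create(['wv', '.txt']) do |out|
    out.close # wvText writes to the path itself, so release our handle first
    ok = system('wvText', doc_path, out.path)
    raise "wvText failed for #{doc_path}" unless ok
    File.read(out.path)
  end
end

# Add one document to an index that accepts hashes via <<.
# The extractor is injectable so a different converter can be swapped in.
# Field names (:file, :content) are hypothetical.
def index_word_doc(index, doc_path, extractor = method(:extract_word_text))
  text = extractor.call(doc_path)
  index << { :file => doc_path, :content => text }
end
```

With Ferret loaded, the index would typically be created with something like `Ferret::Index::Index.new(:path => '/path/to/index')` and passed as the first argument. Because the extractor is a parameter, the same function could take a different converter on a platform where wvText is not available.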
> I successfully used the wv-utilities (wvText or wvHtml, on debian do
> 'apt-get install wv') to index word documents with Ferret.

Thanks Jens,

Is there any way to do this on Windows, or will I just have to wait until I deploy on Linux?