The documents we want to index come in many formats; e.g., HTML, PDF, RTF, Word, Excel, etc., etc., etc. I've been searching to find parsers that will translate each of these formats to indexable text, but have had little success. Any help will be appreciated.

-- Posted via http://www.ruby-forum.com/.
Hi Dick,

You may need to turn to some external tools. Something similar to this was discussed before and some tools were suggested. See: http://www.ruby-forum.com/topic/103374

Assuming the text is stored as single-byte ASCII, you could fall back on the "strings" command as a last resort. It should already be installed on modern GNU/Linux distros; try Cygwin on Windows. It reads in any data and outputs all "printable character sequences".

John.

On Wed, 2007-04-25 at 19:14 +0200, Dick Monahan wrote:
> The documents we want to index come in many formats; e.g., HTML, PDF,
> RTF, Word, Excel, etc., etc., etc. I've been searching to find parsers
> that will translate each of these formats to indexable text, but have
> had little success. Any help will be appreciated.

-- http://johnleach.co.uk
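For anyone who cannot install "strings", the same last-resort idea can be sketched in plain Ruby. This is only an approximation of what the command does, not a port of it: the method name printable_runs and the default minimum run length of 4 are assumptions made for this illustration, and, like "strings" on ASCII data, it will miss text stored in multi-byte encodings.

```ruby
# Rough stand-in for the "strings" fallback: return every run of at least
# min_len printable ASCII characters (space through tilde) found in a blob
# of arbitrary binary data.
# NOTE: printable_runs and the default min_len of 4 are illustrative choices.
def printable_runs(data, min_len = 4)
  # String#b gives a binary (ASCII-8BIT) copy, so invalid byte sequences
  # in the input never raise an encoding error during the scan.
  data.b.scan(/[\x20-\x7E]{#{min_len},}/n)
end

# Possible usage (the file name is hypothetical):
#   text = printable_runs(File.binread("report.doc")).join("\n")
```

The output is noisy (format markers and font names come out along with the real text), so it is best treated as a fallback for formats that have no proper converter.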
Hello Dick, and all (first post),

Here are some more that I use:

HTML to text: Vilistextum http://bhaak.dyndns.org/vilistextum/
also lynx: http://lynx.browser.org/

PDF to text: pdftotext, from Xpdf http://www.foolabs.com/xpdf/

WordPerfect to text: wpd2text, from libwpd http://libwpd.sourceforge.net/

Converting other text encodings: iconv http://www.gnu.org/software/libiconv/

-Stuart Sierra
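One way to tie these tools together from Ruby is to pick a converter by file extension and capture its standard output. The following is a minimal sketch under several assumptions: the CONVERTERS table and the method names converter_command and extract_text are hypothetical, the tools are assumed to be on the PATH, and the command-line arguments shown ("-" as pdftotext's output file, "-dump -nolist" for lynx) should be checked against the installed versions. Anything without a known converter falls back to "strings", as suggested earlier in the thread.

```ruby
require "open3"

# Extension => lambda that builds the argument list for an external converter.
# The tools are the ones named in this thread; the exact flags are
# assumptions to verify against each tool's man page.
CONVERTERS = {
  ".pdf"  => ->(path) { ["pdftotext", path, "-"] },          # "-" sends text to stdout
  ".html" => ->(path) { ["lynx", "-dump", "-nolist", path] },
  ".htm"  => ->(path) { ["lynx", "-dump", "-nolist", path] },
  ".wpd"  => ->(path) { ["wpd2text", path] }
}.freeze

# Return the command (as an argv array) for a file, or nil if the
# extension has no registered converter.
def converter_command(path)
  builder = CONVERTERS[File.extname(path).downcase]
  builder && builder.call(path)
end

# Run the converter and return its output as a string. Unknown formats
# fall back to "strings". Raises if the tool exits with an error; a
# missing executable surfaces as Errno::ENOENT from Open3.
def extract_text(path)
  argv = converter_command(path) || ["strings", path]
  text, status = Open3.capture2(*argv)
  raise "#{argv.first} failed on #{path}" unless status.success?
  text
end
```

Passing the command as an array rather than a single string keeps the shell out of the picture, so file names with spaces or quotes need no escaping. The text returned by extract_text can then be handed to the indexer, with iconv run first if the converter's output is not in the encoding the index expects.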