thr3ads.net - Ferret talk - [Ferret-talk] Parsers for input to index? [Apr 2007]


Dick Monahan

2007-Apr-25 17:14 UTC

[Ferret-talk] Parsers for input to index?

The documents we want to index come in many formats;  e.g., HTML, PDF,
RTF, Word, Excel, etc., etc., etc.  I've been searching to find parsers
that will translate each of these formats to indexable text, but have
had little success.  Any help will be appreciated.

-- 
Posted via http://www.ruby-forum.com/.

John Leach

2007-Apr-25 17:40 UTC

Hi Dick,

You may need to turn to some external tools.

Something similar to this was discussed before, and some tools were suggested.

See: http://www.ruby-forum.com/topic/103374

Assuming the text is stored as single-byte ASCII, you could fall back on
the "strings" command as a last resort.  It should already be installed
on modern GNU/Linux distros; on Windows, try Cygwin.  It reads in any
data and outputs all "printable character sequences".
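
To illustrate what "printable character sequences" means, here is a minimal pure-Ruby sketch that approximates the behaviour of "strings". The helper name printable_runs and the default minimum run length of 4 are assumptions made for this example; the real command has more options and may treat some bytes differently.

```ruby
# Rough approximation of `strings`: return every run of at least min_len
# printable ASCII characters (space through tilde, plus tab) in raw data.
# printable_runs is a hypothetical helper, not part of Ferret or any gem.
def printable_runs(data, min_len = 4)
  # .b gives a binary (ASCII-8BIT) copy, so arbitrary bytes are safe to scan.
  data.b.scan(/[\x20-\x7E\t]{#{min_len},}/)
end

# Example: pull readable text out of a binary blob.
blob = "\x00\x01Hello world\xFF\x02ab\x00index me\x00".b
puts printable_runs(blob)
```

Short runs such as "ab" are dropped, which keeps most binary noise out of the index, but this only helps when the file stores its text uncompressed and in a single-byte encoding.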

John.

On Wed, 2007-04-25 at 19:14 +0200, Dick Monahan wrote:
> The documents we want to index come in many formats;  e.g., HTML, PDF,
> RTF, Word, Excel, etc., etc., etc.  I've been searching to find parsers
> that will translate each of these formats to indexable text, but have
> had little success.  Any help will be appreciated.
> -- 
http://johnleach.co.uk

Stuart Sierra

2007-Apr-25 18:08 UTC

Hello Dick, and all (first post),

Here are some more that I use:

HTML to text: Vilistextum
http://bhaak.dyndns.org/vilistextum/
also lynx:
http://lynx.browser.org/

PDF to text: pdftotext, from Xpdf
http://www.foolabs.com/xpdf/

WordPerfect to text: wpd2text, from libwpd
http://libwpd.sourceforge.net/

Converting other text encodings: iconv
http://www.gnu.org/software/libiconv/
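
One way to tie these tools together is a small dispatcher that picks a converter by file extension and reads the text it writes to standard output. The sketch below is only an assumption about how that might look: the CONVERTERS table, the FALLBACK choice, and the helper names are hypothetical, and each tool must already be installed and on the PATH.

```ruby
# Hypothetical table: file extension => external command that prints plain
# text to standard output. :path marks where the file name is substituted.
CONVERTERS = {
  ".pdf"  => ["pdftotext", :path, "-"],           # "-" sends text to stdout
  ".html" => ["lynx", "-dump", "-nolist", :path],
  ".htm"  => ["lynx", "-dump", "-nolist", :path],
  ".wpd"  => ["wpd2text", :path]
}.freeze

# Last resort for unknown formats, as suggested earlier in the thread.
FALLBACK = ["strings", :path].freeze

# Build the argument list for the converter that matches the extension.
def converter_command(path)
  template = CONVERTERS.fetch(File.extname(path).downcase, FALLBACK)
  template.map { |arg| arg == :path ? path : arg }
end

# Run the converter and return its output as a string. Passing an array to
# IO.popen avoids the shell, so odd file names need no quoting.
def extract_text(path)
  IO.popen(converter_command(path), &:read)
end
```

For example, converter_command("report.PDF") yields the pdftotext command line, while an unrecognised extension falls through to strings. The string returned by extract_text can then be run through iconv if needed and added to the index as a field value.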

-Stuart Sierra


John Leach wrote:
> you may need to turn to using some external tools.
> 
> something similar to this was discussed before and some tools suggested.
> 
> See: http://www.ruby-forum.com/topic/103374
> 
> On Wed, 2007-04-25 at 19:14 +0200, Dick Monahan wrote:
>> The documents we want to index come in many formats;  e.g., HTML, PDF,
>> RTF, Word, Excel, etc., etc., etc.  I've been searching to find parsers
>> that will translate each of these formats to indexable text, but have
>> had little success.  Any help will be appreciated.
