thr3ads.net - Ferret talk - [Ferret-talk] Error decoding input string. [May 2008]

Eric Schulte

2008-May-19 16:00 UTC

[Ferret-talk] Error decoding input string.

Hi,

I am trying to index a number of Spanish language text files, but a
large fraction of the files are generating errors like the
following...

Error: exception 2 not handled: Error decoding input string. Check that you have
the locale set correctly

however it looks to me like my locale matches the file type.  Running
the file command on the files returns

$ file /media/.../raw/abc/20Jan2007_abc_001041_67.es
/media/.../raw/abc/20Jan2007_abc_001041_67.es: UTF-8 Unicode text

and my locale is

$ locale
LANG=en_US.UTF-8
LC_CTYPE="en_US.UTF-8"
LC_NUMERIC="en_US.UTF-8"
LC_TIME="en_US.UTF-8"
LC_COLLATE="en_US.UTF-8"
LC_MONETARY="en_US.UTF-8"
LC_MESSAGES="en_US.UTF-8"
LC_PAPER="en_US.UTF-8"
LC_NAME="en_US.UTF-8"
LC_ADDRESS="en_US.UTF-8"
LC_TELEPHONE="en_US.UTF-8"
LC_MEASUREMENT="en_US.UTF-8"
LC_IDENTIFICATION="en_US.UTF-8"
LC_ALL=

after enough of these errors are generated, I begin to get errors for
having too many open files, and the indexing fails.

Error: exception 2 not handled: Too many open files
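One possible cause of the "Too many open files" error, which is an assumption and not something confirmed in this thread, is that a file handle opened for an article stays open when indexing raises. A minimal sketch of reading articles so that handles are always closed follows; `each_article_text` is a hypothetical helper, and it assumes the failure surfaces as a `StandardError`.

```ruby
# Hypothetical helper: reads each article and hands its path and text to
# the given block. Returns the paths whose processing raised an error.
def each_article_text(paths)
  failed = []
  paths.each do |path|
    begin
      # The block form of File.open closes the file when the block exits,
      # including when an exception is raised inside it.
      File.open(path, "rb") { |file| yield path, file.read }
    rescue StandardError
      # Assumption: the indexing failure is a StandardError.
      failed << path
    end
  end
  failed
end
```

The indexing call would go inside the block passed to `each_article_text`, so a decoding error on one article cannot leave its descriptor open.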

Any suggestions would be greatly appreciated.

Thanks,
Eric

Jens Kraemer

2008-May-19 21:15 UTC

Hi!

Are you *sure* this is all valid UTF-8? I don't know how the file
command determines this, or whether it is always right.
Maybe try to play around with iconv to ensure whatever you send to  
Ferret really is UTF8.
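
As an alternative to iconv, the same check can be sketched in Ruby itself. This assumes Ruby 1.9 or later (newer than the Ruby current when this thread was written); `valid_utf8?` and `invalid_utf8_files` are hypothetical helper names, not part of Ferret.

```ruby
# Hypothetical helper: reports whether a byte string is well-formed UTF-8.
def valid_utf8?(bytes)
  # dup so the caller's string keeps its original encoding tag.
  bytes.dup.force_encoding(Encoding::UTF_8).valid_encoding?
end

# Hypothetical helper: returns the paths whose contents are not valid UTF-8.
def invalid_utf8_files(paths)
  paths.reject { |path| valid_utf8?(File.binread(path)) }
end
```

Calling `invalid_utf8_files` on the list of article paths gives the files worth inspecting by hand before they are sent to Ferret.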

Cheers,
Jens

--
Jens Krämer
Finkenlust 14, 06449 Aschersleben, Germany
VAT Id DE251962952
http://www.jkraemer.net/ - Blog
http://www.omdb.org/     - The new free film database

Eric Schulte

2008-May-20 18:00 UTC

Hi Jens,

Thanks for the reply!

I used iconv (thanks for the pointer, I had no idea this tool existed)
and was able to convert all of the articles to and from utf8 without
any errors being generated, so I am pretty sure that the input sources
are valid utf8.

I should mention that I am using an old version of Ferret, v0.9.6,
which is the last version to have a pure-Ruby implementation.  I'm
using this version because I have added some changes which allow me
to specify the scoring algorithm used on a per-search basis.  I
haven't, however, made any changes to the indexing portion of the
application.

I currently have an iconv script creating transliterated ASCII copies of
all my articles, so I am going to try indexing those.  I am also
thinking of trying to index using Lucene, since there is a chance that
the older version of Ferret is compatible with Lucene indexes.

If you have any other suggestions I'd love to hear them, but I
understand that I can't expect much help with such an old version.  Do
you know of a way to specify custom scoring algorithms in the current
versions of Ferret?

Best,
Eric

-- 
schulte

Eric Schulte

2008-May-29 16:44 UTC

Hi,

So I've tried switching to the latest version of Ferret (0.11.06), but
I am still getting the following errors.

,----
| Error: exception 2 not handled: Error decoding input string. Check that you
| have the locale set correctly
| 	from spanish_indexer.rb:45
| 	from spanish_indexer.rb:38:in `each'
| 	from spanish_indexer.rb:38
`----

The articles are recognized as valid utf8 using iconv, and I believe
my locale is set properly

,----
| LANG=en_US.UTF-8
| LC_CTYPE="en_US.UTF-8"
| LC_NUMERIC="en_US.UTF-8"
| LC_TIME="en_US.UTF-8"
| LC_COLLATE="en_US.UTF-8"
| LC_MONETARY="en_US.UTF-8"
| LC_MESSAGES="en_US.UTF-8"
| LC_PAPER="en_US.UTF-8"
| LC_NAME="en_US.UTF-8"
| LC_ADDRESS="en_US.UTF-8"
| LC_TELEPHONE="en_US.UTF-8"
| LC_MEASUREMENT="en_US.UTF-8"
| LC_IDENTIFICATION="en_US.UTF-8"
| LC_ALL=
`----

What's weird here is that the errors don't always happen on the same
articles. If I run the indexing three times, printing out the
articles that throw this error, I get a different list of articles
each time.

In fact I just changed my indexing script so that it keeps trying to
index failed articles

,----
| # ind      is my index
| # 
| # add_arts is a method which takes a list of articles, tries to
| #          index them, and returns a list of the articles that
| #          threw errors during indexing
| # 
| puts art_paths.size.to_s + " articles"
| missed = add_arts(art_paths, ind)
| while missed.size > 0
|   missed = add_arts(missed, ind)
|   puts missed.size
| end
`----
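
The body of add_arts is not shown in this message. A minimal sketch of what it might look like follows, together with a retry loop that stops after a fixed number of passes, since the `while` loop above never terminates if some article fails on every attempt. The field names `:path` and `:content`, the helper `index_with_retries`, and the `max_passes` default are all hypothetical; the only thing assumed of `index` is that it accepts documents via `<<`.

```ruby
# Hypothetical version of add_arts: tries to index each article and
# returns the paths that raised an error during indexing.
def add_arts(paths, index)
  paths.reject do |path|
    begin
      # Field names :path and :content are illustrative only.
      index << { :path => path, :content => File.read(path) }
      true   # indexed, so drop it from the missed list
    rescue StandardError
      false  # failed, so keep it in the missed list
    end
  end
end

# Hypothetical helper: retries failed articles for at most max_passes
# passes and returns whatever is still failing at the end.
def index_with_retries(paths, index, max_passes = 10)
  missed = paths
  max_passes.times do
    break if missed.empty?
    missed = add_arts(missed, index)
  end
  missed
end
```

Anything left in the list returned by `index_with_retries` failed on every pass, which would point at the article itself and not at an intermittent problem.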

and I was able to index all of the articles with the following output

,----
| 5843 articles
| 34
| 16
| 10
| 9
| 7
| 7
| 6
| 1
| 0
`----

Any ideas what could be causing this nondeterministic behavior?

Thanks,
Eric

-- 
schulte
