Hi,

I am trying to index a number of Spanish language text files, but a
large fraction of the files are generating errors like the
following...

  Error: exception 2 not handled: Error decoding input string. Check
  that you have the locale set correctly

However, it looks to me like my locale matches the file type. Running
the file command on the files returns

  $ file /media/.../raw/abc/20Jan2007_abc_001041_67.es
  /media/.../raw/abc/20Jan2007_abc_001041_67.es: UTF-8 Unicode text

and my locale is

  $ locale
  LANG=en_US.UTF-8
  LC_CTYPE="en_US.UTF-8"
  LC_NUMERIC="en_US.UTF-8"
  LC_TIME="en_US.UTF-8"
  LC_COLLATE="en_US.UTF-8"
  LC_MONETARY="en_US.UTF-8"
  LC_MESSAGES="en_US.UTF-8"
  LC_PAPER="en_US.UTF-8"
  LC_NAME="en_US.UTF-8"
  LC_ADDRESS="en_US.UTF-8"
  LC_TELEPHONE="en_US.UTF-8"
  LC_MEASUREMENT="en_US.UTF-8"
  LC_IDENTIFICATION="en_US.UTF-8"
  LC_ALL=

After enough of these errors are generated, I begin to get errors for
having too many open files, and the indexing fails.

  Error: exception 2 not handled: Too many open files

Any suggestions would be greatly appreciated.

Thanks,
Eric

Hi!

Are you *sure* this is all valid UTF-8? I don't know how the file
command determines this, or whether it is always right.
Maybe try playing around with iconv to ensure that whatever you send
to Ferret really is UTF-8.

Cheers,
Jens

On 19.05.2008, at 18:00, Eric Schulte wrote:
> Error: exception 2 not handled: Error decoding input string. Check
> that you have the locale set correctly
> [...]
> _______________________________________________
> Ferret-talk mailing list
> Ferret-talk at rubyforge.org
> http://rubyforge.org/mailman/listinfo/ferret-talk

--
Jens Krämer
Finkenlust 14, 06449 Aschersleben, Germany
VAT Id DE251962952
http://www.jkraemer.net/ - Blog
http://www.omdb.org/ - The new free film database
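
As a complement to the iconv check suggested above, the same validation can be done from inside Ruby, so that the bytes being tested are the ones the indexing script actually reads. This is only a minimal sketch: the helper names `valid_utf8?` and `invalid_utf8_paths` are made up for illustration, and it assumes Ruby 1.9 or later, where `String#valid_encoding?` is available.

```ruby
# Hypothetical helper (not part of Ferret): true if the given bytes
# form a well-formed UTF-8 sequence.
def valid_utf8?(bytes)
  # dup so the caller's string keeps its original encoding tag
  bytes.dup.force_encoding(Encoding::UTF_8).valid_encoding?
end

# Hypothetical helper: returns the paths whose contents are NOT valid UTF-8.
def invalid_utf8_paths(paths)
  # binread returns the raw bytes without any transcoding
  paths.reject { |path| valid_utf8?(File.binread(path)) }
end
```

Calling `invalid_utf8_paths(art_paths)` before indexing would list any files that need a closer look. An empty list would suggest the input is fine and that the problem lies elsewhere.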

Hi Jens,

Thanks for the reply!

I used iconv (thanks for the pointer, I had no idea this tool existed)
and was able to convert all of the articles to and from UTF-8 without
any errors being generated, so I am pretty sure that the input sources
are valid UTF-8.

I should mention that I am using an old version of Ferret, v0.9.6,
which is the last version to have a pure-Ruby implementation. I'm
using this version because I have added some changes which allow me
to specify the scoring algorithm used on a per-search basis. I
haven't, however, made any changes to the indexing portion of the
application.

I currently have an iconv script creating transliterated ASCII copies
of all my articles, so I am going to try to index those. I am also
thinking of trying to index using Lucene, since there is a chance that
the older version of Ferret is compatible with Lucene indexes.

If you have any other suggestions I'd love to hear them, but I
understand that I can't expect much help with such an old version.

Do you know of a way to specify custom scoring algorithms in the
current versions of Ferret?

Best,
Eric

On Monday, May 19, at 23:15, Jens Kraemer wrote:
> Are you *sure* this is all valid UTF8? I dont know how the file
> command determines this, and if it always is right.
> Maybe try to play around with iconv to ensure whatever you send to
> Ferret really is UTF8.
> [...]

--
schulte

Hi,

So I've tried switching to the latest version of Ferret (0.11.06), but
I am still getting the following errors.

,----
| Error: exception 2 not handled: Error decoding input string. Check that you have the locale set correctly
|     from spanish_indexer.rb:45
|     from spanish_indexer.rb:38:in `each'
|     from spanish_indexer.rb:38
`----

The articles are recognized as valid UTF-8 by iconv, and I believe my
locale is set properly.

,----
| LANG=en_US.UTF-8
| LC_CTYPE="en_US.UTF-8"
| LC_NUMERIC="en_US.UTF-8"
| LC_TIME="en_US.UTF-8"
| LC_COLLATE="en_US.UTF-8"
| LC_MONETARY="en_US.UTF-8"
| LC_MESSAGES="en_US.UTF-8"
| LC_PAPER="en_US.UTF-8"
| LC_NAME="en_US.UTF-8"
| LC_ADDRESS="en_US.UTF-8"
| LC_TELEPHONE="en_US.UTF-8"
| LC_MEASUREMENT="en_US.UTF-8"
| LC_IDENTIFICATION="en_US.UTF-8"
| LC_ALL=
`----

What's weird here is that the errors don't always happen on the same
articles. If I run the indexing three times, printing out the articles
that throw this error, I get a different list of articles each time.

In fact, I just changed my indexing script so that it keeps trying to
index failed articles

,----
| # ind is my index
| #
| # add_arts is a method which takes a list of articles, tries to
| # index them, and returns a list of the articles that
| # threw errors during indexing
| #
| puts art_paths.size.to_s + " articles"
| missed = add_arts(art_paths, ind)
| while missed.size > 0
|   missed = add_arts(missed, ind)
|   puts missed.size
| end
`----

and I was able to index all of the articles, with the following output

,----
| 5843 articles
| 34
| 16
| 10
| 9
| 7
| 7
| 6
| 1
| 0
`----

Any ideas what could be causing this non-deterministic behavior?

Thanks,
Eric

--
schulte
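
The retry loop above never terminates if some article fails on every pass. A bounded variant might look like the sketch below. The names `failed_items` and `retry_until_done` are hypothetical, the indexing step is passed in as a block rather than calling Ferret directly, and rescuing `StandardError` assumes the decoding error surfaces as an ordinary Ruby exception.

```ruby
# Hypothetical helper: runs the block once per item and returns the
# items for which the block raised.
def failed_items(items)
  items.select do |item|
    begin
      yield item
      false            # block succeeded, so drop the item from the retry list
    rescue StandardError
      true             # block raised, so keep the item for another pass
    end
  end
end

# Hypothetical helper: retries failing items for at most max_passes
# passes and returns whatever is still failing afterwards.
def retry_until_done(items, max_passes = 10, &work)
  missed = items
  max_passes.times do
    break if missed.empty?
    missed = failed_items(missed, &work)
  end
  missed
end
```

With the script above this could be called as `retry_until_done(art_paths) { |path| ... }`, where the block adds a single article to `ind`. If the block opens each file itself, the block form `File.open(path) { |f| ... }` closes the handle even when indexing raises. That might also help with the "Too many open files" error, though this is a guess rather than a diagnosis.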