Ryan King
2006-Aug-21 21:54 UTC
[Ferret-talk] indexing multiple languages with acts_as_ferret
I have an application where I'm indexing content in a number of languages, all encoded in UTF-8. I would think that having my locale set to en_US.utf8 would be sufficient to make this work in ferret/acts_as_ferret, but I keep running into problems, even with English text.

What have others done to cope with these encoding difficulties?

-ryan
Benjamin Krause
2006-Aug-22 07:50 UTC
[Ferret-talk] indexing multiple languages with acts_as_ferret
> I have an application where I'm indexing content in a number of languages, all encoded in UTF-8. I would think that having my locale set to en_US.utf8 would be sufficient to make this work in ferret/acts_as_ferret, but I keep running into problems, even with English text.
>
> What have others done to cope with these encoding difficulties?

hi..

I'm using ferret (not acts_as_ferret, but this shouldn't matter) to index content in German, English, Polish, Japanese, Chinese, French .. all in UTF-8, and I haven't had any problems with it yet :-) (using ferret 0.9.4 and 0.9.5)

Ben
David Balmain
2006-Aug-22 18:40 UTC
[Ferret-talk] indexing multiple languages with acts_as_ferret
On 8/22/06, Ryan King <ryansking at gmail.com> wrote:
> I have an application where I'm indexing content in a number of languages, all encoded in UTF-8. I would think that having my locale set to en_US.utf8 would be sufficient to make this work in ferret/acts_as_ferret, but I keep running into problems, even with English text.
>
> What have others done to cope with these encoding difficulties?
>
> -ryan

Hi Ryan,

Usually these problems stem from adding data that you think is UTF-8 but is actually ISO-8859-1. The best solution is to make sure all data added to Ferret really is UTF-8. This may require some data conversion. See the Iconv class in the standard library.

Ferret 0.10.0 is a little more lenient on encoding errors, i.e. it handles them silently. It is up to you to make sure it gets the correct encoding. If you pass in ISO-8859-1 when the locale is set to handle UTF-8, all non-ASCII characters will be treated as letters, which is often (but not always) what you want.

Cheers,
Dave
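To illustrate the conversion step Dave describes, here is a minimal sketch. It uses String#encode rather than Iconv, since Iconv was removed from Ruby's standard library in Ruby 2.0; on the Ruby versions current when this thread was written, Iconv would have been the tool. The helper name to_utf8 is hypothetical, and the sketch assumes that any input which is not valid UTF-8 is ISO-8859-1, which may not hold for your data.

```ruby
# Hypothetical helper (not part of Ferret or acts_as_ferret).
# Returns a UTF-8 string. Assumption: input that is not valid UTF-8
# is ISO-8859-1; other legacy encodings would be mis-decoded.
def to_utf8(raw)
  candidate = raw.dup.force_encoding(Encoding::UTF_8)
  return candidate if candidate.valid_encoding?

  # Reinterpret the bytes as ISO-8859-1, then transcode to UTF-8.
  raw.dup.force_encoding(Encoding::ISO_8859_1).encode(Encoding::UTF_8)
end

# Bytes for "café" in ISO-8859-1 (0xE9 is e-acute).
latin1_bytes = "caf\xE9".b
clean = to_utf8(latin1_bytes)
puts clean.valid_encoding?
```

Every byte sequence is valid ISO-8859-1, so the fallback never raises. That also means it cannot detect data in some third encoding, so it is worth spot-checking the converted text before indexing it.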
> hi..
>
> I'm using ferret (not acts_as_ferret, but this shouldn't matter) to index content in German, English, Polish, Japanese, Chinese, French .. all in UTF-8, and I haven't had any problems with it yet :-) (using ferret 0.9.4 and 0.9.5)
>
> Ben

Hi Ben,

Have you modified any of Ferret's code? I have also used Ferret to index CJK (Chinese, Korean, Japanese) text, all of it encoded in UTF-8, but I cannot get searches on it to work correctly.

Frank

--
Posted via http://www.ruby-forum.com/.
David Balmain
2006-Sep-19 00:41 UTC
[Ferret-talk] indexing multiple languages with acts_as_ferret
On 9/18/06, Frank <frankfan at 163.com> wrote:
> > hi..
> >
> > I'm using ferret (not acts_as_ferret, but this shouldn't matter) to index content in German, English, Polish, Japanese, Chinese, French .. all in UTF-8, and I haven't had any problems with it yet :-) (using ferret 0.9.4 and 0.9.5)
> >
> > Ben
>
> Hi Ben,
>
> Have you modified any of Ferret's code? I have also used Ferret to index CJK (Chinese, Korean, Japanese) text, all of it encoded in UTF-8, but I cannot get searches on it to work correctly.
>
> Frank

Hi Frank,

Someone else had this problem earlier. I think the Chinese characters were being escaped by the browser. Are you running your searches through a browser? If so, you may need to call CGI.unescape on the query string.

At any rate, the first thing I would check is the actual query string that you are passing to Ferret. Make sure it looks like you would expect it to and that it really is UTF-8, not some other Chinese character encoding.

Cheers,
Dave
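The check Dave suggests could look roughly like the sketch below. It calls CGI.unescape from Ruby's cgi library and String#valid_encoding?, which exists in Ruby 1.9 and later. The helper name decode_query is hypothetical, and whether your framework has already unescaped the parameter for you is something to verify first.

```ruby
require "cgi"

# Hypothetical helper (not part of Ferret): undo browser URL-escaping
# and refuse anything that is not valid UTF-8 before it reaches the index.
def decode_query(raw_param)
  query = CGI.unescape(raw_param).force_encoding(Encoding::UTF_8)
  unless query.valid_encoding?
    raise ArgumentError, "query is not valid UTF-8: #{query.inspect}"
  end
  query
end

# A browser sending UTF-8 escapes each byte of the Chinese text.
query = decode_query("%E4%B8%AD%E6%96%87")
puts query.bytes.length
```

If the helper raises, the browser or form is probably submitting a different encoding (for example GBK), and the query needs converting to UTF-8 before it will match UTF-8 terms in the index.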