Francis Hwang
2006-Sep-21 20:18 UTC
[Ferret-talk] strange matching: maybe a multilanguage collation problem?
Hi,

We're using Ferret in a slightly unorthodox way: We're indexing a large (>100,000) list of names of places all around the world. Mostly we're quite happy with it, and have been able to graft on our own particular required functionality with just a little tweaking.

There's one strange problem, though: We've got a place in Cyprus called "Gazima\304\237usa" (that \304\237 is a multibyte character in UTF-8), and it matches a search for "usa". We'd rather it not match. I don't know that much about Ferret or about this sort of indexing in general, but is this because Ferret views \304\237 as a word break, and splits the name into two words? If so, is there a way you'd recommend to get around this -- keeping in mind that we've got names in romanized forms of many different languages?

Thanks in advance,

Francis
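To make the suspected failure mode concrete, here is a minimal sketch in plain Ruby of how a byte-oriented, ASCII-only tokenizer would split this name, compared with a Unicode-aware one. The two methods `ascii_tokens` and `unicode_tokens` are hypothetical stand-ins written for illustration; they are not Ferret's analyzers and say nothing about how Ferret is implemented internally.

```ruby
# Illustration only: these regexes stand in for an ASCII-only tokenizer and a
# Unicode-aware one. They are NOT Ferret's actual analyzers.

# U+011F is the character whose UTF-8 bytes are \304\237 (0xC4 0x9F).
name = "Gazima\u011Fusa"

# Hypothetical ASCII-only tokenizer: it looks at raw bytes, so the two bytes
# of the multibyte character count as non-letters and split the word.
def ascii_tokens(text)
  text.b.scan(/[A-Za-z]+/).map(&:downcase)
end

# Hypothetical Unicode-aware tokenizer: \p{L} matches any Unicode letter,
# so the name stays in one piece.
def unicode_tokens(text)
  text.scan(/\p{L}+/).map(&:downcase)
end

p ascii_tokens(name)    # => ["gazima", "usa"]
p unicode_tokens(name)  # a single token containing the whole name
```

With the first tokenizer, "usa" ends up as a term of its own, which is exactly the kind of split that would make a search for "usa" match this place name.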
David Balmain
2006-Sep-22 02:20 UTC
[Ferret-talk] strange matching: maybe a multilanguage collation problem?
Hi Francis,

It is because Ferret sees that as a word break. This must be either because you are using an ASCII analyzer (which I doubt) or because your locale isn't set to handle UTF-8. You can set your locale like this:

    ENV['LANG'] = 'en_US.utf8'

Or use whatever locale your data is stored in. Let me know if that helps.

Cheers,
Dave

PS: If not all of your data is UTF-8, you may need to convert it. In that case, check out Ruby's iconv standard library.
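A note on the conversion suggested in the PS: iconv was removed from Ruby's standard library in Ruby 2.0, and on current Ruby versions the built-in String#encode covers the same ground. The following is a minimal sketch of normalizing a string to UTF-8 before indexing. The helper name `to_utf8` and the choice of ISO-8859-9 (Turkish Latin-5) as the legacy encoding are assumptions made for illustration, not anything stated in this thread.

```ruby
# Sketch: normalize a string to UTF-8 before indexing.
# `to_utf8` is a hypothetical helper; the source encoding is an assumption
# that the caller has to supply based on where the data came from.
def to_utf8(raw, source_encoding)
  # If the bytes are already valid UTF-8, keep them as they are.
  utf8 = raw.dup.force_encoding(Encoding::UTF_8)
  return utf8 if utf8.valid_encoding?

  # Otherwise reinterpret the bytes in the assumed legacy encoding
  # and transcode them to UTF-8.
  raw.dup.force_encoding(source_encoding).encode(Encoding::UTF_8)
end

# In ISO-8859-9 the byte 0xF0 is the same letter that UTF-8 encodes
# as the two bytes \304\237.
legacy = "Gazima\xF0usa".b
converted = to_utf8(legacy, "ISO-8859-9")
p converted.encoding          # => #<Encoding:UTF-8>
p converted.valid_encoding?   # => true
```

The validity check is only a heuristic: some legacy-encoded byte sequences happen to be valid UTF-8 as well, so knowing the real source encoding of each data set is more reliable than guessing.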
Francis Hwang
2006-Sep-22 21:30 UTC
[Ferret-talk] strange matching: maybe a multilanguage collation problem?
I tried that and it made no difference. The data is in UTF-8 already. As for the analyzer, we're just using the StandardAnalyzer. (I actually don't know much about what the different analyzers do, at any rate.) Any other ideas?

Francis
David Balmain
2006-Sep-23 04:56 UTC
[Ferret-talk] strange matching: maybe a multilanguage collation problem?
Hi Francis,

I don't really have any other ideas. Did you re-index the data after you set ENV["LANG"]? Could you try this code and tell me what you get:

    require 'rubygems'
    require 'ferret'
    p Ferret::VERSION   # 0.10.6
    p Ferret::locale    # "en_US.UTF-8"

    index = Ferret::I.new()

    index << {:place => "Gazima\304\237usa"}
    index << {:place => "U.S.A."}

    puts "Search: USA"
    index.search_each("USA") {|id, score| puts index[id][:place]}
    # Search: USA
    # U.S.A.

    puts "Search: Gazima\304\237usa"
    index.search_each("Gazima\304\237usa") {|id, score| puts index[id][:place]}
    # Search: Gazima?usa
    # Gazima?usa

Cheers,
Dave
Francis Hwang
2006-Sep-28 14:54 UTC
[Ferret-talk] strange matching: maybe a multilanguage collation problem?
In the end, setting ENV['LANG'] didn't seem to have an effect, but setting Ferret::locale directly seems to work:

    Ferret::locale = 'en_US.UTF-8'

Thanks!

Francis