thr3ads.net - Ferret talk - [Ferret-talk] strange matching: maybe a multilanguage collation problem? [Sep 2006]


Francis Hwang

2006-Sep-21 20:18 UTC

[Ferret-talk] strange matching: maybe a multilanguage collation problem?

Hi,

We're using Ferret in a slightly unorthodox way: We're indexing a
large (>100,000) list of names of places all around the world. Mostly
we're quite happy with it, and have been able to graft on our own
particular required functionality with just a little tweaking.

There's one strange problem, though: We've got a place in Cyprus
called "Gazima\304\237usa" (that \304\237 is a multibyte character in
UTF-8), and it matches a search for "usa". We'd rather it not match.
I don't know that much about Ferret or about this sort of indexing in
general, but is this because Ferret views \304\237 as a word break,
and splits the name into two words? If so, is there a way you'd
recommend to get around this -- keeping in mind that we've got names
in romanized forms of many different languages?

Thanks in advance,

Francis
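
The suspected mechanism can be reproduced outside Ferret. The sketch below is not Ferret's tokenizer; it is a plain-Ruby illustration, with hypothetical helper names, of how a tokenizer that only recognizes ASCII letters would split the name at the multibyte character, while a Unicode-aware one would keep it whole.

```ruby
# Illustration only: these helpers are hypothetical and are not part of Ferret.
NAME = "Gazima\304\237usa" # "\304\237" is the UTF-8 encoding of U+011F

# Treats only ASCII letters as word characters, so any other byte ends a token.
def ascii_tokens(text)
  text.b.scan(/[A-Za-z]+/).map(&:downcase)
end

# Treats any Unicode letter as a word character.
def unicode_tokens(text)
  text.dup.force_encoding('UTF-8').scan(/\p{L}+/).map(&:downcase)
end

p ascii_tokens(NAME)                   # => ["gazima", "usa"]
p unicode_tokens(NAME).include?('usa') # => false
```

With the ASCII-only rule, "usa" becomes a token of its own, which is consistent with the unwanted match described above.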

David Balmain

2006-Sep-22 02:20 UTC


Hi Francis,

It is because Ferret sees that as a word break. This must be either
because you are using an ASCII Analyzer (which I doubt) or your locale
isn't set to handle UTF-8. You can set your locale like this:

    ENV['LANG'] = 'en_US.utf8'

Or use whatever locale your data is stored as. Let me know if that helps.

Cheers,
Dave

PS: if not all your data is UTF-8 you may need to convert it. In that
case you should check out Ruby's iconv standard library.
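
For readers on current Ruby: iconv was removed from the standard library in Ruby 2.0, and the built-in String#encode covers the same conversion. The sketch below assumes, purely for illustration, that some names arrive as ISO-8859-9 (Latin-5) bytes; the thread does not say which encodings are involved, and to_utf8 is a hypothetical helper name.

```ruby
# Assumption: the input bytes are ISO-8859-9; substitute the real source encoding.
# to_utf8 is a hypothetical helper, not part of Ferret or the standard library.
def to_utf8(bytes, from_encoding)
  # Label the raw bytes with their actual encoding, then transcode to UTF-8.
  bytes.dup.force_encoding(from_encoding).encode('UTF-8')
end

latin5 = "Gazima\xF0usa".b # 0xF0 is U+011F in ISO-8859-9
utf8 = to_utf8(latin5, 'ISO-8859-9')

p utf8.encoding               # => #<Encoding:UTF-8>
p utf8.byteslice(6, 2).bytes  # => [196, 159], i.e. "\304\237"
```

Converting everything to one encoding before indexing keeps the indexed bytes consistent with whatever locale the index is told to use.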

Francis Hwang

2006-Sep-22 21:30 UTC


I tried that and it made no difference. The data is in UTF-8 already.
And as far as the analyzer, we're just using the StandardAnalyzer. (I
actually don't know much about what all the different analyzers do,
at any rate.) Any other ideas?

Francis

David Balmain

2006-Sep-23 04:56 UTC


Hi Francis,

I don't really have any other ideas. Did you re-index the data after
you set ENV["LANG"]? Could you try this code and tell me what you get:

    require 'rubygems'
    require 'ferret'
    p Ferret::VERSION # 0.10.6
    p Ferret::locale # "en_US.UTF-8"

    index = Ferret::I.new()

    index << {:place => "Gazima\304\237usa"}
    index << {:place => "U.S.A."}
    puts "Search: USA"
    index.search_each("USA") {|id, score| puts index[id][:place]}
    # Search: USA
    # U.S.A.

    puts "Search: Gazima\304\237usa"
    index.search_each("Gazima\304\237usa") {|id, score| puts index[id][:place]}
    # Search: Gazima?usa
    # Gazima?usa

Cheers,
Dave

Francis Hwang

2006-Sep-28 14:54 UTC


In the end, setting ENV['LANG'] didn't seem to have an effect, but
setting Ferret::locale directly seems to work:

    Ferret::locale = 'en_US.UTF-8'

Thanks!

Francis
