Chris Gansen
2006-Nov-02 22:57 UTC
[Ferret-talk] Indexing and searching across multiple locales
Hi,

I'm currently investigating Ferret support for content that spans multiple locales. I am particularly interested in using stemming and fuzzy searches (e.g. with a slop factor) across multiple locales.

So far I've followed the online docs for implementing a Stemming Analyzer, and it is working just fine for English terms. I've also written a method to import data from the legacy XML files and save it as ActiveRecord objects (using AAF). However, I'm not certain that the locale switching is working properly:

    doc = Document.import_from_xml(filename)
    Ferret::locale = doc.locale_id  # locale_id is "en.UTF-8" or "fr.UTF-8", for example
    doc.save

What's the best way to handle the import of data where the locale changes from document to document? What other considerations should I keep in mind when using Ferret across multiple locales?

Thanks for any tips!

--chris
Andreas Korth
2006-Nov-03 21:21 UTC
[Ferret-talk] Indexing and searching across multiple locales
These are very good questions indeed. I'm afraid I don't have the answers, but I'd like to add some questions and remarks of my own and hope someone will eventually provide some insight.

On 02.11.2006, at 23:57, Chris Gansen wrote:
> I'm currently investigating support for Ferret and content that
> spans multiple locales. I am particularly interested in using
> stemming and fuzzy searches (e.g. with slop factor) across multiple
> locales.
>
> So far I've followed the online docs for implementing a Stemming
> Analyzer, and it is working for English terms just fine. I've also
> written a method to import data from the legacy XML files and save
> as ActiveRecord objects (using AAF). However, I'm not certain that
> the locale-switching is working properly:
>
> doc = Document.import_from_xml(filename)
> Ferret::locale = doc.locale_id # locale_id is "en.UTF-8" or "fr.UTF-8" for example
> doc.save

I don't think setting the locale has any effect on already created StemFilters and StopFilters, so the above code doesn't change anything. According to the docs, the locale setting doesn't even affect the default stop words or stemming algorithm used when creating a new StopFilter or StemFilter, respectively. The default language is English in both cases, no matter what the current locale is.

This leads me to the ultimate question: what is the locale setting good for anyway? Could it be that only the character encoding portion of the locale string is actually relevant?

> What's the best way to handle the import of data, where locale is
> changing from document to document? What other considerations
> should I keep in mind when using Ferret across multiple locales?

From what I have observed, you'll need to create different Analyzers, each with a StemFilter and StopFilter explicitly created for the respective locale. I don't know about French, but the German stemming algorithm is very inaccurate.
Stemming algorithms for the English language are probably easier to implement, since German and French have more complex rules and lots of exceptions. But even the English stemming algorithm seems to be entirely rule-based and thus fails on irregular verbs.

I think it might be a good idea to provide a facility to extend the stemmer, very much like the inflection rules can be extended in Rails.

Cheers,
Andy
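
A minimal sketch of the "one analyzer per locale" idea described above, assuming the language part of a locale string such as "fr.UTF-8" is what should select the analysis chain. It is plain Ruby and makes no Ferret calls: `ToyAnalyzer`, `ANALYZERS`, `language_of` and `analyzer_for` are hypothetical names, and the stop-word and suffix lists are toy stand-ins for what a language-specific StopFilter and StemFilter would do.

```ruby
# Toy stand-in for a language-specific analysis chain: lowercase the text,
# drop stop words, strip one known suffix. Not a real stemmer.
ToyAnalyzer = Struct.new(:stop_words, :suffixes) do
  def tokens(text)
    words = text.downcase.scan(/[[:alpha:]]+/) - stop_words
    words.map do |word|
      # Only strip a suffix if a stem of at least three letters remains.
      suffix = suffixes.find { |s| word.end_with?(s) && word.length > s.length + 2 }
      suffix ? word[0...-suffix.length] : word
    end
  end
end

# One analyzer per language, built once and reused (hypothetical registry).
ANALYZERS = {
  "en" => ToyAnalyzer.new(%w[the a of and], %w[ing ed s]),
  "fr" => ToyAnalyzer.new(%w[le la les de et], %w[s])
}.freeze

# "fr.UTF-8" -> "fr": the encoding part does not pick the language.
def language_of(locale_id)
  locale_id.split(".").first.downcase
end

# Look up the analyzer for a document's locale; fail loudly if none exists.
def analyzer_for(locale_id)
  ANALYZERS.fetch(language_of(locale_id)) do |lang|
    raise ArgumentError, "no analyzer configured for #{lang.inspect}"
  end
end
```

During an import, the analyzer would then be chosen per document, for example `analyzer_for(doc.locale_id).tokens(text)`, instead of relying on a global locale switch. Raising on an unknown language is a deliberate choice here, so that a document does not silently fall back to English rules.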
Chris Gansen
2006-Nov-03 23:18 UTC
[Ferret-talk] Indexing and searching across multiple locales
On 11/3/06, Andreas Korth <andreas.korth at gmx.net> wrote:
> These are very good questions indeed. I'm afraid I don't have the
> answers but I'd like to add some questions and remarks of my own and
> hope someone will eventually provide some insight.

Thanks for the response. I guess my real question is: how have other people handled indexing data across many locales? What works and what doesn't? From my initial work, basic indexing works across languages; however, it's the "fun" stuff like stemming and fuzzy searches that I am particularly interested in.

Any pointers are appreciated.

--chris
Benjamin Krause
2006-Nov-04 15:52 UTC
[Ferret-talk] Indexing and searching across multiple locales
Chris Gansen wrote:
> Thanks for the response. I guess my real question is: how have other
> people handled indexing data across many locales? What works and what
> doesn't? From my initial work, the basic indexing works across
> languages; however, it's the "fun" stuff like stemming and fuzzy
> searches that I am particularly interested in.

Hey Chris,

I store content in different languages in different fields. I have an object that has content in de/pl/en, and it has the fields content_de, content_en and content_pl. Now I can implement a per_field_analyzer to stem each field in its own locale.

This might not exactly match your example, as this is really one DB object with different translations attached to it, not different objects in different languages.

Ben
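
A rough sketch of the per-field approach described above, assuming each field name (content_de, content_en, content_pl) maps to its own analysis step and all other fields use a default. It is plain Ruby rather than Ferret code: `FieldAnalyzerMap`, `toy_analyzer` and `FIELD_ANALYZERS` are hypothetical names, and the toy analyzers only lowercase and drop a few stop words instead of stemming.

```ruby
# Hypothetical stand-in for a per-field analyzer: each field name maps to its
# own callable, with a default for fields that have no entry.
class FieldAnalyzerMap
  def initialize(default)
    @default = default
    @by_field = {}
  end

  # Register the analyzer to use for one field.
  def []=(field, analyzer)
    @by_field[field.to_sym] = analyzer
  end

  # Analyze text with the field's analyzer, or the default if none is set.
  def tokens(field, text)
    @by_field.fetch(field.to_sym, @default).call(text)
  end
end

# Toy analyzer: lowercase, split into words, drop that language's stop words.
def toy_analyzer(stop_words)
  ->(text) { text.downcase.scan(/[[:alpha:]]+/) - stop_words }
end

FIELD_ANALYZERS = FieldAnalyzerMap.new(toy_analyzer([]))
FIELD_ANALYZERS[:content_en] = toy_analyzer(%w[the a of])
FIELD_ANALYZERS[:content_de] = toy_analyzer(%w[der die das und])
FIELD_ANALYZERS[:content_pl] = toy_analyzer(%w[i w z])
```

With this setup, `FIELD_ANALYZERS.tokens(:content_de, "Der Hund und die Katze")` returns `["hund", "katze"]`, while a field without its own entry, such as `:title`, goes through the default and keeps every word. The same dispatch applies at query time: a search against content_de should be analyzed with the German chain so query terms match the indexed terms.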