I've successfully installed ferret and acts_as_ferret and have no problem with UTF-8 for accented characters. It returns correct results for e.g. français. My problem is with non-Latin characters (Persian, in fact). I have tested different locales with no success, both on Debian and Mac. Any ideas?

(ferret 0.11.4, acts_as_ferret 0.4.0, rails 1.1.6)

--
Posted via http://www.ruby-forum.com/.
On 4/8/07, Reza Yeganeh <yeganeh.reza at gmail.com> wrote:
> I've successfully installed ferret and acts_as_ferret and have no
> problem with UTF-8 for accented characters. It returns correct results
> for e.g. français. My problem is with non-Latin characters (Persian,
> in fact). I have tested different locales with no success, both on
> Debian and Mac. Any ideas?
> (ferret 0.11.4, acts_as_ferret 0.4.0, rails 1.1.6)

Hi Reza,

I'm afraid I have no experience with Persian text. If you send me an example of some text I'll have a look and see what I can do.

Cheers,
Dave

--
Dave Balmain
http://www.davebalmain.com/
David Balmain wrote:
> I'm afraid I have no experience with Persian text. If you send me an
> example of some text I'll have a look and see what I can do.

Hi David,

This is not specific to Persian, as I tested with more languages (Hebrew, Japanese...). By the way, this is a Persian sample:

?????? ???? ??? ?????? ???. ??? ??? ????? ?? ?? ?????? ??????.

Thanks,
Reza
On 4/9/07, Reza Yeganeh <yeganeh.reza at gmail.com> wrote:
> David Balmain wrote:
> > I'm afraid I have no experience with Persian text. If you send me an
> > example of some text I'll have a look and see what I can do.
>
> Hi David,
> This is not specific to Persian, as I tested with more languages
> (Hebrew, Japanese...). By the way, this is a Persian sample:
> ?????? ???? ??? ?????? ???. ??? ??? ????? ?? ?? ?????? ?????.

Hi Reza,

Here is my test code:

  require 'rubygems'
  require 'ferret'

  text = "?????? ???? ??? ?????? ???. ??? ??? ????? ?? ?? ?????? ?????."

  include Ferret::Analysis

  tokenizer = StandardAnalyzer.new.token_stream(:field, text)
  while token = tokenizer.next
    puts token
  end

And this is what I got as the output:

  token["??????":0:12:1]
  token["????":13:21:1]
  token["???":22:28:1]
  token["??????":29:41:1]
  token["???":42:48:1]
  token["???":50:56:1]
  token["???":57:63:1]
  token["?????":64:74:1]
  token["??":75:79:1]
  token["??":80:84:1]
  token["??????":85:97:1]
  token["?????":98:108:1]

I guess this is probably the same as what you got, but I'm not exactly sure what is wrong with it. If you could explain what it should be doing, then I may be able to work out what is wrong.

Cheers,
Dave

--
Dave Balmain
http://www.davebalmain.com/
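[Editor's note: the offsets in Dave's output appear to be byte offsets rather than character offsets: a six-letter first token spanning positions 0 to 12 is consistent with Persian letters (in the U+0600-U+06FF block) taking two bytes each in UTF-8. A minimal plain-Ruby sketch of this, with no Ferret dependency; the Persian word below is an illustrative sample of mine, not text from the thread:]

```ruby
# -*- coding: utf-8 -*-
# Persian letters live in the U+0600-U+06FF block, which UTF-8 encodes
# as two bytes per letter. So a token's byte span is twice its length
# in characters, matching offsets like token[...:0:12:1] for 6 letters.
word = "دانشگاه"  # "university" in Persian: 7 letters

chars = word.chars.to_a   # Ruby 1.9+; on 1.8 use word.scan(/./u)
bytes = word.bytes.to_a

puts "characters: #{chars.length}"  # 7
puts "bytes:      #{bytes.length}"  # 14 (two bytes per letter)
```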
> tokenizer = StandardAnalyzer.new.token_stream(:field, text)

Thanks Dave, but StandardAnalyzer doesn't work for me with non-Latin text (the tokenizer returns nil). I tested with edge Ferret and tried different Ferret.locale settings. Can you guess what's wrong?

ruby 1.8.4 (2005-12-24) [powerpc-darwin8.6.0], powerpc-apple-darwin8-gcc-4.0.1

Best,
Reza
Phillip Oertel
2007-Apr-21 22:10 UTC
[Ferret-talk] Ferret and non latin characters support
I am seeing the same problem as Reza - tokenizer.next returns nil.

Another sample:

  text = "^?????????????????????, University of Cologne"

returns only:

  token["university":66:76:1]
  token["cologne":80:87:2]

ruby 1.8.5 (2006-12-25 patchlevel 12) [i686-darwin8.8.2]
ferret 0.11.4

Kind regards,
Phillip
Phillip Oertel
2007-Apr-21 22:17 UTC
[Ferret-talk] Ferret and non latin characters support
Same problem on our Debian servers :-(

* ruby 1.8.5 (2006-12-25 patchlevel 12) [i686-linux]
* Linux s15215947 2.6.16-rc6-060319a #1 SMP Sun Mar 19 16:28:15 CET 2006 i686 GNU/Linux

Kind regards,
Phillip
Julio Cesar Ody
2007-Apr-22 23:28 UTC
[Ferret-talk] Ferret and non latin characters support
Hey Phillip,

I've been through a similar situation recently, and I think the simplest way to make it work is to use a RegExpAnalyzer that takes every character as a token. Mind that this will have a negative impact on the quality of your search results. Try this:

__BEGIN__
#!/usr/bin/ruby
require 'rubygems'
require 'ferret'

include Ferret

analyzer = Analysis::RegExpAnalyzer.new(/./, false)
i = Index::Index.new(:analyzer => analyzer)
i << { :content => "^?????????????????????, University of Cologne" }

puts i.search('??')
puts i.search('University')
puts i.search('of')
__END__

On 4/22/07, Phillip Oertel <me at phillipoertel.com> wrote:
> i am seeing the same problem as reza - tokenizer.next returns nil.
>
> another sample
>
> text = "^?????????????????????, University of Cologne"
>
> returns only:
> token["university":66:76:1]
> token["cologne":80:87:2]
>
> ruby 1.8.5 (2006-12-25 patchlevel 12) [i686-darwin8.8.2]
> ferret 0.11.4
>
> kind regards,
> phillip

--
Julio C. Ody
http://rootshell.be/~julioody
> ... I think the simplest way to make it work is to use a RegExpAnalyzer
> that takes every character as a token.

David's code uses StandardAnalyzer. It's implemented in C and is fast and advanced. I don't want to reinvent the wheel (e.g. handling www.example.com, emails, punctuation, etc.). PerFieldAnalyzer is not a good solution for me either (I have mixed text). Persian is very similar to English in punctuation (it has some extra marks), word formation, and even stems.
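[Editor's note: a middle ground between StandardAnalyzer and a per-character RegExpAnalyzer is a word-level pattern that matches any run of letters or digits regardless of script. The sketch below is plain Ruby 1.9+, independent of Ferret; the `tokenize` helper and the mixed Persian/Latin sample string are mine, not from the thread. The same pattern could in principle be handed to Ferret's RegExpAnalyzer in place of `/./`.]

```ruby
# -*- coding: utf-8 -*-
# Sketch: script-agnostic word tokenizing with a Unicode-aware regexp.
# \p{L} matches a letter in ANY script (Latin, Persian, Hebrew, ...),
# so words are split on spaces and punctuation instead of per character.
def tokenize(text)
  text.scan(/[\p{L}\p{N}]+/)
end

# Hypothetical mixed Persian/Latin sample:
sample = "دانشگاه کلن, University of Cologne"
p tokenize(sample)
# => ["دانشگاه", "کلن", "University", "of", "Cologne"]
```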
Julio Cesar Ody
2007-Apr-22 23:58 UTC
[Ferret-talk] Ferret and non latin characters support
That's why it was mentioned as the simplest way, not the best way performance-wise. It's worth mentioning that I'm using RegExpAnalyzer to index information in an index hundreds of thousands of documents in size. I'm not hitting any ceilings in terms of memory usage or performance.

StandardAnalyzer relies on spaces to find tokens, also taking stop words and hyphens into consideration, right? Do correct me if I'm wrong. I don't know how Persian "works", but if you have any expression that's not space-separated, then unless you're fortunate enough that your users query for it in its entirety, they won't get any results back.

The best solution for the mixed-text scenario, as far as I can tell, is to have an analyzer that's complex enough to find out the language of every character/word, and to apply some sort of sub-analyzer for each language it finds. This might require many passes through the same string.

So to sum it up, it's not a matter of reinventing the wheel. It's a quick hack that will get you imprecise results sometimes, but will work with mixed text for sure, since your analyzer doesn't assume any "westernisms" to be there when tokenizing text.

On 4/23/07, Reza Yeganeh <yeganeh.reza at gmail.com> wrote:
> > ... I think the simplest way to make it work is to use a RegExpAnalyzer
> > that takes every character as a token.
>
> David's code uses StandardAnalyzer. It's implemented in C and is fast
> and advanced. I don't want to reinvent the wheel (e.g. handling
> www.example.com, emails, punctuation, etc.). PerFieldAnalyzer is not a
> good solution for me either (I have mixed text). Persian is very
> similar to English in punctuation (it has some extra marks), word
> formation, and even stems.

--
Julio C. Ody
http://rootshell.be/~julioody
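[Editor's note: the per-script routing Julio describes can be sketched in a few lines of plain Ruby: classify each token by the Unicode block of its first letter, so that a language-specific sub-analyzer (stemmer, stop filter) could be applied downstream. The range table and symbol names below are illustrative only, not part of Ferret's API.]

```ruby
# -*- coding: utf-8 -*-
# Sketch of per-script token routing. The Arabic block U+0600-U+06FF
# covers Persian letters as well; everything else is treated as Latin
# here for brevity (a real table would cover Hebrew, CJK, etc.).
ARABIC_RANGE = 0x0600..0x06FF

def script_of(token)
  cp = token.ord  # codepoint of the first character (Ruby 1.9+)
  ARABIC_RANGE.include?(cp) ? :perso_arabic : :latin
end

tokens = "دانشگاه کلن University Cologne".split
tokens.each { |t| puts "#{script_of(t)}: #{t}" }
# perso_arabic: دانشگاه
# perso_arabic: کلن
# latin: University
# latin: Cologne
```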
> So to sum it up, it's not a matter of reinventing the wheel. It's a
> quick hack that will get you imprecise results sometimes, but will
> work with mixed text for sure, since your analyzer doesn't assume any
> "westernisms" to be there when tokenizing text.

I think we're missing the point here. The problem is that David's code uses StandardAnalyzer and it works for him, but not for me and Phillip. I would have to write my own Analyzer, StemFilter and StopFilter for Persian. If StandardAnalyzer works for Persian (even partially), I won't have the extra overhead of using RegExpAnalyzer for common tokenizing of Persian and Latin content.