Charlie
2006-Jul-07 02:45 UTC
[Ferret-talk] How to add Asia token analyzer to ferret simply?
Hi David,

Can you give me an example of how to add an analyzer to Ferret for Asian languages? My web application has to support multi-language search, which means, for example, that both Chinese and English will be searched through the same form. For now I have decided to use a simple tokenization principle: every Chinese character becomes its own token. This is not ideal in some cases, but the database columns to be full-text searched hold at most a few tens of UTF-8 characters, so I think it will work well enough. Thanks a lot!

David Balmain wrote:
> On 7/5/06, Charlie <yingfeng.zhang at gmail.com> wrote:
>> Is there any full-text search scheme that supports UTF-8, especially
>> for Asian languages such as Chinese, Japanese, etc.?
>> Ferret/acts_as_ferret does not work when keywords in these languages
>> are searched, and it is also difficult to implement pagination, which
>> needs both the count of search results and an offset.
>> Very grateful!
>
> Hi Charlie,
>
> Ferret will work fine on Asian languages. You just need to write your
> own Analyzer which matches tokens correctly for the language you are
> interested in. Have a look at the RegExpAnalyzer in Ferret. You can
> look at test/unit/analysis/ctc_analyzer.rb to see how it works.
>
> Cheers,
> Dave

--
Posted via http://www.ruby-forum.com/.
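The per-character approach Charlie describes maps directly onto the RegExpAnalyzer Dave mentions: a pattern of /./ makes every character its own token, which also works for CJK text that has no whitespace. A minimal sketch of inspecting the tokens it produces, assuming the TokenStream#next and Token#text accessors of the Ferret versions current at the time (the :content field name is just an example):

  require 'rubygems'
  require 'ferret'
  include Ferret::Analysis

  # One token per character: /./ matches any single character, so text
  # with no spaces between words is still split into tokens. The second
  # argument (false) turns off lowercasing.
  analyzer = RegExpAnalyzer.new(/./, false)

  # Print each token the analyzer produces for a sample string.
  stream = analyzer.token_stream(:content, "abc")
  while token = stream.next
    puts token.text      # => "a", "b", "c"
  end

Run with irb -KU (or $KCODE = 'u') so that /./ matches whole UTF-8 characters rather than single bytes.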
Charlie
2006-Jul-07 02:59 UTC
[Ferret-talk] How to add Asia token analyzer to ferret simply?
Also, the new Chinese analyzer needs to work together with the original standard analyzer.

--
Posted via http://www.ruby-forum.com/.
David Balmain
2006-Jul-07 12:15 UTC
[Ferret-talk] How to add Asia token analyzer to ferret simply?
On 7/7/06, Charlie <yingfeng.zhang at gmail.com> wrote:
> Also, the new Chinese analyzer needs to work together
> with the original standard analyzer.

I answered this on the rails list, but just in case:

  # Create a PerFieldAnalyzer (AKA PerFieldAnalyzerWrapper) which
  # defaults to StandardAnalyzer.
  analyzer = PerFieldAnalyzer.new(StandardAnalyzer.new)

  # Add a single-character analyzer for the "chinese" field, or
  # whatever field it is that holds Chinese text. This splits the
  # data into single-character tokens.
  analyzer["chinese"] = RegExpAnalyzer.new(/./, false)

There you have it. Pretty simple.

Cheers,
Dave
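A quick way to see the per-field behaviour is to run the same analyzer over two different field names and compare the tokens. This is only an illustrative sketch: the "title" field name is an example, and the TokenStream#next / Token#text accessors are assumed and may differ slightly between Ferret versions.

  require 'rubygems'
  require 'ferret'
  include Ferret::Analysis

  analyzer = PerFieldAnalyzer.new(StandardAnalyzer.new)
  analyzer["chinese"] = RegExpAnalyzer.new(/./, false)

  # Collect the tokens an analyzer produces for a given field and string.
  def tokens(analyzer, field, text)
    result = []
    stream = analyzer.token_stream(field, text)
    while token = stream.next
      result << token.text
    end
    result
  end

  p tokens(analyzer, "title",   "Hello World")  # StandardAnalyzer: roughly ["hello", "world"]
  p tokens(analyzer, "chinese", "abc")          # RegExpAnalyzer:   ["a", "b", "c"]

The "chinese" field goes through the single-character RegExpAnalyzer, while every other field falls back to the StandardAnalyzer default.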
Charlie
2006-Jul-08 07:05 UTC
[Ferret-talk] How to add Asia token analyzer to ferret simply?
David Balmain wrote:
> On 7/7/06, Charlie <yingfeng.zhang at gmail.com> wrote:
>> Also, the new Chinese analyzer needs to work together
>> with the original standard analyzer.
>
> I answered this on the rails list, but just in case:
>
>   # Create a PerFieldAnalyzer (AKA PerFieldAnalyzerWrapper) which
>   # defaults to StandardAnalyzer.
>   analyzer = PerFieldAnalyzer.new(StandardAnalyzer.new)
>
>   # Add a single-character analyzer for the "chinese" field, or
>   # whatever field it is that holds Chinese text. This splits the
>   # data into single-character tokens.
>   analyzer["chinese"] = RegExpAnalyzer.new(/./, false)

Thank you Dave. I looked up the API and found that PerFieldAnalyzerWrapper is useful for per-field analysis, especially for the corresponding SQL: select * from students where title like '%Charlie%' and location_id = 1, where the location_id = 1 part of the query can be handled through PerFieldAnalyzerWrapper.

I have just downloaded and read Lucene in Action, and Chapter 4 says that the StandardAnalyzer will also split CJK text into tokens even though there are no spaces between the characters; for example, '????' will be split into the tokens '?', '?', '?', '?', which is just what I want. But I still cannot get any results from Ferret. I use MySQL as the database with all encodings set to UTF-8, and all of my Rails sources are also saved as UTF-8, yet when I type the characters '????' into the search box I get zero results. Can you please help with this situation? Very grateful!

Best Regards,
Charlie

--
Posted via http://www.ruby-forum.com/.
David Balmain
2006-Jul-08 09:43 UTC
[Ferret-talk] How to add Asia token analyzer to ferret simply?
On 7/8/06, Charlie <yingfeng.zhang at gmail.com> wrote:
> Thank you Dave. I looked up the API and found that PerFieldAnalyzerWrapper
> is useful for per-field analysis, especially for the corresponding SQL:
> select * from students where title like '%Charlie%' and location_id = 1,
> where the location_id = 1 part of the query can be handled through
> PerFieldAnalyzerWrapper.
>
> I have just downloaded and read Lucene in Action, and Chapter 4 says that
> the StandardAnalyzer will also split CJK text into tokens even though
> there are no spaces between the characters; for example, '????' will be
> split into the tokens '?', '?', '?', '?', which is just what I want. But I
> still cannot get any results from Ferret. I use MySQL as the database with
> all encodings set to UTF-8, and all of my Rails sources are also saved as
> UTF-8, yet when I type the characters '????' into the search box I get
> zero results. Can you please help with this situation? Very grateful!

Hi Charlie,

The StandardAnalyzer in Ferret works a little differently from the StandardAnalyzer in Lucene. That's why you need to use the RegExpAnalyzer I gave you:

  analyzer = PerFieldAnalyzer.new(StandardAnalyzer.new)
  analyzer["chinese"] = RegExpAnalyzer.new(/./, false)

You also need to make sure that this is the analyzer being used by the query parser. If you are using the Index::Index class it will handle that for you. Try this in irb:

$ irb -KU
irb(main):001:0> require 'rubygems'
=> true
irb(main):002:0> require 'ferret'
=> true
irb(main):003:0> include Ferret::Index
=> Object
irb(main):004:0> include Ferret::Analysis
=> Object
irb(main):005:0> analyzer = PerFieldAnalyzer.new(StandardAnalyzer.new)
=> #<Ferret::Analysis::PerFieldAnalyzer:0xb7b2332c>
irb(main):006:0> analyzer["chinese"] = RegExpAnalyzer.new(/./, false)
=> #<Ferret::Analysis::RegExpAnalyzer:0xb7c8bdd4>
irb(main):007:0> index = Index.new(:analyzer => analyzer)
=> #<Ferret::Index::Index:0xb7bbda30>
irb(main):008:0> index << {:english => "the quick brown fox jumped over the lazy dog", :chinese => '????'}
=> #<Ferret::Index::Index:0xb7bbda30>
irb(main):009:0> index << {:chinese => "the quick brown fox jumped over the lazy dog", :english => '????'}
=> #<Ferret::Index::Index:0xb7bbda30>
irb(main):010:0> index.search_each("chinese:?") {|doc, score| puts "found in #{doc}"}
found in 0
=> 1
irb(main):011:0> index.search_each("english:?") {|doc, score| puts "found in #{doc}"}
=> 0
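On the pagination point raised at the start of the thread: the return value of search_each is the total number of hits (the "=> 1" and "=> 0" lines in the irb session above), and the search methods take options to skip a number of results and cap how many are yielded. A hedged sketch, continuing from the index built above and assuming the option names of the Ferret releases around this time (:first_doc and :num_docs; later versions renamed them to :offset and :limit, so check the Index#search_each documentation for your version):

  page     = 2
  per_page = 10

  # Yield only the hits for the requested page; the return value is still
  # the total number of matching documents, which is what a pager needs.
  total = index.search_each("chinese:?",
                            :first_doc => (page - 1) * per_page,
                            :num_docs  => per_page) do |doc, score|
    puts "hit: doc #{doc}, score #{score}"
  end

  puts "#{total} documents matched in total"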