Charlie
2006-Jul-07 02:45 UTC
[Ferret-talk] How to add Asia token analyzer to ferret simply?
Hi David,

Can you give me an example of how to add an analyzer to Ferret for Asian languages? My web application has to support multi-language search, which means, for example, that both Chinese and English will be searched through the same form. For now I have decided to use a simple tokenization principle: every Chinese character becomes its own token. This is not ideal in some cases, but the database columns to be full-text searched hold at most a few tens of UTF-8 characters, so I think it will work well enough. Thanks a lot!

David Balmain wrote:
> On 7/5/06, Charlie <yingfeng.zhang at gmail.com> wrote:
>> Is there any full-text search scheme that supports UTF-8, especially
>> for Asian languages such as Chinese, Japanese, etc.?
>> Ferret/acts_as_ferret does not work when keywords in these languages
>> are searched, and it is also difficult to implement pagination, which
>> needs both the count of search results and an offset.
>> Very grateful!
>
> Hi Charlie,
>
> Ferret will work fine on Asian languages. You just need to write your
> own Analyzer which matches tokens correctly for the language you are
> interested in. Have a look at the RegExpAnalyzer in Ferret. You can
> look at test/unit/analysis/ctc_analyzer.rb to see how it works.
>
> Cheers,
> Dave

--
Posted via http://www.ruby-forum.com/.
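The per-character approach Charlie describes maps directly onto the RegExpAnalyzer Dave mentions: a pattern of /./ makes every character its own token, which also works for CJK text that has no whitespace. A minimal sketch of inspecting the tokens it produces, assuming the TokenStream#next and Token#text accessors of the Ferret versions current at the time (the :content field name is just an example):

  require 'rubygems'
  require 'ferret'
  include Ferret::Analysis

  # One token per character: /./ matches any single character, so text
  # with no spaces between words is still split into tokens. The second
  # argument (false) turns off lowercasing.
  analyzer = RegExpAnalyzer.new(/./, false)

  # Print each token the analyzer produces for a sample string.
  stream = analyzer.token_stream(:content, "abc")
  while token = stream.next
    puts token.text      # => "a", "b", "c"
  end

Run with irb -KU (or $KCODE = 'u') so that /./ matches whole UTF-8 characters rather than single bytes.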
Charlie
2006-Jul-07 02:59 UTC
[Ferret-talk] How to add Asia token analyzer to ferret simply?
Also, the new Chinese analyzer needs to work together with the original standard analyzer.

--
Posted via http://www.ruby-forum.com/.
David Balmain
2006-Jul-07 12:15 UTC
[Ferret-talk] How to add Asia token analyzer to ferret simply?
On 7/7/06, Charlie <yingfeng.zhang at gmail.com> wrote:
> Also, the new Chinese analyzer needs to work together
> with the original standard analyzer.

I answered this on the rails list, but just in case:

  # Create a PerFieldAnalyzer (AKA PerFieldAnalyzerWrapper) which
  # defaults to StandardAnalyzer.
  analyzer = PerFieldAnalyzer.new(StandardAnalyzer.new)

  # Add a single-character analyzer for the "chinese" field, or
  # whatever field it is that holds Chinese text. This splits the
  # data into single-character tokens.
  analyzer["chinese"] = RegExpAnalyzer.new(/./, false)

There you have it. Pretty simple.

Cheers,
Dave
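A quick way to see the per-field behaviour is to run the same analyzer over two different field names and compare the tokens. This is only an illustrative sketch: the "title" field name is an example, and the TokenStream#next / Token#text accessors are assumed and may differ slightly between Ferret versions.

  require 'rubygems'
  require 'ferret'
  include Ferret::Analysis

  analyzer = PerFieldAnalyzer.new(StandardAnalyzer.new)
  analyzer["chinese"] = RegExpAnalyzer.new(/./, false)

  # Collect the tokens an analyzer produces for a given field and string.
  def tokens(analyzer, field, text)
    result = []
    stream = analyzer.token_stream(field, text)
    while token = stream.next
      result << token.text
    end
    result
  end

  p tokens(analyzer, "title",   "Hello World")  # StandardAnalyzer: roughly ["hello", "world"]
  p tokens(analyzer, "chinese", "abc")          # RegExpAnalyzer:   ["a", "b", "c"]

The "chinese" field goes through the single-character RegExpAnalyzer, while every other field falls back to the StandardAnalyzer default.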
Charlie
2006-Jul-08 07:05 UTC
[Ferret-talk] How to add Asia token analyzer to ferret simply?
David Balmain wrote:
> On 7/7/06, Charlie <yingfeng.zhang at gmail.com> wrote:
>> Also, the new Chinese analyzer needs to work together
>> with the original standard analyzer.
>
> I answered this on the rails list, but just in case:
>
>   # Create a PerFieldAnalyzer (AKA PerFieldAnalyzerWrapper) which
>   # defaults to StandardAnalyzer.
>   analyzer = PerFieldAnalyzer.new(StandardAnalyzer.new)
>
>   # Add a single-character analyzer for the "chinese" field, or
>   # whatever field it is that holds Chinese text. This splits the
>   # data into single-character tokens.
>   analyzer["chinese"] = RegExpAnalyzer.new(/./, false)

Thank you Dave. I looked up the API and found that PerFieldAnalyzerWrapper is useful for per-field analysis, especially for the corresponding SQL: select * from students where title like '%Charlie%' and location_id = 1, where the location_id = 1 part of the query can be handled through PerFieldAnalyzerWrapper.

I have just downloaded and read Lucene in Action, and Chapter 4 says that the StandardAnalyzer will also split CJK text into tokens even though there are no spaces between the characters; for example, '????' will be split into the tokens '?', '?', '?', '?', which is just what I want. But I still cannot get any results from Ferret. I use MySQL as the database with all encodings set to UTF-8, and all of my Rails sources are also saved as UTF-8, yet when I type the characters '????' into the search box I get zero results. Can you please help with this situation? Very grateful!

Best Regards,
Charlie

--
Posted via http://www.ruby-forum.com/.
David Balmain
2006-Jul-08 09:43 UTC
[Ferret-talk] How to add Asia token analyzer to ferret simply?
On 7/8/06, Charlie <yingfeng.zhang at gmail.com> wrote:
> Thank you Dave. I looked up the API and found that PerFieldAnalyzerWrapper
> is useful for per-field analysis, especially for the corresponding SQL:
> select * from students where title like '%Charlie%' and location_id = 1,
> where the location_id = 1 part of the query can be handled through
> PerFieldAnalyzerWrapper.
>
> I have just downloaded and read Lucene in Action, and Chapter 4 says that
> the StandardAnalyzer will also split CJK text into tokens even though
> there are no spaces between the characters; for example, '????' will be
> split into the tokens '?', '?', '?', '?', which is just what I want. But I
> still cannot get any results from Ferret. I use MySQL as the database with
> all encodings set to UTF-8, and all of my Rails sources are also saved as
> UTF-8, yet when I type the characters '????' into the search box I get
> zero results. Can you please help with this situation? Very grateful!

Hi Charlie,

The StandardAnalyzer in Ferret works a little differently from the StandardAnalyzer in Lucene. That's why you need to use the RegExpAnalyzer I gave you:

  analyzer = PerFieldAnalyzer.new(StandardAnalyzer.new)
  analyzer["chinese"] = RegExpAnalyzer.new(/./, false)

You also need to make sure that this is the analyzer being used by the query parser. If you are using the Index::Index class it will handle that for you. Try this in irb:

$ irb -KU
irb(main):001:0> require 'rubygems'
=> true
irb(main):002:0> require 'ferret'
=> true
irb(main):003:0> include Ferret::Index
=> Object
irb(main):004:0> include Ferret::Analysis
=> Object
irb(main):005:0> analyzer = PerFieldAnalyzer.new(StandardAnalyzer.new)
=> #<Ferret::Analysis::PerFieldAnalyzer:0xb7b2332c>
irb(main):006:0> analyzer["chinese"] = RegExpAnalyzer.new(/./, false)
=> #<Ferret::Analysis::RegExpAnalyzer:0xb7c8bdd4>
irb(main):007:0> index = Index.new(:analyzer => analyzer)
=> #<Ferret::Index::Index:0xb7bbda30>
irb(main):008:0> index << {:english => "the quick brown fox jumped over the lazy dog", :chinese => '????'}
=> #<Ferret::Index::Index:0xb7bbda30>
irb(main):009:0> index << {:chinese => "the quick brown fox jumped over the lazy dog", :english => '????'}
=> #<Ferret::Index::Index:0xb7bbda30>
irb(main):010:0> index.search_each("chinese:?") {|doc, score| puts "found in #{doc}"}
found in 0
=> 1
irb(main):011:0> index.search_each("english:?") {|doc, score| puts "found in #{doc}"}
=> 0
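On the pagination point raised at the start of the thread: the return value of search_each is the total number of hits (the "=> 1" and "=> 0" lines in the irb session above), and the search methods take options to skip a number of results and cap how many are yielded. A hedged sketch, continuing from the index built above and assuming the option names of the Ferret releases around this time (:first_doc and :num_docs; later versions renamed them to :offset and :limit, so check the Index#search_each documentation for your version):

  page     = 2
  per_page = 10

  # Yield only the hits for the requested page; the return value is still
  # the total number of matching documents, which is what a pager needs.
  total = index.search_each("chinese:?",
                            :first_doc => (page - 1) * per_page,
                            :num_docs  => per_page) do |doc, score|
    puts "hit: doc #{doc}, score #{score}"
  end

  puts "#{total} documents matched in total"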