Lucene's standard analyzer splits words separated by underscores. Ferret
doesn't do this. For example, if I create an index with only the document
'test_case' and search for 'case', it doesn't find anything. Lucene, on
the other hand, finds it. The same goes for words separated by colons.

Which analyzer should I use to emulate Lucene's StandardAnalyzer behavior?

Thanks,
Kent
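P.S. In case it helps, here is roughly what I'm running (a minimal
script against the default analyzer):

  require 'rubygems'
  require 'ferret'

  index = Ferret::I.new                 # default analyzer
  index << "test_case"
  puts index.search("case").total_hits  # prints 0 here; Lucene finds the doc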
On 9/6/06, Kent Sibilev <ksibilev at yahoo.com> wrote:
> Lucene's standard analyzer splits words separated by underscores.
> Ferret doesn't do this. For example, if I create an index with only
> the document 'test_case' and search for 'case', it doesn't find
> anything. Lucene, on the other hand, finds it. The same goes for
> words separated by colons.
>
> Which analyzer should I use to emulate Lucene's StandardAnalyzer
> behavior?
>
> Thanks,
> Kent

Hi Kent,

No analyzer currently emulates Lucene's StandardAnalyzer exactly.
You'd have to port it to Ruby, which shouldn't be too hard if you know
how to use racc. But it sounds to me like you don't need anything so
complex. If you are indexing code you might want to try using the
AsciiLetterAnalyzer. Or you could use the RegExpAnalyzer and describe
your tokens with a Ruby Regexp. Something like this:

  include Ferret
  include Ferret::Analysis

  index = I.new(:analyzer => RegExpAnalyzer.new(/[A-Za-z0-9]+/))

  # or, if you want case-sensitive searches:
  index = I.new(:analyzer => RegExpAnalyzer.new(/[A-Za-z0-9]+/, false))

Hope that helps,
Dave
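P.S. If you want to see exactly what tokens an analyzer produces, you
can walk the token stream yourself. Something like this should do it
(untested; token_stream and Token#text as I remember the 0.10 API):

  include Ferret::Analysis

  analyzer = RegExpAnalyzer.new(/[A-Za-z0-9]+/)
  ts = analyzer.token_stream(:content, "test_case")
  while token = ts.next
    puts token.text   # "test", then "case"
  end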
David Balmain wrote:
> On 9/6/06, Kent Sibilev <ksibilev at yahoo.com> wrote:
> No analyzer currently emulates Lucene's StandardAnalyzer exactly.
> You'd have to port it to Ruby, which shouldn't be too hard if you
> know how to use racc. But it sounds to me like you don't need
> anything so complex. If you are indexing code you might want to try
> using the AsciiLetterAnalyzer.

No, it doesn't do what I want. Looking at the code I'm slightly
confused. The criterion is that if isalpha returns 0 then we have
reached the end of a token. Does that mean the '_' character is
considered alphanumeric?

> Or you could use the RegExpAnalyzer and describe your tokens with a
> Ruby Regexp. Something like this:
>
>   include Ferret
>   include Ferret::Analysis
>
>   index = I.new(:analyzer => RegExpAnalyzer.new(/[A-Za-z0-9]+/))
>
>   # or, if you want case-sensitive searches:
>   index = I.new(:analyzer => RegExpAnalyzer.new(/[A-Za-z0-9]+/, false))

It would be great if this code worked, but it segfaulted on me. I've
glanced at the code and noticed that for this type of stream

  typedef struct RegExpTokenStream {
      CachedTokenStream super;
      VALUE rtext;
      VALUE regex;
      VALUE proc;
      int curr_ind;
  } RegExpTokenStream;

you initialize three VALUE objects but never mark them for the garbage
collector, so eventually they are freed behind my back. What you
should do is keep the type of the stream in the TokenStream structure
and rework the frt_ts_mark method.

Hope that helps,
Kent
On 9/6/06, Kent Sibilev <ksibilev at yahoo.com> wrote:
> David Balmain wrote:
> > On 9/6/06, Kent Sibilev <ksibilev at yahoo.com> wrote:
> > No analyzer currently emulates Lucene's StandardAnalyzer exactly.
> > You'd have to port it to Ruby, which shouldn't be too hard if you
> > know how to use racc. But it sounds to me like you don't need
> > anything so complex. If you are indexing code you might want to
> > try using the AsciiLetterAnalyzer.
>
> No, it doesn't do what I want. Looking at the code I'm slightly
> confused. The criterion is that if isalpha returns 0 then we have
> reached the end of a token. Does that mean the '_' character is
> considered alphanumeric?

irb(main):001:0> require 'rubygems'
irb(main):002:0> require 'ferret'
irb(main):004:0> i = Ferret::I.new(:analyzer => Ferret::Analysis::AsciiLetterAnalyzer.new)
irb(main):005:0> i << "test_case"
irb(main):006:0> i.search("case")
=> #<struct Ferret::Search::TopDocs total_hits=1, hits=[#<struct
Ferret::Search::Hit doc=0, score=0.191783010959625>],
max_score=0.191783010959625>
irb(main):007:0>

So no, '_' is not considered alphanumeric (or in this case alpha, as
the AsciiLetterAnalyzer won't match numbers).

> > Or you could use the RegExpAnalyzer and describe your tokens with
> > a Ruby Regexp. Something like this:
> >
> >   include Ferret
> >   include Ferret::Analysis
> >
> >   index = I.new(:analyzer => RegExpAnalyzer.new(/[A-Za-z0-9]+/))
> >
> >   # or, if you want case-sensitive searches:
> >   index = I.new(:analyzer => RegExpAnalyzer.new(/[A-Za-z0-9]+/, false))
>
> It would be great if this code worked, but it segfaulted on me. I've
> glanced at the code and noticed that for this type of stream
>
>   typedef struct RegExpTokenStream {
>       CachedTokenStream super;
>       VALUE rtext;
>       VALUE regex;
>       VALUE proc;
>       int curr_ind;
>   } RegExpTokenStream;
>
> you initialize three VALUE objects but never mark them for the
> garbage collector, so eventually they are freed behind my back. What
> you should do is keep the type of the stream in the TokenStream
> structure and rework the frt_ts_mark method.
>
> Hope that helps,
> Kent

Actually, frt_rets_mark already marks the three VALUE objects
correctly. What would really help would be if you could give me an
example script that segfaults. If you can do this, I'll fix it and get
a new gem out as soon as possible.

Cheers,
Dave
On 9/6/06, David Balmain <dbalmain.ml at gmail.com> wrote:
> correctly. What would really help would be if you could give me an
> example script that segfaults. If you can do this, I'll fix it and
> get a new gem out as soon as possible.

Actually, hold off on that, I think I've found the problem.
On 9/6/06, David Balmain <dbalmain.ml at gmail.com> wrote:
> On 9/6/06, David Balmain <dbalmain.ml at gmail.com> wrote:
> > correctly. What would really help would be if you could give me an
> > example script that segfaults. If you can do this, I'll fix it and
> > get a new gem out as soon as possible.
>
> Actually, hold off on that, I think I've found the problem.

Hi Kent,

I've put in a fix which I think should cure your segfault.
Unfortunately I can't seem to replicate the bug here to test it. Even
calling GC.start doesn't seem to collect any of the three VALUEs in
RegExpTokenStream. I've had problems like this before when trying to
test an implementation of a weak-key Hash. I really need to look into
how the Ruby garbage collector works; it never seems to behave
predictably for me.

Anyway, I was hoping you could help me out, either by testing your
code against the latest version of Ferret in Subversion or by sending
me a short (or long, I don't really care) script which causes the
problem. If it'll make it any easier I can email you a gem of the
current working version of Ferret.

Cheers,
Dave
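P.S. Something along these lines is the kind of script I'm after. This
is only my guess at a reproduction (the GC.start calls are there to
try to collect the analyzer's VALUEs mid-indexing):

  require 'rubygems'
  require 'ferret'
  include Ferret
  include Ferret::Analysis

  index = I.new(:analyzer => RegExpAnalyzer.new(/[A-Za-z0-9]+/))
  1000.times do |n|
    index << "test_case_#{n}"
    GC.start if n % 100 == 0
  end
  puts index.search("case").total_hits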
David Balmain wrote:
> On 9/6/06, Kent Sibilev <ksibilev at yahoo.com> wrote:
>> end of a token. Does that mean the '_' character is considered
>> alphanumeric?
>
> irb(main):001:0> require 'rubygems'
> irb(main):002:0> require 'ferret'
> irb(main):004:0> i = Ferret::I.new(:analyzer =>
> Ferret::Analysis::AsciiLetterAnalyzer.new)
> irb(main):005:0> i << "test_case"
> irb(main):006:0> i.search("case")
> => #<struct Ferret::Search::TopDocs total_hits=1, hits=[#<struct
> Ferret::Search::Hit doc=0, score=0.191783010959625>],
> max_score=0.191783010959625>
> irb(main):007:0>
>
> So no, '_' is not considered alphanumeric (or in this case alpha, as
> the AsciiLetterAnalyzer won't match numbers).

Yes, it seems to work correctly, but I've noticed that
index.search_each doesn't return more than 10 documents. Is there an
option to change that?

>> you initialize three VALUE objects but never mark them for the
>> garbage collector, so eventually they are freed behind my back.
>> What you should do is keep the type of the stream in the
>> TokenStream structure and rework the frt_ts_mark method.
>>
>> Hope that helps,
>> Kent
>
> Actually, frt_rets_mark already marks the three VALUE objects
> correctly. What would really help would be if you could give me an
> example script that segfaults. If you can do this, I'll fix it and
> get a new gem out as soon as possible.

I guess I didn't look carefully enough at the code.
David Balmain wrote:
> On 9/6/06, David Balmain <dbalmain.ml at gmail.com> wrote:
>> On 9/6/06, David Balmain <dbalmain.ml at gmail.com> wrote:
>> > correctly. What would really help would be if you could give me
>> > an example script that segfaults. If you can do this, I'll fix it
>> > and get a new gem out as soon as possible.
>>
>> Actually, hold off on that, I think I've found the problem.
>
> Hi Kent,
>
> I've put in a fix which I think should cure your segfault.
> Unfortunately I can't seem to replicate the bug here to test it.
> Even calling GC.start doesn't seem to collect any of the three
> VALUEs in RegExpTokenStream. I've had problems like this before when
> trying to test an implementation of a weak-key Hash. I really need
> to look into how the Ruby garbage collector works; it never seems to
> behave predictably for me.
>
> Anyway, I was hoping you could help me out, either by testing your
> code against the latest version of Ferret in Subversion or by
> sending me a short (or long, I don't really care) script which
> causes the problem. If it'll make it any easier I can email you a
> gem of the current working version of Ferret.

Can you send it to my email at ksibilev at yahoo dot com?

Thanks.
On 9/6/06, Kent Sibilev <ksibilev at yahoo.com> wrote:
> Yes, it seems to work correctly, but I've noticed that
> index.search_each doesn't return more than 10 documents. Is there an
> option to change that?

Yep, :limit. The documentation is wrong in 0.10.2; it will be
corrected in the next version.

  index.search_each(query, :limit => 20) #...

Or you can get all results like this:

  index.search_each(query, :limit => :all) #...

If you are paging through results, use :offset:

  index.search_each(query, :limit => 20, :offset => 40) #...

Cheers,
Dave
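P.S. Putting :limit and :offset together, paging looks something like
this (a sketch; search_each yields each matching document id with its
score):

  page, per_page = 3, 20
  index.search_each(query, :limit => per_page,
                           :offset => (page - 1) * per_page) do |id, score|
    puts "doc #{id} scored #{score}"
  end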
On 9/6/06, David Balmain <dbalmain.ml at gmail.com> wrote:
> On 9/6/06, Kent Sibilev <ksibilev at yahoo.com> wrote:
> > Yes, it seems to work correctly, but I've noticed that
> > index.search_each doesn't return more than 10 documents. Is there
> > an option to change that?
>
> Yep, :limit. The documentation is wrong in 0.10.2; it will be
> corrected in the next version.
>
>   index.search_each(query, :limit => 20) #...
>
> Or you can get all results like this:
>
>   index.search_each(query, :limit => :all) #...
>
> If you are paging through results, use :offset:
>
>   index.search_each(query, :limit => 20, :offset => 40) #...

Perfect, but I think it would be less confusing if you made :all the
default value for :limit.

BTW, Ferret is quite a bit faster than Lucene. I haven't checked all
query types, but the ones I use most of the time are fast. Congrats!

--
Kent
---
http://www.datanoise.com