thr3ads.net - Ferret talk - [Ferret-talk] Regexpr. analyzer [Oct 2006]


hawe

2006-Oct-27 15:15 UTC

[Ferret-talk] Regexpr. analyzer

Hi!

I want to index HTML files, but without the tags, so I was thinking I could
either remove them before indexing (expensive) or set up a RegExpAnalyzer.
BTW, when using an analyzer, does that mean that everything it rejects
(i.e. whatever the RegExpAnalyzer doesn't match) won't be put into the
index files (i.e. won't blow them up)?

I came up with a simple test, which didn't work in acts_as_ferret and
doesn't work in pure Ferret either. I expected that, with the code
below, only "abc" would be indexed, as only it matches the regexp.
What's wrong?

@index = Ferret::Index::Index.new(
    :path => 'c:/projects/peter/lib/ferretidx',
    :analyzer => RegExpAnalyzer.new(/[a-f]/))

@index << {:id => "15", :title => "Programming Ruby",
    :content => "<strong>some thing abc</strong>"}

@index.search_each('content:"some"') do |id, score|
   puts "Document #{id} found with a score of #{score}"
end

Thanks a lot,
hawe.

-- 
Posted via http://www.ruby-forum.com/.

Andreas Korth

2006-Oct-27 17:45 UTC

[Ferret-talk] Regexpr. analyzer

On 27.10.2006, at 17:15, hawe wrote:
> I want to index html files, but w/o the tags, so I was thinking either I
> remove them before I index it (expensive), or put up an RegExpAnalyzer.
What's so expensive about stripping the tags prior to adding the HTML
to the index? I'm not sure which regex engine RegExpAnalyzer uses,
but Ruby's regex engine is implemented in C, so it shouldn't make
much of a difference.
> BTW, when using an analyzer, does that mean that everything which it
> declines (i.e. the RegExpAnalyzer doesn't match) won't be put into the
> index files (i.e. blows it up)?
Yep. That's why you should use this analyzer only for the field
that's used to index the HTML, perhaps by using a PerFieldAnalyzer.
> I came up with a simple test, which didn't work in act_as_ferret, but
> now in pure ferret doesn't work as well. I expected, with the code
> below, that only "abc" will be indexed, as only it matches the
> regexpr.
> What's wrong?
>
> @index = Ferret::Index::Index.new(:path =>
> 'c:/projects/peter/lib/ferretidx',
>     :analyzer => RegExpAnalyzer.new(/[a-f]/))
>
> @index << {:id => "15", :title => "Programming Ruby", :content =>
> "<strong>some thing abc</strong>"}
>
> @index.search_each('content:"some"') do |id, score|
>    puts "Document #{id} found with a score of #{score}"
> end
Consider:

index = Ferret::I.new(
    :analyzer => Ferret::Analysis::RegExpAnalyzer.new(/[a-f]/))

index << "prose"
index << "fade"

index.search("prose").total_hits  # -> 2

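What happens is that "prose" is reduced to the single token "e", while
every letter of "fade" matches and is kept (as single-character tokens,
since /[a-f]/ matches one character at a time).
Ferret uses the same analyzer for indexing and query parsing. As a
consequence, index.search("prose") becomes index.search("e"), which
matches both "fade" and "prose".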
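
The effect can be reproduced without Ferret. What follows is only a
minimal sketch in plain Ruby, assuming that a regexp-based analyzer
emits one lowercased token per match of the pattern. The names tokens,
DOCS, INDEX and search are hypothetical helpers made up for this
illustration; they are not part of Ferret's API.

```ruby
# Hypothetical stand-in for a regexp analyzer: one token per match.
PATTERN = /[a-f]/

def tokens(text, pattern = PATTERN)
  text.downcase.scan(pattern)
end

DOCS = ["prose", "fade"]

# Toy inverted index: token => ids of the documents containing it.
INDEX = Hash.new { |hash, key| hash[key] = [] }
DOCS.each_with_index do |doc, id|
  tokens(doc).uniq.each { |tok| INDEX[tok] << id }
end

# The query is run through the same analyzer as the documents were.
# Assumption: a document matches if it contains any query token.
def search(query)
  tokens(query).flat_map { |tok| INDEX.fetch(tok, []) }.uniq.sort
end

p tokens("prose")   # => ["e"]
p tokens("fade")    # => ["f", "a", "d", "e"]
p search("prose")   # => [0, 1]
```

Both documents come back for "prose" because the only token that
survives analysis of the query is "e", and both documents contain it.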

I'd suggest you use a separate tag stripper instead of using
RegExpAnalyzer. Proper tag stripping can't be done with a trivial regexp,
especially if you're dealing with non-well-formed documents.
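
To give an idea of what a separate stripping step could look like,
here is a deliberately naive sketch in plain Ruby. strip_tags is a
hypothetical helper written for this example, not something Ferret
provides, and it assumes reasonably well-formed markup.

```ruby
# Naive tag stripper: a sketch only, not a real HTML parser.
def strip_tags(html)
  text = html.gsub(/<!--.*?-->/m, " ")                      # drop comments
  text = text.gsub(/<(script|style)\b.*?<\/\1\s*>/mi, " ")  # drop script/style bodies
  text = text.gsub(/<[^>]*>/, " ")                          # drop remaining tags
  text.gsub(/\s+/, " ").strip                               # collapse whitespace
end

puts strip_tags("<strong>some thing abc</strong>")  # prints: some thing abc
```

The stripped text can then be added to the index with an ordinary
analyzer. Note that this sketch already breaks on input such as a ">"
inside an attribute value, where part of the tag is left behind as
text, which is exactly why a proper stripper or HTML parser is the
safer choice.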

HTH,
Andy
