On 27.10.2006, at 17:15, hawe wrote:
> I want to index html files, but w/o the tags, so I was thinking either I
> remove them before I index it (expensive), or put up an RegExpAnalyzer.
What's so expensive about stripping the tags prior to adding the HTML
to the index? I'm not sure which regex engine RegExpAnalyzer uses,
but Ruby's regex engine is implemented in C, so it shouldn't make
much of a difference.
> BTW, when using an analyzer, does that mean that everything which it
> declines (i.e. the RegExpAnalyzer doesn't match) won't be put into the
> index files (i.e. blows it up)?
Yep. That's why you should use this analyzer only for the field
that's used to index the HTML, perhaps by using a PerFieldAnalyzer.
> I came up with a simple test, which didn't work in act_as_ferret, but
> now in pure ferret doesn't work as well. I expected, with the code
> below, that only "abc" will be indexed, as only it matches the
> regexpr.
> What's wrong?
>
> @index = Ferret::Index::Index.new(:path => 'c:/projects/peter/lib/ferretidx',
>   :analyzer => RegExpAnalyzer.new(/[a-f]/))
>
> @index << {:id => "15", :title => "Programming Ruby", :content =>
>   "<strong>some thing abc</strong>"}
>
> @index.search_each('content:"some"') do |id, score|
>   puts "Document #{id} found with a score of #{score}"
> end
Consider:
index = Ferret::I.new(:analyzer =>
  Ferret::Analysis::RegExpAnalyzer.new(/[a-f]/))
index << "prose"
index << "fade"
index.search("prose").total_hits # -> 2
What happens is that "prose" is reduced to the single token "e", while
every character of "fade" matches the pattern, so nothing of it is dropped.
Ferret uses the same analyzer for indexing and query parsing. As a
consequence, index.search("prose") becomes index.search("e"), which
matches both "fade" and "prose".
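To see the effect without Ferret in the picture, here is a rough plain-Ruby approximation. It assumes a regex analyzer boils down to "keep whatever the pattern matches" and uses String#scan for that; the tokens helper is made up for this illustration and is not Ferret's API.

```ruby
# Approximation of a regex-based analyzer: keep only the pieces of the
# text that match the pattern. `tokens` is a hypothetical helper.
def tokens(text, pattern = /[a-f]/)
  text.downcase.scan(pattern)
end

docs = ["prose", "fade"]

# The same tokenizer runs at index time and at query time.
indexed     = docs.map { |doc| [doc, tokens(doc)] }
query_terms = tokens("prose")

# A document is a hit if it shares at least one term with the query.
hits = indexed.select { |_, terms| (terms & query_terms).any? }.map(&:first)

puts "query terms: #{query_terms.inspect}"
puts "hits: #{hits.inspect}"
```

Both documents come back, because the only thing left of the query is the term they have in common.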
I'd suggest you use a separate tag stripper instead of using
RegExpAnalyzer. Proper tag stripping is not a trivial RegExp,
especially if you're dealing with non-well-formed documents.
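As a starting point, a deliberately naive stripper might look like the sketch below. strip_tags is a hypothetical helper, not something Ferret provides, and it only copes with reasonably well-formed markup; for messy real-world pages a real HTML parser is the safer route.

```ruby
# Naive tag stripper, for illustration only. It assumes reasonably
# well-formed markup and will stumble over broken HTML.
def strip_tags(html)
  text = html.gsub(/<!--.*?-->/m, " ")                            # drop comments
  text = text.gsub(/<(script|style)\b[^>]*>.*?<\/\1\s*>/mi, " ")  # drop script/style bodies
  text = text.gsub(/<[^>]*>/, " ")                                # drop remaining tags
  text.gsub(/\s+/, " ").strip                                     # collapse whitespace
end

plain = strip_tags("<strong>some thing abc</strong>")
puts plain

# The stripped text is what would then be handed to the index, e.g.
#   index << {:id => "15", :title => "Programming Ruby", :content => plain}
```

That way the index keeps its normal analyzer and a query for "some" behaves as you expected.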
HTH,
Andy