On 10/21/06, ahFeel <jbordier at rift.fr> wrote:
> Hi :)
>
> Here's a little code reproducing something that i consider as a bug, if
> it's not please explain :]
>
> http://pastie.caboo.se/18693
Hi Jérémie,
You can get rid of this behaviour by building your own analyzer and
not including the HyphenFilter. This is a tricky issue which I haven't
quite worked out yet. For example, when you search for "set-up", do you
want that to match "set up" and "setup"? What if you search for
"setup" or "set up"? Should they match all three versions too? With
the current HyphenFilter, all three versions in a query will
match all three versions in the index. However, this comes at a cost in
precision. The problems occur during phrase queries. So that "set-up"
matches both "setup" and "set up", "set-up" is analyzed as "set up"
and "setup", so in the first position there are two words in the
token stream: "set" and "setup". When I parse the phrase "set-up
files" I get the two phrases:
"set____up__files"
"setup______files"
So as you can see, the second phrase only has two terms, so there is a
gap in between. To get the phrase "setup files" to match this I need to
give it a slop value.
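The position gap described above can be sketched in plain Ruby. This is only a toy model under simplifying assumptions, not Ferret's actual HyphenFilter or phrase matching: the helpers phrase_variants and phrase_match? are hypothetical names, and "slop" here just means how many positions a term may sit away from where the phrase expects it.

```ruby
# Toy model of the position gap described above. This is NOT Ferret's
# HyphenFilter or phrase matcher; both helper names are hypothetical.

# Expand a phrase into [term, position] variants: one with hyphenated words
# split into parts, one with them joined. The joined term keeps the position
# of the first part, so later terms sit after a gap.
def phrase_variants(text)
  split_variant = []
  joined_variant = []
  pos = 0
  text.downcase.split.each do |word|
    parts = word.split('-')
    parts.each_with_index { |part, i| split_variant << [part, pos + i] }
    joined_variant << [parts.join, pos]
    pos += parts.size
  end
  [split_variant, joined_variant].uniq
end

# True if the document tokens contain the phrase, allowing each term to sit
# up to `slop` positions away from where the phrase expects it.
def phrase_match?(doc_tokens, phrase, slop = 0)
  positions = Hash.new { |hash, term| hash[term] = [] }
  doc_tokens.each { |term, pos| positions[term] << pos }
  first_term, first_pos = phrase.first
  positions[first_term].any? do |start|
    phrase.drop(1).all? do |term, query_pos|
      expected = start + (query_pos - first_pos)
      positions[term].any? { |pos| (pos - expected).abs <= slop }
    end
  end
end

query = phrase_variants('set-up files')
doc = phrase_variants('setup files').first # [["setup", 0], ["files", 1]]

puts query.any? { |phrase| phrase_match?(doc, phrase, 0) } # false
puts query.any? { |phrase| phrase_match?(doc, phrase, 1) } # true
```

In this model the joined variant of the query expects "files" two positions after "setup", while a document containing "setup files" has it one position after, so the phrase only matches once a slop of 1 is allowed.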
Now I realize the solution is not ideal. I've had to forsake some
precision for a gain in recall but I can't think of a better way. If
you can come up with a fool-proof way to handle hyphenated terms I'd
love to hear it. I will probably remove the HyphenFilter from the
StandardFilter in a future version if I can't think of a better way to
do this.
By the way, for the people reading this who think that "setup" is not
a word: I agree, so consider "e-mail" and "email" instead.
Cheers,
Dave
PS: I've pasted the code below for reference. I'm not sure how long
the pasties stick around for.
require 'rubygems'
require 'ferret'

# Start from a fresh, empty index directory.
path = "/tmp/index"
system("rm -rf #{path}; mkdir -p #{path}")
index = Ferret::Index::Index.new(:path => path)

# Two documents whose names contain hyphens.
index << {:type => :bug, :name => 'foo-bar'}
index << {:type => :bug, :name => 'foo-bar-core'}

# Run each query and print the matching documents.
queries = ['foo-bar', 'foo-core']
queries.each do |name|
  query = "type:bug AND name:#{name}"
  puts "\nquery : #{query}"
  res = index.search(query)
  puts "total hits = #{res.total_hits}"
  res.hits.each { |x| p index[x.doc].load.inspect }
end