thr3ads.net - Ferret talk - [Ferret-talk] Scoring/similarity, biased towards small fields? [Sep 2006]

Colin Cc

2006-Sep-26 22:13 UTC

[Ferret-talk] Scoring/similarity, biased towards small fields?

Lucene, and perhaps most search engines, is biased towards small fields
with little content (where the term frequency is therefore relatively
higher). Lucene has the option to define a custom Similarity class to
calculate the similarity between two fields (custom calculation of
lengthNorm and tf) in different documents. But how do I do this in
Ferret? (I know how to boost a field, but I don't think that is what I
need: I need to be able to influence the relative importance of the same
field across documents.)

-- 
Posted via http://www.ruby-forum.com/.

Colin Cc

2006-Sep-26 22:16 UTC

Forgot to say, ferret seems to be really amazing, especially considering 
how much it has been improved in the last couple of months!

David Balmain

2006-Sep-27 03:30 UTC

On 9/27/06, Colin Cc <colin at 3x.to> wrote:
> Lucene, and perhaps most search engines, is biased towards small fields
> with little content (where the term frequency is therefore relatively
> higher). Lucene has the option to define a custom Similarity class to
> calculate the similarity between two fields (custom calculation of
> lengthNorm and tf) in different documents. But how do I do this in
> Ferret? (I know how to boost a field, but I don't think that is what I
> need: I need to be able to influence the relative importance of the same
> field across documents.)
>
Hi Colin,

Ferret uses the same similarity scoring as Lucene. Scoring is based
more on the ratio of number of matches to the length of the field,
rather than just the length of the field. So a small field with a
single match will score higher than a large field with a single match.
But a large field with many matches may still score more highly than a
small field with a single match.
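To make the ratio argument concrete, here is a rough sketch in plain Ruby of the two factors involved. It assumes the formulas of Lucene's classic default similarity, tf = sqrt(freq) and lengthNorm = 1/sqrt(number of terms in the field), and leaves out idf, boosts, coord and the lossy one-byte encoding of norms, so the numbers are illustrative only and not what Ferret would actually report. The method names are made up for this example.

```ruby
# Simplified per-field score: tf * lengthNorm only.
# Assumption: Lucene's classic default formulas; idf, boosts and
# norm encoding are deliberately ignored.
def tf(freq)
  Math.sqrt(freq)
end

def length_norm(num_terms)
  1.0 / Math.sqrt(num_terms)
end

def field_score(freq, num_terms)
  tf(freq) * length_norm(num_terms)
end

short_single = field_score(1, 9)     # 9-term field, 1 match
long_single  = field_score(1, 100)   # 100-term field, 1 match
long_many    = field_score(16, 100)  # 100-term field, 16 matches

puts format("short field, 1 match:    %.3f", short_single)
puts format("long field, 1 match:     %.3f", long_single)
puts format("long field, 16 matches:  %.3f", long_many)
```

Under these assumptions the short field with one match outscores the long field with one match, while the long field with sixteen matches outscores both.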

The Similarity class is still unavailable in the Ruby API and it isn't
high on my list of priorities to write the bindings for it (unless
someone was willing to compensate me). However, I don't think you need
it for what you are describing. Boosts should do the job perfectly. If
you want to make the :title field more important than the :content
field then you set the boost of the :title FieldInfo, probably like
this:

    require 'ferret'
    fi = Ferret::Index::FieldInfos.new
    fi.add_field(:title, :boost => 10.0)

But I think you want to make the same field more important in
different documents. So you can set the boost of the field when you
add it. You can either set the boost for the whole document:

    doc = Ferret::Document.new(20.0)
    doc[:title] = "Braveheart"
    doc[:actors] = ["Mel Gibson", "Sophie Marceau"]

This will affect all fields in the document. Or you can set the boost
of the field directly.

    doc = {
        :title => Ferret::Field.new("Legally Blonde", 0.02),
        :actors => Ferret::Field.new(["Reese Witherspoon", "Luke Wilson"], 2.0)
    }

Hope that helps,
Cheers,
Dave

Colin Cc

2006-Sep-27 09:30 UTC

Thanks for answering! I couldn't find anything relevant in the
docs/API, so now I know not to look for that functionality in the Ruby
API again :)

Actually, boosting doesn't really help in my case. I use Lucene to index
some articles with bodies of variable length. Whether a word occurs in a
short or a long article, the article is supposed to be equally relevant.
(Of course, words occurring in title fields should make the result more
important; for that there is boosting, and this bias towards short
fields.)
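What is being asked for here amounts to a similarity whose lengthNorm ignores field length. As a sketch only (plain Ruby, not Ferret's API, which according to the reply above does not expose Similarity), the difference might look like this. The formulas are assumed from Lucene's classic defaults, idf and boosts are ignored, and all names are hypothetical.

```ruby
# Hypothetical comparison, not Ferret API: default vs. flat length norm.
# Assumption: tf = sqrt(freq), default lengthNorm = 1/sqrt(num_terms).
default_norm = ->(num_terms) { 1.0 / Math.sqrt(num_terms) }
flat_norm    = ->(_num_terms) { 1.0 }  # field length has no effect

def score(freq, num_terms, norm)
  Math.sqrt(freq) * norm.call(num_terms)
end

# One occurrence of the word in a 50-term and a 5000-term article body.
short_default = score(1, 50, default_norm)
long_default  = score(1, 5000, default_norm)
short_flat    = score(1, 50, flat_norm)
long_flat     = score(1, 5000, flat_norm)

puts format("default norm: short %.4f, long %.4f", short_default, long_default)
puts format("flat norm:    short %.4f, long %.4f", short_flat, long_flat)
```

With the flat norm both articles score the same for a single occurrence. Repeated occurrences would still raise the score through tf, so an article that mentions the word often would still rank above one that mentions it once.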

But it's only a small issue. Maybe I'll start digging through the
source code sometime to see if I can add it.


