Colin Cc
2006-Sep-26 22:13 UTC
[Ferret-talk] Scoring/similarity, biased towards small fields?
Lucene, and perhaps most search engines, are biased towards small fields with little content (where the term frequency is thus relatively higher). Lucene lets you define a custom Similarity class to control how fields in different documents are scored (custom calculation of lengthNorm and tf). But how do I do this in Ferret? (I know how to boost a field, but I don't think that is what I need; I need to be able to influence the relative importance of the same field across documents.)
-- Posted via http://www.ruby-forum.com/.
Colin Cc
2006-Sep-26 22:16 UTC
[Ferret-talk] Scoring/similarity, biased towards small fields?
Forgot to say, Ferret seems to be really amazing, especially considering how much it has been improved in the last couple of months!
David Balmain
2006-Sep-27 03:30 UTC
[Ferret-talk] Scoring/similarity, biased towards small fields?
On 9/27/06, Colin Cc <colin at 3x.to> wrote:
> Lucene, and perhaps most search engines, are biased towards small fields
> with little content (where thus the term frequency is higher). Lucene
> has the option to define a custom (Similarity) class to calculate the
> similarity between two fields (custom calculation of lengthNorm and tf)
> in different documents. But how do I do this in ferret? (I know to boost
> a field, but this is not what I (think to) need, I need to be able to
> influence the relative importance between the same field)
>

Hi Colin,

Ferret uses the same similarity scoring as Lucene. Scoring is based more on the ratio of number of matches to the length of the field, rather than just the length of the field. So a small field with a single match will score higher than a large field with a single match. But a large field with many matches may still score more highly than a small field with a single match.

The Similarity class is still unavailable in the Ruby API and it isn't high on my list of priorities to write the bindings for it (unless someone was willing to compensate me). However, I don't think you need it for what you are describing. Boosts should do the job perfectly. If you want to make the :title field more important than the :content field then you set the boost of the :title FieldInfo, probably like this:

    fi = FieldInfos.new
    fi.add_field(:title, :boost => 10.0)

But I think you want to make the same field more important in different documents. So you can set the boost of the field when you add it. You can either set the boost for the whole document:

    doc = Ferret::Document.new(20.0)
    doc[:title] = "Braveheart"
    doc[:actors] = ["Mel Gibson", "Sophie Marceau"]

This will affect all fields in the document. Or you can set the boost of the field directly:

    doc = {
      :title => Field.new("Legally Blonde", 0.02),
      :actors => Field.new(["Reese Witherspoon", "Luke Wilson"], 2.0)
    }

Hope that helps,
Cheers,
Dave
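To make the "ratio of matches to field length" point concrete, here is a rough sketch in plain Ruby. It does not call Ferret at all. It assumes the classic Lucene-style defaults (tf = sqrt(freq), lengthNorm = 1/sqrt(number of terms)) and leaves out idf, boosts, coord and the lossy one-byte norm encoding, so the numbers are only illustrative. The method names are made up for this example.

```ruby
# Simplified per-field score factors. Assumption: classic Lucene-style
# defaults. Not Ferret's API; the method names here are hypothetical.

# Term-frequency factor: grows with the number of matches, but sub-linearly.
def tf(freq)
  Math.sqrt(freq)
end

# Length normalisation: shrinks as the field gets longer.
def length_norm(num_terms)
  1.0 / Math.sqrt(num_terms)
end

# Product of the two factors for one term in one field.
def field_score(freq, num_terms)
  tf(freq) * length_norm(num_terms)
end

short_single = field_score(1, 10)      # one match in a 10-term field
long_single  = field_score(1, 1000)    # one match in a 1000-term field
long_many    = field_score(200, 1000)  # 200 matches in a 1000-term field

puts "short field, 1 match:    #{short_single}"
puts "long field, 1 match:     #{long_single}"
puts "long field, 200 matches: #{long_many}"
```

Under these assumptions the short field with one match beats the long field with one match, while the long field with many matches can still come out on top.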
Colin Cc
2006-Sep-27 09:30 UTC
[Ferret-talk] Scoring/similarity, biased towards small fields?
Thanks for answering! I couldn't find anything of relevance in the docs/API; now I know not to look for that functionality in the Ruby API again :) Actually, boosting doesn't really help in my case. I use Lucene to index some articles with bodies of variable length. But whether a word occurs in a short or a long article, the article is supposed to be equally relevant (of course, words occurring in title fields will make the result more important; for that there is boosting, and this bias towards short fields). But it's only a small issue; maybe I'll start digging through the source code sometime to see if I can add it.
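What is being asked for here amounts to a length norm that ignores field length. The sketch below is plain Ruby arithmetic, not Ferret code (per the thread, the Similarity class is not exposed in the Ruby API). It assumes the classic tf = sqrt(freq) factor and compares the usual 1/sqrt(length) norm with a hypothetical flat norm of 1.0. All method names are invented for illustration.

```ruby
# Assumption: simplified Lucene-style factors; idf and boosts are left out.
# All names below are hypothetical and are not part of Ferret.

def term_freq_factor(freq)
  Math.sqrt(freq)
end

# Usual behaviour: longer fields are penalised.
def default_length_norm(num_terms)
  1.0 / Math.sqrt(num_terms)
end

# Desired behaviour from the post: field length has no effect.
def flat_length_norm(_num_terms)
  1.0
end

# Score of a body field with `freq` matches and `num_terms` terms,
# using whichever norm method is named by `norm`.
def body_score(freq, num_terms, norm)
  term_freq_factor(freq) * send(norm, num_terms)
end

short_article = 50    # terms in a short body
long_article  = 5000  # terms in a long body

puts "default norm: #{body_score(1, short_article, :default_length_norm)} vs " \
     "#{body_score(1, long_article, :default_length_norm)}"
puts "flat norm:    #{body_score(1, short_article, :flat_length_norm)} vs " \
     "#{body_score(1, long_article, :flat_length_norm)}"
```

With the flat norm, one occurrence scores the same in both articles. A word that occurs more often would still score higher through the tf factor, so a custom tf would be needed to flatten that as well.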