Hi,

I'm confused about managing field boosting ...

I have set the :boost for the :name field in my docs to 10, via :boost => 10

Then I performed a search for 'keith' over all fields with *:(keith*), expecting a doc with Keith in the :name field to come out on top. But another doc with Keith mentioned in other fields (:comments, :address) scored higher.

I viewed the explanation from the searcher, but it wasn't clear to me why the boost wasn't pushing the :name = Keith document to the top.

Any help on understanding field boosting and explain would be great.

Regards

Neville

PS, the two explains are:

Doc1:
0.3352959 = product of:
  8.047102 = sum of:
    4.011141 = weight(comments:<keith|keithb at zzzzzz.com|keithex> in 4697), product of:
      0.5685414 = query_weight(comments:<keith|keithb at zzzzzz.com|keithex>), product of:
        28.22057 = idf(comments:<(keithex=1) + (keithb at zzzzzz.com=1) + (keith=115) = 117>)
        0.02014635 = query_norm
      7.055143 = field_weight(comments:<keith|keithb at zzzzzz.com|keithex> in 4697), product of:
        1.0 = The sum of:
          1.0 = tf(term_freq(comments:keithex)=1)^1.0
        28.22057 = idf(comments:<(keithex=1) + (keithb at zzzzzz.com=1) + (keith=115) = 117>)
        0.25 = field_norm(field=comments, doc=4697)
    4.03596 = weight(address:<keith|keithex> in 4697), product of:
      0.4032613 = query_weight(address:<keith|keithex>), product of:
        20.0166 = idf(address:<(keithex=1) + (keith=8) = 9>)
        0.02014635 = query_norm
      10.0083 = field_weight(address:<keith|keithex> in 4697), product of:
        1.0 = The sum of:
          1.0 = tf(term_freq(address:keithex)=1)^1.0
        20.0166 = idf(address:<(keithex=1) + (keith=8) = 9>)
        0.5 = field_norm(field=address, doc=4697)
  0.04166667 = coord(2/48)

Doc2:
0.2977623 = product of:
  14.29259 = weight(name:<keith> in 31416), product of:
    0.2028171 = query_weight(name:<keith>), product of:
      10.06719 = idf(name:<(keith=3) = 3>)
      0.02014635 = query_norm
    70.47034 = field_weight(name:<keith> in 31416), product of:
      1.0 = The sum of:
        1.0 = tf(term_freq(name:keith)=1)^1.0
      10.06719 = idf(name:<(keith=3) = 3>)
      7.0 = field_norm(field=name, doc=31416)
  0.02083333 = coord(1/48)
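
The top-level arithmetic in the two explains can be re-done by hand. The following is a minimal Ruby sketch that uses only the numbers printed above; the helper name `explain_score` is made up for illustration and is not part of Ferret.

```ruby
# Recompute the top-level scores from the explain output.
# explain_score is a hypothetical helper, not a Ferret API.
def explain_score(clause_weights, matched, total)
  coord = matched.to_f / total     # e.g. coord(2/48)
  clause_weights.sum * coord       # "product of": (sum of clause weights) * coord
end

DOC1 = explain_score([4.011141, 4.03596], 2, 48)  # comments + address clauses
DOC2 = explain_score([14.29259], 1, 48)           # name clause only

puts format("Doc1: %.4f", DOC1)
puts format("Doc2: %.4f", DOC2)
```

Note that the single name clause (14.29) outweighs both of Doc1's clauses together (about 8.05); Doc1 still comes out ahead because its coord factor, 2/48, is twice that of Doc2.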
Hi!

On Wed, Sep 20, 2006 at 03:40:03PM +1000, Neville Burnell wrote:
> I have set the :boost for the :name field in my docs to 10, via :boost
> => 10
>
> Then I performed a search for 'keith' over all fields with
> *:(keith*), expecting a doc with Keith in the :name field to come out on
> top. But another doc with Keith mentioned in other fields (:comments,
> :address) scored higher.
>
> I viewed the explanation from the searcher, but it wasn't clear to me
> why the boost wasn't pushing the :name = Keith document to the top.

As you can see from the explanation, the scores for both fields that matched the query were summed (8... = sum of:). If 'keith' had shown up in only one field, the other document would have had the higher score.

I don't know of any methodology to determine the proper boost setting for a field; imho it's just a question of experimenting with queries and the results you expect.

If you always want matches in the name ranked at the top, regardless of how many times a term is mentioned in other parts of your document, set the boost to 100 ;-)

I don't know what the coord value is, though. Maybe someone else can step in here?

Jens

> _______________________________________________
> Ferret-talk mailing list
> Ferret-talk at rubyforge.org
> http://rubyforge.org/mailman/listinfo/ferret-talk

--
webit! Gesellschaft für neue Medien mbH          www.webit.de
Dipl.-Wirtschaftsingenieur Jens Krämer       kraemer at webit.de
Schnorrstraße 76                         Tel +49 351 46766  0
D-01069 Dresden                          Fax +49 351 46766 66
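
Jens's point about a single-field match can be checked against the explain numbers. This is a rough sketch, not something run against Ferret: it assumes that if Doc1 had matched in only one field, that clause's weight would stay as reported and its coord would drop from 2/48 to 1/48. The constant names are illustrative only.

```ruby
# Counterfactual built from the explain numbers: Doc1 matching in one field only.
# ASSUMPTION: clause weights stay as reported; only coord changes to 1/48.
TOTAL_CLAUSES = 48

DOC2_SCORE         = 14.29259 * 1 / TOTAL_CLAUSES  # name match, as reported
DOC1_ADDRESS_ONLY  = 4.03596  * 1 / TOTAL_CLAUSES  # hypothetical: address match only
DOC1_COMMENTS_ONLY = 4.011141 * 1 / TOTAL_CLAUSES  # hypothetical: comments match only

puts DOC1_ADDRESS_ONLY < DOC2_SCORE
puts DOC1_COMMENTS_ONLY < DOC2_SCORE
```

Under that assumption, either single-field version of Doc1 scores well below the boosted name match, which is the behaviour Neville expected.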
On 9/20/06, Neville Burnell <Neville.Burnell at bmsoft.com.au> wrote:
> I have set the :boost for the :name field in my docs to 10, via :boost
> => 10
>
> I viewed the explanation from the searcher, but it wasn't clear to me
> why the boost wasn't pushing the :name = Keith document to the top.
>
> Any help on understanding field boosting and explain would be great.
>
> ...
> 0.25 = field_norm(field=comments, doc=4697)
> ...
> 0.5 = field_norm(field=address, doc=4697)
> ...
> 7.0 = field_norm(field=name, doc=31416)
> 0.02083333 = coord(1/48)

Hi Neville,

The field's boost value affects the field_norm value in the Explanations above. Here is how it is calculated:

    field_norm = field_info->boost * doc->boost * field->boost * (1 / sqrt(field->num_terms))

So as you can see from the Explanations above, field_norm is 7.0 on the boosted field, which is more than 10 times the field_norms on the other two fields (0.25, 0.5), so at least you can see the boost is having an effect. The address field probably has a higher field_norm value than the comments field because the comments field is longer (see the last part of the field_norm equation). Note that the reason the boost shows up as 7.0 and not 10.0 is that the field_norm gets stored in a single byte, so there is quite a large loss of precision.

Having said all this, there does seem to be a problem with the calculations. I don't think I've calculated the idf value correctly for MultiTermQueries. I've rectified this in subversion, so the next version should give your results in the order you'd expect.

For information on tf and idf, check out this page:

http://en.wikipedia.org/wiki/Tf-idf

Hope that helps. I'd love to give a better explanation of the scoring but I don't have time right now.

Cheers,
Dave
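
Dave's field_norm formula can be tried out in a few lines of Ruby. This is only a sketch under stated assumptions, none of which are confirmed in the thread: that the :name field of doc 31416 held two terms, that the document and per-field boosts were left at 1.0, and that the one-byte storage keeps three mantissa bits in the way Lucene's norm encoding does. `field_norm` and `one_byte_precision` are illustrative names, not Ferret functions.

```ruby
# field_norm = field_info->boost * doc->boost * field->boost * (1 / sqrt(num_terms))
def field_norm(field_info_boost, doc_boost, field_boost, num_terms)
  field_info_boost * doc_boost * field_boost * (1.0 / Math.sqrt(num_terms))
end

# ASSUMPTION: emulate a one-byte float by keeping the leading bit plus three
# mantissa bits and truncating the rest. Only meaningful for positive values.
def one_byte_precision(x)
  fraction, exponent = Math.frexp(x)   # x == fraction * 2**exponent, 0.5 <= fraction < 1
  Math.ldexp((fraction * 16).floor / 16.0, exponent)
end

RAW_NORM     = field_norm(10.0, 1.0, 1.0, 2)   # ASSUMPTION: two terms in :name
STORED_NORM  = one_byte_precision(RAW_NORM)
FIELD_WEIGHT = 1.0 * 10.06719 * STORED_NORM    # tf * idf * field_norm

puts RAW_NORM
puts STORED_NORM
puts format("%.4f", FIELD_WEIGHT)
```

Under these assumptions the raw norm of about 7.07 is stored as 7.0, and tf * idf * field_norm lands on the 70.47 field_weight shown in the explain for the name field. A different term count or byte encoding would give different numbers, so treat this as one plausible reading of the 7.0 rather than a description of Ferret's internals.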
On 9/20/06, Jens Kraemer <kraemer at webit.de> wrote:
> as you can see from the explanation, the score for both fields that
> matched the query got summed up (8... = sum of:), if 'keith' only had
> shown up in one field, the other document would have had the higher
> score.
>
> I don't know what the coord value is, though, maybe someone else can
> step in here ?
>
> Jens

The coord factor is the number of clauses in a BooleanQuery that matched, divided by the total number of clauses. It would seem that in the example there were 48 clauses. When you submit a query over all fields (i.e. "*:term"), the query is rewritten as a boolean query with a clause for every field in your index. So it would seem that Neville has 48 fields in his index.

Hope that makes sense,
Dave

PS: This might be a good time to mention that if you have an index with a lot of fields like this, it is probably worth thinking about what to set the :default_field and :all_fields parameters to. :all_fields is what "*:#{query}" expands to. It doesn't necessarily have to be all fields in the index. Usually you only want "*" to expand to all text fields, not actually all fields. For example, you'd probably want date fields to be excluded. And I've only just fixed this so it will work when you use a Ferret::Index::Index object. Previously the QueryParser had all fields in the index added to the :all_fields parameter. Now that only happens if :all_fields isn't set explicitly.
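
To make the coord factor concrete, here is a sketch in plain Ruby (no Ferret calls) of how a "*:term" query fans out into one clause per field and how coord then works out as matched clauses over total clauses. The field names other than :name, :comments and :address are placeholders, and `expand_all_fields` and `coord` are illustrative helpers, not Ferret's implementation.

```ruby
# Illustration only: expand "*:term" into one clause per field in all_fields.
def expand_all_fields(term, all_fields)
  all_fields.map { |field| "#{field}:#{term}" }
end

# coord = matched clauses / total clauses in the boolean query
def coord(matched_clauses, total_clauses)
  matched_clauses.to_f / total_clauses
end

# Placeholder field list: three names from the thread plus 45 made-up ones,
# giving the 48 fields seen in the explain output.
ALL_FIELDS = [:name, :comments, :address] + (1..45).map { |i| :"field_#{i}" }
CLAUSES    = expand_all_fields("keith*", ALL_FIELDS)

puts CLAUSES.size
puts CLAUSES.first(3).inspect
puts coord(2, CLAUSES.size)   # Doc1 matched in two fields
puts coord(1, CLAUSES.size)   # Doc2 matched in one field
```

A shorter :all_fields list means fewer clauses and therefore a smaller divisor in coord, which is why the choice of :all_fields shows up directly in the explain output.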
Thanks Dave,

Having boost seemingly absent from the explain calculation confused me, but your explanation of field_norm helps a lot.

Neville

-----Original Message-----
From: ferret-talk-bounces at rubyforge.org [mailto:ferret-talk-bounces at rubyforge.org] On Behalf Of David Balmain
Sent: Wednesday, 20 September 2006 8:22 PM
To: ferret-talk at rubyforge.org
Subject: Re: [Ferret-talk] Understanding boost ?

The field's boost value affects the field_norm value in the Explanations above. Here is how it is calculated:

    field_norm = field_info->boost * doc->boost * field->boost * (1 / sqrt(field->num_terms))
>> it is probably worth thinking about what to
>> set the :default_field and :all_fields parameters to.

Hi Dave,

Thanks for pointing this out.

Neville

-----Original Message-----
From: ferret-talk-bounces at rubyforge.org [mailto:ferret-talk-bounces at rubyforge.org] On Behalf Of David Balmain
Sent: Wednesday, 20 September 2006 9:50 PM
To: ferret-talk at rubyforge.org
Subject: Re: [Ferret-talk] Understanding boost ?

PS: This might be a good time to mention that if you have an index with a lot of fields like this, it is probably worth thinking about what to set the :default_field and :all_fields parameters to.

>> it would seem that Neville has 48 fields in his index.

Yes, there are 48 fields.

But around 17 fields are marked as :index => :no because they are only used as detail in the retrieved doc and not for indexing purposes.

Shouldn't that affect both the coord factor and the :all_fields expansion?

Kind Regards

Neville
On 9/21/06, Neville Burnell <Neville.Burnell at bmsoft.com.au> wrote:
> >> it would seem that Neville has 48 fields in his index.
>
> Yes, there are 48 fields.
>
> But around 17 fields are marked as :index => :no because they are only
> used as detail in the retrieved doc and not for indexing purposes.
>
> Shouldn't that affect both the coord factor and the :all_fields
> expansion ?
>
> Kind Regards
>
> Neville

Yes, that's a good idea. It wouldn't be too much trouble to modify Index to only add indexed fields to the :all_fields value. I'll change this in the next release. Even with 17 fields, though, it may still be worth setting :all_fields up yourself, but I have no idea what is in those 17 fields so I may be wrong. By the way, the coord factor's divisor will always be equal to the size of :all_fields in this type of query, so I just need to fix the setting of :all_fields.

Cheers,
Dave
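
As a rough illustration of what the proposed change would do to the numbers in this thread, the sketch below recomputes both scores with the unindexed fields dropped from the coord divisor. It takes Neville's "around 17" as exactly 17 for the sake of the example and assumes the clause weights stay as reported; `rescored` is a made-up helper, not a Ferret call.

```ruby
TOTAL_FIELDS   = 48
UNINDEXED      = 17                        # ASSUMPTION: "around 17" taken as exactly 17
INDEXED_FIELDS = TOTAL_FIELDS - UNINDEXED  # new coord divisor if only indexed fields count

# Hypothetical helper: clause weights from the explain, new coord divisor.
def rescored(clause_sum, matched, divisor)
  clause_sum * matched.to_f / divisor
end

DOC1_NEW = rescored(8.047102, 2, INDEXED_FIELDS)  # comments + address
DOC2_NEW = rescored(14.29259, 1, INDEXED_FIELDS)  # name

puts INDEXED_FIELDS
puts format("Doc1: %.4f", DOC1_NEW)
puts format("Doc2: %.4f", DOC2_NEW)
puts DOC1_NEW / DOC2_NEW
```

Shrinking the divisor raises both scores by the same factor, so on its own it would not change which of the two documents ranks first. The reordering Dave expects comes from the separate idf fix for MultiTermQueries that he mentions earlier in the thread.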