Marinos Yannikos
2010-Nov-01 00:59 UTC
[Xapian-discuss] floating-point issues with set_sort_by_relevance_then_value? (1.2.3, BM25 k1=0)
I am using BM25 with k1=0 and min_normlen=1 to get weights unaffected by document length and term frequency in the document (min_normlen=1 isn't necessary I guess) and am expecting single-term weights to be identical for all matches. I have added a document value to steer such general search queries and it works fine, except that for some search terms, I get results like:

              weight (BM25)                  value
    -----------------------------------------------
     1. xxx (6.3564210045800955128925125 + 4.000000)
     2. xxx (6.3564210045800955128925125 + 4.000000)
     3. xxx (6.3564210045800955128925125 + 3.500000)
     4. xxx (6.3564210045800946247140928 + 7.000000)
     5. xxx (6.3564210045800946247140928 + 6.500000)
     6. xxx (6.3564210045800946247140928 + 6.000000)
     7. xxx (6.3564210045800946247140928 + 6.000000)
     8. xxx (6.3564210045800946247140928 + 6.000000)
     9. xxx (6.3564210045800946247140928 + 6.000000)
    10. xxx (6.3564210045800946247140928 + 6.000000)

The weights then always seem to differ after the 14th/15th fractional digit and only a small number of results is affected (3 out of ~16000 with a slightly lower weight in one case, 4 out of ~70000 with a slightly higher one in another).

Platform is Debian Lenny 64bit, AMD Opteron CPUs, core-1.2.3 patched to r15140 and using chert. This also happens with complex queries where groups of results are expected to have identical weights.

FIX: I found a simple fix for this issue, at least for my test cases: I added

    if (param_k1 == 0) RETURN(termweight);

to the beginning of BM25Weight::get_sumpart in trunk/xapian-core/weight/bm25weight.cc:166

This apparently prevents floating point precision issues in the last line of get_sumpart() [which calculates termweight * wdf_double * 1 / wdf_double]. It also speeds up my case slightly. ;-)

In order to prevent more such issues, it might be a good idea to round weights to a few fractional digits (10 should be enough) before using them as sort keys.

Regards,
Marinos
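The effect described above can be reproduced outside Xapian. The following is a minimal sketch, not the actual Xapian code: `sumpart_naive` and `count_mismatches` are hypothetical names, and the expression only mirrors the quoted `termweight * wdf_double * 1 / wdf_double`, assuming IEEE 754 double arithmetic with round-to-nearest.

```cpp
#include <cassert>

// Hypothetical stand-in for the expression quoted above with k1 = 0.
// C++ evaluates it left to right: ((termweight * wdf) * 1) / wdf.
// The product is rounded to a double before the division, so the
// result is not guaranteed to be exactly termweight again.
double sumpart_naive(double termweight, double wdf) {
    return termweight * wdf * 1 / wdf;
}

// Counts the integer wdf values in 1..max_wdf for which the naive
// expression fails to reproduce termweight exactly.
int count_mismatches(double termweight, int max_wdf) {
    assert(max_wdf >= 1);
    int mismatches = 0;
    for (int wdf = 1; wdf <= max_wdf; ++wdf) {
        if (sumpart_naive(termweight, static_cast<double>(wdf)) != termweight) {
            ++mismatches;
        }
    }
    return mismatches;
}
```

As a small worked case, `sumpart_naive(0.1, 3.0)` does not compare equal to `0.1` under these assumptions, because `0.1 * 3` is rounded up before the division. Only some combinations of weight and wdf are affected, which would be consistent with just a handful of documents getting a different weight.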
Olly Betts
2010-Nov-01 10:58 UTC
[Xapian-discuss] floating-point issues with set_sort_by_relevance_then_value? (1.2.3, BM25 k1=0)
On Mon, Nov 01, 2010 at 01:59:42AM +0100, Marinos Yannikos wrote:
> This apparently prevents floating point precision issues in the last line
> of get_sumpart() [which calculates termweight * wdf_double * 1 /
> wdf_double].

Yes, for some values of wdf_double and termweight, this doesn't give exactly termweight. We should do the division, and scale termweight by the result. I've reproduced this issue and I'm currently working on a fix.

> It also speeds up my case slightly. ;-)

How much is "slightly"? Or did you just mean it's doing less work, rather than that there's a measurable speed-up.

> In order to prevent more such issues, it might be a good idea to round
> weights to a few fractional digits (10 should be enough) before using
> them as sort keys.

Rounding isn't a magic solution to such issues, and explicitly rounding all the weights is extra work. I think it's better to focus on getting the calculations right rather than trying to disguise any problems.

Cheers,
Olly
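Both points in this reply can be illustrated with a short sketch. It is not the fix that went into Xapian: `sumpart_scaled`, `round10` and `rounding_splits_neighbours` are hypothetical names, and IEEE 754 double arithmetic with round-to-nearest is assumed.

```cpp
#include <cassert>
#include <cmath>

// "Do the division, and scale termweight by the result": for any finite,
// non-zero wdf the quotient wdf * 1 / wdf is exactly 1.0, so the product
// is exactly termweight.
double sumpart_scaled(double termweight, double wdf) {
    assert(wdf != 0.0);
    return termweight * (wdf * 1 / wdf);
}

// Hypothetical helper: round a weight to 10 fractional digits.
double round10(double w) {
    return std::round(w * 1e10) / 1e10;
}

// Rounding only moves the problem: two weights a couple of ulps apart
// can sit on opposite sides of a rounding boundary and end up further
// apart after rounding than they were before.
bool rounding_splits_neighbours() {
    double x = 1.00000000005;             // close to a 10-digit rounding boundary
    double lo = std::nextafter(x, 0.0);   // next representable double below x
    double hi = std::nextafter(x, 2.0);   // next representable double above x
    return round10(lo) != round10(hi);
}
```

Here `lo` and `hi` differ only around the 16th significant digit, yet they round to 1.0 and 1.0000000001 respectively, so sorting on rounded weights could still separate documents that were meant to tie.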