Matthew Somerville
2009-Jun-10 23:29 UTC
[Xapian-discuss] Very far out and static get_matches_estimated
Hi,

I'm getting quite odd results using get_matches_estimated() that I haven't seen before; we've just added a bunch of new data to the database. Xapian 1.0.7, checkatleast is set to 100. The database will get new stuff added to it automatically around 8.30am BST, so it's possible this might affect the links I provide, I guess. But I'll note what is currently happening as I write.

http://www.theyworkforyou.com/search/?pop=1&s=statistics+19950101..19951231
currently returns 1-20 of 14,678; page 18 gives 341-360 of 14,678:
http://www.theyworkforyou.com/search/?pop=1&s=statistics+19950101..19951231&p=18

But then page 19 gives 361-362 of 362, which is correct:
http://www.theyworkforyou.com/search/?s=statistics+19950101..19951231&p=19

So the estimate is wildly out for all pages until we get to the actual number of results. Changing the sort to relevance instead of reverse date gives a different far out number, but the effect is the same. Without the date range limiting, the initial estimate is 43,612, and this slowly changes as I up the page count until it gets to the correct result of 43,537 (good initial estimate!), as I'd expect.

It's also set by default to collapse per debate, but turning that off doesn't make any difference, it gives initially "1-20 of 30,249", up to "721-740 of 30,249" but then "741-746 of 746".

Any ideas?

ATB,
Matthew
Olly Betts
2009-Jun-11 05:00 UTC
[Xapian-discuss] Very far out and static get_matches_estimated
On Thu, Jun 11, 2009 at 12:29:06AM +0100, Matthew Somerville wrote:
> So the estimate is wildly out for all pages until we get to the actual
> number of results. Changing the sort to relevance instead of reverse
> date gives a different far out number, but the effect is the same.
> Without the date range limiting, the initial estimate is 43,612, and
> this slowly changes as I up the page count until it gets to the correct
> result of 43,537 (good initial estimate!), as I'd expect.

Well, the estimate is an estimate, and may be far from the true value. While it might not be helpful, if it's >= lower_bound and <= upper_bound, then it's "working". You can look at the bounds to see how wrong it might be:

http://trac.xapian.org/wiki/FAQ/MoreAccurateEstimates

If you don't want it to be way out when there are 362 matches, setting checkatleast to 363 or more will address that. By default, Xapian assumes you are more interested in getting the result fast than having a very accurate estimate of how many there are.

The particular problem here is that we don't have a good way to estimate what proportion of a value range will match, so currently we just guess arbitrarily that it will match half the documents it sees.

In 1.0 it's not easy to do better. We could monitor what proportion of the documents the value range is checked against actually match, but unfortunately actually using that information would need some big changes to how things work. Perhaps assuming that it matches 1/10 of the documents would be better - in most cases, underestimating is better than overestimating, I suspect.

In 1.1, chert keeps bounds on the value and knows how many documents it is set for, which could be used with the value range bounds to make an estimate assuming an even spread of values, but we don't currently do so. The key thing needed is a function to efficiently calculate how far through a string range a given string is (e.g. "b" is 0.5 of "a".."c"). Essentially, base 256 fixed point arithmetic...

Cheers,
Olly
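
For illustration, here is a minimal Python sketch of the "how far through a string range" calculation described above, plus the even-spread estimate it would feed. This is not Xapian code and not how chert does anything: the function names (fraction_through_range, estimate_range_matches) and their parameters are hypothetical, and the stored value bounds and the count of documents with the value set are assumed to be supplied by the caller. Each byte string is read as a base-256 fraction (0.b0 b1 b2 ...), with shorter strings padded with zero bytes.

```python
from fractions import Fraction


def _as_fraction(s: bytes, digits: int) -> Fraction:
    """Read s as the base-256 fixed point number 0.s[0]s[1]..., zero padded."""
    padded = s.ljust(digits, b"\0")
    return Fraction(int.from_bytes(padded, "big"), 256 ** digits)


def fraction_through_range(lo: bytes, hi: bytes, s: bytes) -> float:
    """Return how far s lies through lo..hi, clamped to [0, 1].

    Hypothetical helper: assumes lo sorts strictly before hi.
    """
    digits = max(len(lo), len(hi), len(s))
    a = _as_fraction(lo, digits)
    b = _as_fraction(hi, digits)
    x = _as_fraction(s, digits)
    if b <= a:
        raise ValueError("lo must sort strictly before hi")
    return float(min(max((x - a) / (b - a), 0), 1))


def estimate_range_matches(value_lo: bytes, value_hi: bytes, value_count: int,
                           query_lo: bytes, query_hi: bytes) -> int:
    """Estimate matches for query_lo..query_hi assuming evenly spread values.

    value_lo/value_hi are the smallest and largest stored values and
    value_count is how many documents have the value set (all assumed inputs).
    """
    start = fraction_through_range(value_lo, value_hi, query_lo)
    end = fraction_through_range(value_lo, value_hi, query_hi)
    return round(value_count * max(end - start, 0.0))


# The example from the message: "b" is halfway through "a".."c".
print(fraction_through_range(b"a", b"c", b"b"))  # 0.5
```

Python's arbitrary precision integers keep the arithmetic exact here; an efficient C++ version would presumably look only at the first few bytes where the bounds differ. Real values are rarely spread evenly, so the result is still only a guess, just a better informed one than "half the documents".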