thr3ads.net - Xapian discuss - [Xapian-discuss] Very far out and static get_matches

Matthew Somerville

2009-Jun-10 23:29 UTC

[Xapian-discuss] Very far out and static get_matches_estimated

Hi,

I'm getting quite odd results using get_matches_estimated() that I 
haven't seen before; we've just added a bunch of new data to the 
database. Xapian 1.0.7, checkatleast is set to 100.

The database will get new stuff added to it automatically around 8.30am 
BST, so it's possible this might affect the links I provide, I guess. 
But I'll note what is currently happening as I write.

http://www.theyworkforyou.com/search/?pop=1&s=statistics+19950101..19951231 
currently returns 1-20 of 14,678; page 18 gives 341-360 of 14,678:
http://www.theyworkforyou.com/search/?pop=1&s=statistics+19950101..19951231&p=18
But then page 19 gives 361-362 of 362, which is correct:
http://www.theyworkforyou.com/search/?s=statistics+19950101..19951231&p=19

So the estimate is wildly out for all pages until we get to the actual 
number of results. Changing the sort to relevance instead of reverse 
date gives a different far out number, but the effect is the same. 
Without the date range limiting, the initial estimate is 43,612, and 
this slowly changes as I up the page count until it gets to the correct 
result of 43,537 (good initial estimate!), as I'd expect.

It's also set by default to collapse per debate, but turning that off 
doesn't make any difference: it initially gives "1-20 of 30,249", up to
"721-740 of 30,249", but then "741-746 of 746".

Any ideas?

ATB,
Matthew

Olly Betts

2009-Jun-11 05:00 UTC

On Thu, Jun 11, 2009 at 12:29:06AM +0100, Matthew Somerville wrote:
> So the estimate is wildly out for all pages until we get to the actual 
> number of results. Changing the sort to relevance instead of reverse 
> date gives a different far out number, but the effect is the same. 
> Without the date range limiting, the initial estimate is 43,612, and 
> this slowly changes as I up the page count until it gets to the correct 
> result of 43,537 (good initial estimate!), as I'd expect.

Well, the estimate is an estimate, and may be far from the true value.
While it might not be helpful, if it's >= lower_bound and <= upper_bound,
then it's "working".  You can look at the bounds to see how wrong it
might be:

http://trac.xapian.org/wiki/FAQ/MoreAccurateEstimates
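
A minimal sketch of inspecting those bounds from the Python bindings. The helper name describe_estimate is hypothetical; it only assumes an MSet-like object offering get_matches_lower_bound(), get_matches_estimated() and get_matches_upper_bound(), as returned by Enquire.get_mset().

```python
def describe_estimate(mset):
    """Summarise how far the match estimate could be from the true count.

    `mset` is assumed to be a xapian.MSet (or any object exposing the
    same three get_matches_* methods).
    """
    lower = mset.get_matches_lower_bound()
    estimate = mset.get_matches_estimated()
    upper = mset.get_matches_upper_bound()
    return {
        "lower": lower,
        "estimate": estimate,
        "upper": upper,
        # When the bounds meet, the estimate is the exact match count.
        "exact": lower == upper,
        # Worst-case distance between the estimate and the true count.
        "max_error": max(estimate - lower, upper - estimate),
    }
```

If "exact" is false, a page could display something like "about 14,678" rather than presenting the estimate as a precise total.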

If you don't want it to be way out when there are 362 matches, setting
checkatleast to 363 or more will address that.  By default, Xapian
assumes you are more interested in getting the result fast than having
a very accurate estimate of how many there are.
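
As a rough illustration with the Python bindings, checkatleast is the third argument to Enquire.get_mset(). The helper fetch_page and its defaults are hypothetical; 20 results per page matches the pages described in the original post.

```python
def fetch_page(enquire, page, per_page=20, check_at_least=100):
    """Fetch one page of results, asking the matcher to check at least
    `check_at_least` matching documents.

    `enquire` is assumed to be a xapian.Enquire with its query already
    set; pages are numbered from 1.
    """
    first = (page - 1) * per_page
    # Positional arguments: first, maxitems, checkatleast.
    return enquire.get_mset(first, per_page, check_at_least)
```

Raising check_at_least above the largest result count you want reported exactly (363 for the query above) trades some speed for a trustworthy total.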

The particular problem here is that we don't have a good way to estimate
what proportion of a value range will match, so currently we just guess
arbitrarily that it will match half the documents it sees.

In 1.0 it's not easy to do better.  We could monitor what proportion of
the documents checked against the value range actually match, but
unfortunately using that information would need some big changes to how
things work.  Perhaps assuming that it matches 1/10 of the documents
would be better: in most cases, I suspect underestimating is better
than overestimating.

In 1.1, chert keeps bounds on the value and knows how many documents it
is set for, which could be used with the value range bounds to make an
estimate assuming an even spread of values, but we don't currently do
so.  The key thing needed is a function to efficiently calculate how far
through a string range a given string is (e.g. "b" is 0.5 of "a".."c").
Essentially, base 256 fixed point arithmetic...
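
A minimal sketch of that idea, not Xapian's implementation. All function names and the 8-byte precision are assumptions made for illustration: each string is read as a base-256 fraction, and the even-spread assumption turns the covered fraction into a document count.

```python
def as_fixed_point(s: bytes, width: int = 8) -> int:
    """Read the first `width` bytes of `s` as base-256 fractional digits,
    padding short strings with zero bytes."""
    return int.from_bytes(s[:width].ljust(width, b"\x00"), "big")


def position_in_range(value: bytes, low: bytes, high: bytes,
                      width: int = 8) -> float:
    """Return how far `value` lies through low..high, from 0.0 to 1.0."""
    lo, hi, v = (as_fixed_point(x, width) for x in (low, high, value))
    if hi <= lo:
        raise ValueError("high must sort after low within `width` bytes")
    v = min(max(v, lo), hi)  # clamp values that fall outside the range
    return (v - lo) / (hi - lo)


def estimate_range_matches(query_low: bytes, query_high: bytes,
                           value_low: bytes, value_high: bytes,
                           docs_with_value: int, width: int = 8) -> int:
    """Estimate matches for a value range query, assuming the stored
    values are spread evenly between value_low and value_high."""
    start = position_in_range(query_low, value_low, value_high, width)
    end = position_in_range(query_high, value_low, value_high, width)
    return round(docs_with_value * max(end - start, 0.0))
```

Here position_in_range(b"b", b"a", b"c") returns 0.5, matching the example above. Strings that differ only after the first `width` bytes are treated as equal, so the precision would need to grow for values sharing a long common prefix.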

Cheers,
    Olly
