On Sun, Nov 02, 2014 at 02:45:40PM -0800, Mir Siadaty wrote:
> A question regarding count of matching documents returned by
> Xapian;
> When I use OR to create a query of a term and some of its
> related synonyms, it appears to me xapian double-counts some docs.
> The extreme example is the count of query "term1" versus "term1
> OR term1". I assumed the two queries should return the same counts, but the
> second query returns twice.
> Have I missed anything?

I assume you're talking about the numbers returned by
MSet::get_matches_estimated()? If so, you should note the word
"estimated" in the method name.

In this case, the estimated number of documents matching a OR b is calculated
based on the assumption that a and b occur independently of one another
(and then it may get clamped based on information we get from actually
running the match).

The assumption of independence is clearly particularly untrue when a and
b are both the same term, but we don't currently try to detect this
special case. It could be done, though I think it would be more useful
to handle a broader class of situations via a form of
common-subexpression elimination (CSE).

Incidentally, it's not a double count in general - e.g. if you have 100
documents and term1 occurs in 50, the estimate for term1 OR term1 would
be 75. But when term frequency << collection size, it will be just
under double, and rounding may often make it exactly double.
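
To make the arithmetic concrete, here is a minimal sketch of the
independence-based estimate. It is only an illustration of the assumption
described above, not Xapian's actual code: the function name estimate_or
and its parameters are made up for this example, and the clamping step
is left out.

```python
def estimate_or(freq_a, freq_b, collection_size):
    """Estimate how many documents match `a OR b`, assuming a and b
    occur independently of one another (hypothetical helper)."""
    p_a = freq_a / collection_size   # probability a document contains a
    p_b = freq_b / collection_size   # probability a document contains b
    # P(a OR b) = P(a) + P(b) - P(a)P(b) when a and b are independent
    return collection_size * (p_a + p_b - p_a * p_b)


# term1 OR term1, with term1 in 50 of 100 documents
print(estimate_or(50, 50, 100))          # 75.0

# term frequency << collection size: just under double
print(estimate_or(10, 10, 1_000_000))
```

In the second call the result is just under 20, so rounding to a whole
number of documents gives exactly double the true count of 10.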

You may also find the discussion in the FAQ useful:
http://trac.xapian.org/wiki/FAQ/MoreAccurateEstimates

Cheers,
Olly