Rusty Conover
2008-Jan-03 06:52 UTC
[Xapian-discuss] Question about synonyms and relevancy results.
Hi Guys,

Why does the use of synonyms decrease the relevancy of the returned results?

Running query 'Xapian::Query((Zserendip:(pos=1) AND Zjacket:(pos=2)))'
3 results found
Estimated matches: 3
ID 39896 100% Serendipity Jacket mens
ID 39947 98% Serendipity Hiking Jacket womens
ID 39964 90% Serendipity Jacket womens

But with synonyms the relevancy is decreased.

Running query 'Xapian::Query((Zserendip:(pos=1) AND (Zjacket:(pos=2) OR Zcoat:(pos=2) OR Zparka:(pos=2))))'
3 results found
Estimated matches: 3
ID 39947 72% Serendipity Hiking Jacket womens
ID 39896 64% Serendipity Jacket mens
ID 39964 58% Serendipity Jacket womens

Obviously this is because more terms are involved, but is this correct, or can it be disabled so that the synonyms count as just one term with regard to relevancy?

Normally I set a floor of a certain amount of relevancy before I present results to the user. Since synonyms decrease the relevancy, I may need to lower that floor, but doing so could lead to poor results being returned when it may make more sense to say no results were found.

Thanks,

Rusty

P.S. Olly, Wellington is a wonderful city that I've had the pleasure to visit many times now. I think you'll find it quite nice.
James Aylett
2008-Jan-03 16:18 UTC
[Xapian-discuss] Question about synonyms and relevancy results.
On Wed, Jan 02, 2008 at 11:51:47PM -0700, Rusty Conover wrote:
> Why does the use of synonyms decrease relevancy of the returned results?

Because the synonyms probably won't match documents that have the original terms (in the general case), so there's a lower proportion of terms in the query matching those documents.

You can tweak the weighting scheme to ignore the within-query frequency of a term when generating weights (and hence percentage relevancy) in the MSet: you want to set k3 to 0. This may have a larger effect on the relevance calculations than you expect (and may well change document ordering in the MSet), but may be worth playing with.

I suppose in theory we could have an operator that acts as OP_OR but returns the highest BM25 termweight or something (so the synonyms act as an expansion inside the query, rather than outside as at the moment), but I have no idea if that would be generally useful, or practical with respect to any of the optimisations we do.

J

-- 
James Aylett
xapian.org
james@tartarus.org
uncertaintydivision.org
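
The effect described above can be illustrated with a toy model. This is only a sketch under loud assumptions, not Xapian's actual percentage calculation: it treats a document's percentage as the share of the query's total term weight that the document matches. The function names and all weights are invented for illustration. The second function sketches the hypothetical "OP_OR that returns the highest term weight" idea, where a synonym group counts as a single term.

```python
# Toy model only: NOT Xapian's real scoring. All weights are made up.

def percent(doc_terms, query_term_weights):
    """Share of the query's total term weight that the document matches."""
    total = sum(query_term_weights.values())
    matched = sum(w for t, w in query_term_weights.items() if t in doc_terms)
    return 100.0 * matched / total


def percent_grouped(doc_terms, groups):
    """Each group of (term, weight) pairs counts as ONE term.

    The group contributes its best matching weight, measured against the
    best weight any member of the group could contribute.
    """
    total = sum(max(w for _, w in group) for group in groups)
    matched = sum(
        max((w for t, w in group if t in doc_terms), default=0.0)
        for group in groups
    )
    return 100.0 * matched / total


# A document that contains the original terms but none of the synonyms.
doc = {"serendip", "jacket", "mens"}

# Hypothetical term weights.
plain = {"serendip": 2.0, "jacket": 1.0}
expanded = {"serendip": 2.0, "jacket": 1.0, "coat": 1.0, "parka": 1.0}

plain_pct = percent(doc, plain)        # every query term matches
expanded_pct = percent(doc, expanded)  # "coat" and "parka" do not match

# Synonyms grouped so that they act as a single term.
groups = [
    [("serendip", 2.0)],
    [("jacket", 1.0), ("coat", 1.0), ("parka", 1.0)],
]
grouped_pct = percent_grouped(doc, groups)

print(plain_pct, expanded_pct, grouped_pct)
```

In this toy model the same document drops from a full match to a partial one as soon as unmatched synonyms are ORed in, and returns to a full match when the synonym group is scored as one term.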
Olly Betts
2008-Jan-03 17:21 UTC
[Xapian-discuss] Question about synonyms and relevancy results.
On Thu, Jan 03, 2008 at 04:18:15PM +0000, James Aylett wrote:
> I suppose in theory we could have an operator that acts as OP_OR but
> returns the highest BM25 termweight or something (so the synonyms act
> as an expansion inside the query, rather than outside as at the
> moment), but I have no idea if that would be generally useful, or
> practical with respect to any of the optimisations we do.

Richard is working on a new OP_SYNONYM operator on SVN branch opsynonym:

http://svn.xapian.org/branches/opsynonym/

See also:

http://www.xapian.org/cgi-bin/bugzilla/show_bug.cgi?id=50

OP_SYNONYM is like OP_OR except that the statistics are calculated as if all the sub-postlists were postings of the same term (some approximations are required to achieve this without the computations being prohibitively expensive).

All being well this will be in 1.1.0.

Cheers,

Olly
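
The idea of treating several sub-postlists "as postings of the same term" can be sketched in a few lines. This is a minimal illustration, not the code on the opsynonym branch: it merges the posting lists exactly, whereas the description above says the real operator uses approximations. The function name, document IDs, and wdf values are all hypothetical.

```python
from collections import Counter

# Illustrative sketch only: exact merging of toy posting lists.
# Each posting list maps doc_id -> wdf (within-document frequency).

def merge_postings(postlists):
    """Merge several posting lists into one pseudo-term's posting list.

    A document's wdf for the pseudo-term is the sum of its wdfs for the
    individual terms.
    """
    merged = Counter()
    for postings in postlists:
        merged.update(postings)  # Counter.update adds the counts
    return dict(merged)


# Hypothetical posting lists for three synonyms.
jacket = {1: 1, 2: 1, 3: 1}
coat = {4: 2}
parka = {2: 1, 5: 1}

merged = merge_postings([jacket, coat, parka])

# Term frequency of the pseudo-term: documents containing ANY synonym.
termfreq = len(merged)

print(merged, termfreq)
```

A weighting scheme fed these merged statistics would see one term with a single term frequency and per-document wdf, instead of three separate terms that each add to the query's maximum possible weight.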