Hi,

I've been using Xapian for a while, but there is a scenario that I'm not sure is already supported.

Suppose:
1. Raw query: how to make pizza
2. Parsed query: how AND to AND make AND pizza
3. Documents:
   d1: how to make pizza at home
   d2: 3 ways to make pizza
   d3: make pizza in 4 easy steps

Questions:
1. When searching, how can I retrieve d2 and d3 (although they don't contain "how to")?
2. Furthermore, how can I make sure the score of d1 is higher than that of d2 or d3 (because d1 does contain "how to")?

Many thanks

On Thu, Oct 30, 2014 at 11:37:15AM +0800, Lu Zhen wrote:
> I've been using Xapian for a while. But there is a scene I don't know
> whether supported already.
>
> Suppose:
> 1. Raw query: how to make pizza
> 2. Parsed query: how AND to AND make AND pizza
> 3. Documents:
> d1: how to make pizza at home
> d2: 3 ways to make pizza
> d3: make pizza in 4 easy steps
>
> Question:
> 1. During searching process, how to retrieve d2, d3 (although they don't
> contain "how to")?

Set the default operator in the QueryParser to OP_OR instead of OP_AND:

http://xapian.org/docs/apidoc/html/classXapian_1_1QueryParser.html#a2efe48be88c4872afec4bc963f417ea5

The default is actually OP_OR (for historical reasons, though this will
probably get changed at some point), so you're presumably currently
setting this to OP_AND explicitly.

Or you could set "how" and "to" as stopwords, but that fails your
second requirement below.

> 2. Even more, how to make sure the score of d1 is higher than d2 or d3
> (because d1 does contain "how to")?

OP_OR sums the weight contributions from each term present, so this
will generally be the case.

(Strictly speaking, if you want a 100% guarantee, you'll need to pick a
weighting scheme and parameters which will ensure that is always the
case - I think BM25 with default parameters doesn't give this, but
you'd probably have to create an artificial test case to see d1 not
rank higher so I wouldn't worry about it myself.)

Cheers,
Olly
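
To make the AND versus OR behaviour described above concrete, here is a minimal pure-Python sketch over the three example documents. It is a toy model, not Xapian code: the helper names (build_index, search_and, search_or) are made up for illustration, and every matching term contributes a flat weight of 1.0 rather than a BM25 weight.

```python
from collections import defaultdict

# The three example documents from the question.
DOCS = {
    "d1": "how to make pizza at home",
    "d2": "3 ways to make pizza",
    "d3": "make pizza in 4 easy steps",
}


def build_index(docs):
    """Build a toy inverted index: term -> set of document ids."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for term in text.lower().split():
            index[term].add(doc_id)
    return index


def search_and(index, query):
    """AND semantics: a document must contain every query term."""
    postings = [index.get(term, set()) for term in query.lower().split()]
    return sorted(set.intersection(*postings)) if postings else []


def search_or(index, query):
    """OR semantics: any term matches; per-term weights are summed.

    Assumption: each term is worth a flat 1.0. A real weighting
    scheme such as BM25 gives different, term-dependent values.
    """
    scores = defaultdict(float)
    for term in query.lower().split():
        for doc_id in index.get(term, ()):
            scores[doc_id] += 1.0
    # Highest score first; ties broken by document id.
    return sorted(scores.items(), key=lambda item: (-item[1], item[0]))


INDEX = build_index(DOCS)
AND_HITS = search_and(INDEX, "how to make pizza")
OR_HITS = search_or(INDEX, "how to make pizza")
print(AND_HITS)
print(OR_HITS)
```

Under this toy weighting, AND returns only d1, while OR returns all three documents with d1 ranked first because it matches the most query terms.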

Thanks for your reply.

But if we change the default operator to OP_OR, searching would be much slower when there are lots of documents.

I was wondering whether there is a way to label some terms of the query as "optional", so that merging the inverted lists would ignore these terms, but they would still count during ranking (when calculating scores).

2014-11-03 13:17 GMT+08:00 Olly Betts <olly at survex.com>:
> Set the default operator in the QueryParser to OP_OR instead of OP_AND:
>
> http://xapian.org/docs/apidoc/html/classXapian_1_1QueryParser.html#a2efe48be88c4872afec4bc963f417ea5
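
The "optional terms" idea can be sketched in pure Python as follows. This is only an illustration of the semantics being asked for, not Xapian code: search_required_optional and the extra document d4 are hypothetical, and each term again contributes a flat weight of 1.0. Xapian's query operators do include OP_AND_MAYBE, which matches the first subquery and takes extra weight from the others; whether it addresses the speed concern on a large index is not something this sketch can show.

```python
from collections import defaultdict

DOCS = {
    "d1": "how to make pizza at home",
    "d2": "3 ways to make pizza",
    "d3": "make pizza in 4 easy steps",
    # Hypothetical extra document: matches only the optional terms.
    "d4": "how to bake bread",
}


def build_index(docs):
    """Build a toy inverted index: term -> set of document ids."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for term in text.lower().split():
            index[term].add(doc_id)
    return index


def search_required_optional(index, required, optional):
    """Required terms decide which documents match; optional terms
    only add weight to documents that already matched.

    Assumption: every matching term contributes a flat 1.0.
    """
    postings = [index.get(term, set()) for term in required]
    candidates = set.intersection(*postings) if postings else set()
    # Every candidate contains all required terms.
    scores = {doc_id: float(len(required)) for doc_id in candidates}
    for term in optional:
        # Optional postings are consulted only for existing candidates.
        for doc_id in index.get(term, set()) & candidates:
            scores[doc_id] += 1.0
    # Highest score first; ties broken by document id.
    return sorted(scores.items(), key=lambda item: (-item[1], item[0]))


INDEX = build_index(DOCS)
HITS = search_required_optional(INDEX, ["make", "pizza"], ["how", "to"])
print(HITS)
```

Here d4 is never returned, because the candidate set comes only from the postings of "make" and "pizza", while d1 still ranks above d2 and d3 thanks to the optional terms it contains.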