张少华
2017-Dec-15 07:10 UTC
How to get the serialise score returned in Xapian::KeyMaker->operator().
Hi all,

I am a Xapian user and have run into a problem. After using boolean terms to select candidate documents (still too many), we want to sort them with a self-defined function that is called from Xapian::KeyMaker::operator(). How can I get the serialised score back from the Xapian::MSetIterator object?

The C++ code looks like this:

    class SortKeyMaker : public Xapian::KeyMaker {
      public:
        std::string operator()(const Xapian::Document& doc) const {
            // self_defined_function() is our own scoring function.
            double score = self_defined_function(doc);
            // How can we get this value from the Xapian::MSetIterator?
            return Xapian::sortable_serialise(score);
        }
    };
Olly Betts
2017-Dec-16 22:11 UTC
How to get the serialise score returned in Xapian::KeyMaker->operator().
On Fri, Dec 15, 2017 at 03:10:48PM +0800, 张少华 wrote:
> After using boolean terms to select candidate documents (still
> too many), we want to sort them with a self-defined function that is
> called from Xapian::KeyMaker::operator(). How can I get the serialised
> score back from the Xapian::MSetIterator object?
>
> The C++ code looks like this:
>
> class SortKeyMaker : public Xapian::KeyMaker {
>     std::string operator()(const Xapian::Document& doc) const {
>         double score = self_defined_function(doc);
>         return Xapian::sortable_serialise(score); // How to get this value from Xapian::MSetIterator?
>     }

Unfortunately the sort key isn't currently exposed via the public API. It's available internally and it seems like it ought to be accessible, but there's no accessor method for it. I can add one, but that won't help for existing releases.

A possible workaround (and perhaps a better approach) would be to set BoolWeight as the weighting scheme, then feed in your score as a weight using a PostingSource. Then it's available via get_weight() on the MSetIterator object:

https://getting-started-with-xapian.readthedocs.io/en/latest/advanced/postingsource.html

You may find that's faster because it'll mean sorting by doubles instead of strings. Also, you'll get told the lowest weight that's of interest each time you need to calculate a score. If your scoring function is expensive, you might be able to skip documents if a cheap calculation can show they can't beat this lowest weight.

Cheers,
Olly
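The pruning idea in the last paragraph of the reply can be sketched without Xapian. The following is a minimal standalone illustration, not Xapian's matcher: Candidate, TopKResult and top_k_with_pruning() are hypothetical names, `upper_bound` stands in for the cheap calculation, and `score` stands in for the result of the expensive scoring function. It assumes the cheap bound is never lower than the real score.

```cpp
#include <cassert>
#include <cstddef>
#include <functional>
#include <queue>
#include <vector>

// Hypothetical candidate record (not a Xapian type).
struct Candidate {
    int id;
    double upper_bound;  // cheap estimate, assumed to be >= score
    double score;        // what the expensive scoring function would return
};

struct TopKResult {
    std::vector<double> scores;    // the k best scores, in ascending order
    std::size_t full_evaluations;  // how often the expensive step was needed
};

TopKResult top_k_with_pruning(const std::vector<Candidate>& docs, std::size_t k) {
    // Min-heap of the best scores so far; once it holds k entries, its top
    // is the lowest weight that is still of interest.
    std::priority_queue<double, std::vector<double>, std::greater<double>> best;
    std::size_t evaluations = 0;

    for (const Candidate& c : docs) {
        if (k == 0) break;
        // Cheap test: skip the document if even its upper bound cannot
        // beat the current lowest weight of interest.
        if (best.size() == k && c.upper_bound <= best.top()) continue;

        // Only now pay for the expensive score.
        assert(c.score <= c.upper_bound);
        ++evaluations;
        if (best.size() < k) {
            best.push(c.score);
        } else if (c.score > best.top()) {
            best.pop();
            best.push(c.score);
        }
    }

    TopKResult result;
    result.full_evaluations = evaluations;
    while (!best.empty()) {
        result.scores.push_back(best.top());
        best.pop();
    }
    return result;
}
```

With a PostingSource the same test would use the minimum weight that the matcher passes in, rather than a heap maintained by the caller. How much it saves depends on how tight the cheap bound is.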
Olly Betts
2017-Dec-18 22:40 UTC
How to get the serialise score returned in Xapian::KeyMaker->operator().
On Sat, Dec 16, 2017 at 10:11:40PM +0000, Olly Betts wrote:
> Unfortunately the sort key isn't currently exposed via the public API.
> It's available internally and it seems like it ought to be accessible
> but there's no accessor method for it - I can add one but that won't
> help for existing releases.

I've added MSetIterator::get_sort_key() to master in 9f807b83ab61a943a355a9ff6733299eab8e6bb1, and backported it to the RELEASE/1.4 branch in 93ea6216fe8141d6223c869c6bccb039414db0fa, so this should be in 1.4.6 when that's released.

Cheers,
Olly
张少华
2018-Jan-11 09:11 UTC
use xapian.Query.OP_VALUE_RANGE or use xapian.MatchDecider?
Hi,

We have an index database of about 20 million products. We indexed the title and description of each product as terms (posting lists), and stored some properties in value slots, such as the price, comment count, production date and click count.

Now we want to select the products that satisfy specific conditions, for example: contains the terms "shirt" and "white", and "price <= 500", "comment count >= 100" and "1000 <= click_number <= 2000".

We have two methods:

1. Use xapian.Query for the terms and xapian.Query.OP_VALUE_RANGE to filter on the values.
2. Use xapian.Query for the terms to get candidates, then use xapian.MatchDecider to filter on the values.

Which method gives better performance?
Olly Betts
2018-Jan-12 05:33 UTC
use xapian.Query.OP_VALUE_RANGE or use xapian.MatchDecider?
On Thu, Jan 11, 2018 at 05:11:20PM +0800, 张少华 wrote:
> Hi, we have an index database of about 20 million products. We indexed
> the title and description of each product as terms (posting lists),
> and stored some properties in value slots, such as the price,
> comment count, production date and click count.
>
> Now we want to select the products that satisfy specific conditions,
> for example: contains the terms "shirt" and "white", and "price <= 500",
> "comment count >= 100" and "1000 <= click_number <= 2000".
>
> We have two methods:
> 1. Use xapian.Query for the terms and xapian.Query.OP_VALUE_RANGE to
>    filter on the values.
> 2. Use xapian.Query for the terms to get candidates, then use
>    xapian.MatchDecider to filter on the values.
>
> Which method gives better performance?

Use OP_VALUE_RANGE for a range check on a value. Then the matcher actually knows what check is being performed, which means it can optimise better.

That's probably doubly true if you're using Xapian via one of the bindings (which I'm guessing you are from "xapian.Query"), since a MatchDecider subclass in that language will require a call between languages for every candidate document considered, and that's likely to be significantly slower than staying within C++. The matcher will try to call it as little as it can, but in cases where a lot of documents match without the filter but the filter rejects most of them, it may need to call it millions of times if you have 20 million documents.

Cheers,
Olly
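One detail behind OP_VALUE_RANGE: the range check compares the stored value strings byte by byte, so numeric properties such as the price are normally stored with Xapian::sortable_serialise(), whose output sorts in the same order as the numbers. The sketch below illustrates that ordering property without using Xapian. encode_sortable() and in_value_range() are hypothetical names, and the byte layout is a stand-in, not Xapian's actual format. It assumes IEEE 754 doubles and rejects NaN.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <string>

static_assert(sizeof(double) == sizeof(std::uint64_t), "assumes 64-bit doubles");

// Hypothetical stand-in for Xapian::sortable_serialise(): NOT Xapian's byte
// format, only an encoding whose byte order matches numeric order.
std::string encode_sortable(double value) {
    assert(!std::isnan(value));  // NaN has no place in a total order
    std::uint64_t bits;
    std::memcpy(&bits, &value, sizeof bits);
    const std::uint64_t sign = std::uint64_t{1} << 63;
    if (bits & sign) {
        bits = ~bits;  // negative: invert so larger magnitudes sort first
    } else {
        bits |= sign;  // non-negative: set top bit so it sorts after negatives
    }
    // Emit most significant byte first so string comparison sees it first.
    std::string out(8, '\0');
    for (std::size_t i = 8; i-- > 0;) {
        out[i] = static_cast<char>(bits & 0xFF);
        bits >>= 8;
    }
    return out;
}

// A range filter expressed purely as string comparisons on encoded values,
// which is the kind of check a value range query performs.
bool in_value_range(const std::string& stored, double low, double high) {
    return encode_sortable(low) <= stored && stored <= encode_sortable(high);
}
```

For example, in_value_range(encode_sortable(1500.0), 1000.0, 2000.0) corresponds to the "1000 <= click_number <= 2000" condition from the question. In real code the stored values and range endpoints would come from Xapian::sortable_serialise() instead.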
张少华
2018-Jan-22 16:55 UTC
How to get the serialise score returned in Xapian::KeyMaker->operator().
> A possible workaround (and perhaps a better approach) would be to
> set BoolWeight as the weighting scheme, then feed in your score as
> a weight using a PostingSource. Then it's available via get_weight()
> on the MSetIterator object:
>
> https://getting-started-with-xapian.readthedocs.io/en/latest/advanced/postingsource.html
>
> You may find that's faster because it'll mean sorting by doubles instead
> of strings.

We implemented our score function using PostingSource instead of KeyMaker, with reference to your Python example and the Xapian source code. A simple demo is here:

https://github.com/xiangqianzsh/xapian_leaning/blob/master/postingsource/ExternalWeightPostingSource.h

But we found that using PostingSource is slower than using KeyMaker. I think the reason may be this: we use only one Xapian::Query, built from the PostingSource, and the upper bound of our get_weight() cannot do anything useful with a single PostingSource. So some optimisations don't take effect and just cost time instead. What do you think about this?

Also, we found that the BM25 algorithm is fast in Xapian, so we are considering whether we can modify our get_weight() function to fit into the BM25 algorithm. If so, the term frequency of a document would need to be of type double. I am wondering whether it would work to simply re-typedef Xapian::termcount as double. Would that have a negative impact elsewhere in the Xapian source?

Thanks.