张少华
2018-Jan-22 16:55 UTC
How to get the serialise score returned in Xapian::KeyMaker->operator().
> A possible workaround (and perhaps a better approach) would be to
> set BoolWeight as the weighting scheme, then feed in your score as
> a weight using a PostingSource. Then it's available via get_weight()
> on the MSetIterator object:
>
> https://getting-started-with-xapian.readthedocs.io/en/latest/advanced/postingsource.html
>
> You may find that's faster because it'll mean sorting by doubles instead
> of strings.

We implemented our scoring function with a PostingSource instead of a KeyMaker, following your Python example and the Xapian source code. A simple demo is here:

https://github.com/xiangqianzsh/xapian_leaning/blob/master/postingsource/ExternalWeightPostingSource.h

However, we found that the PostingSource is slower than the KeyMaker. I think the reason may be this: we use only one Xapian::Query built from a PostingSource, and the upper bound on our get_weight() cannot help with a single PostingSource. So some optimisations do not apply, and attempting them only costs time. What do you think?

Also, we found that the BM25 algorithm is fast in Xapian, so we are wondering whether we could change our get_weight() function to fit into the BM25 algorithm. If so, the term frequency of a document would need to be a double. Would it work to simply re-typedef Xapian::termcount to double? Would that have a negative impact elsewhere in the Xapian source?

Thanks.
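To make the "sorting by doubles instead of strings" remark concrete, here is a minimal sketch of what a string sort key for a double can look like. The function name order_preserving_key() and its encoding are hypothetical and chosen only for illustration; this is not the byte format produced by Xapian::sortable_serialise(), and it assumes IEEE 754 64-bit doubles.

```cpp
#include <cassert>
#include <cstdint>
#include <cstring>
#include <string>

static_assert(sizeof(double) == 8, "sketch assumes 64-bit IEEE 754 doubles");

// Hypothetical encoding for illustration only; NOT Xapian's
// sortable_serialise() format. Maps a double to an 8-byte string whose
// bytewise order matches the numeric order of the inputs (except that
// -0.0 sorts just before +0.0).
std::string order_preserving_key(double value)
{
    assert(value == value);  // NaN cannot be placed in a total order

    std::uint64_t bits;
    std::memcpy(&bits, &value, sizeof bits);

    const std::uint64_t sign = std::uint64_t{1} << 63;
    if (bits & sign) {
        bits = ~bits;   // negative: invert everything so larger magnitude sorts first
    } else {
        bits |= sign;   // non-negative: set the top bit so it sorts after negatives
    }

    // Emit big-endian so the most significant byte is compared first.
    std::string key(8, '\0');
    for (int i = 0; i < 8; ++i) {
        key[i] = static_cast<char>((bits >> (56 - 8 * i)) & 0xFF);
    }
    return key;
}
```

Sorting by such keys means a bytewise string comparison per pair of documents, whereas sorting by weight compares two doubles directly; whether that difference matters in practice depends on the workload.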
Olly Betts
2018-Jan-24 02:46 UTC
How to get the serialise score returned in Xapian::KeyMaker->operator().
On Tue, Jan 23, 2018 at 12:55:31AM +0800, 张少华 wrote:
> We realise our score function using PostingSource instead of using
> KeyMaker, we reference your python example and source code of xapian,
> the simple demo is here.
> https://github.com/xiangqianzsh/xapian_leaning/blob/master/postingsource/ExternalWeightPostingSource.h

I'd just put the get_weight() and get_maxweight() implementations into your ExternalWeightPostingSource class - the WeightSource class doesn't seem to serve a useful purpose and just adds virtual method call overheads (and those can add up, unless the compiler can devirtualise the calls, which compilers are getting better at).

In the python example, the WeightSource class is just meant to be a placeholder for "some source of weights" - it isn't meant to be a literal recommendation for how to write such a class.

> But we found that using PostingSource is more slower than KeyMaker.

What's the relative speed difference you're seeing?

> I think the reason maybe: We only use one Xapian::Query of
> PostingSource and the upper bound of our get_weight() can not work on
> a single PostingSource. So some optimizing don't work, but waste
> time oppositely. How do you think about this?

If I follow, you're saying your query is just an ExternalWeightPostingSource object? If so, what is the query in the KeyMaker case?

I'd expect a KeyMaker to also be fairly slow if the query is Query::MatchAll or similar, as the sort key will need building for every matching document, like how the PostingSource will need to calculate the weight for every matching document.

> Also, We found the BM25 algorithm is fast in xapian, so we think if we
> can modify our get_weight() function to adjust the BM25 algorithm. If
> so, the type of termfreq of document should be double. I am wondering
> if it works just re-typedef Xapian::termcount to double? Does it has a
> negative impact on other place of xapian source.

It'll stop it compiling, which is fairly negative.

Xapian::termcount needs to be an unsigned integer, and there are assertions to that effect you'd hit. I'd think it would be a significant project to change that.

Implementing your weighting as a Xapian::Weight subclass is a potential option if it works as a sum of weight components from the terms in the query. But if you need to make Xapian::termcount a floating point type to do it then I suspect this isn't a good approach.

Cheers,
Olly
张少华
2018-Jan-30 16:30 UTC
How to get the serialise score returned in Xapian::KeyMaker->operator().
> What's the relative speed difference you're seeing?

I have written a demo to compare the performance of PostingSource and KeyMaker. You can see the details at this link:

https://github.com/xiangqianzsh/xapian_leaning/tree/master/compare_keymaker_and_postingsource

We generate 30 million documents. When searching, we first use one term (for example t1) to select some documents, and then sort them in descending order by 0.5 * doc.get_value(1), i.e. 0.5 * score. In this comparison, the PostingSource version takes 6 times as long as the KeyMaker version.

> If I follow, you're saying your query is just this an ExternalWeightPostingSource object?

When we use ExternalWeightPostingSource, we first use Query(const std::string &term, Xapian::termcount wqf=1, Xapian::termpos pos=0) to select some documents, and use Query(Xapian::PostingSource *source) to sort them. We join the two with Xapian::Query(Xapian::Query::OP_AND_MAYBE, query, query_extwps).

We get the weight of each document from the get_weight() function, but we cannot estimate the maximum weight without checking all the documents, so we set max_weight to a large number.

Even if we could estimate the upper bound of get_weight() in our case, it might still not help unless enough documents have a weight exactly equal to the upper bound. For example, if the upper bound is 1 and you want the top 10 documents, but only 9 documents have weight 1, you still have to check all the documents to pick the top 10. That's why I wrote 'a single PostingSource' in my last email.
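The top-10 argument above can be sketched with a simplified model. This is not Xapian's matcher, whose pruning is more involved; the function docs_examined() is hypothetical and only counts how many document weights a naive top-k scan must look at before the declared upper bound lets it stop.

```cpp
#include <cassert>
#include <cstddef>
#include <functional>
#include <queue>
#include <vector>

// Hypothetical model, not Xapian code. Scans weights in document order and
// keeps the k best in a min-heap. The scan may stop early only once the
// k-th best weight seen so far has reached max_weight, because no later
// document could then displace anything in the top k.
// Returns the number of documents whose weight had to be examined.
std::size_t docs_examined(const std::vector<double>& weights,
                          std::size_t k, double max_weight)
{
    assert(k > 0);
    std::priority_queue<double, std::vector<double>, std::greater<double>> top;
    std::size_t examined = 0;
    for (double w : weights) {
        ++examined;
        if (top.size() < k) {
            top.push(w);
        } else if (w > top.top()) {
            top.pop();      // drop the current k-th best
            top.push(w);
        }
        // Early exit: the weakest of the current top k already hits the bound.
        if (top.size() == k && top.top() >= max_weight) break;
    }
    return examined;
}
```

In this model, with an upper bound of 1 and k = 10, a collection in which only 9 documents reach weight 1 forces a full scan, while a tenth document at weight 1 allows the scan to stop as soon as it is seen. Setting max_weight to an arbitrarily large number has the same effect as having no bound at all.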