Henry
2008-Dec-06 15:42 UTC
[Xapian-discuss] Revision 11671 cursory observations wrt sort performance
Hello all,

I've been trying out revision 11671 with Perl Search::Xapian and noted
the following interesting points:

a) The more items you add to MultiValueSorter() the slower it performs.
I presume this is to be expected, but this could be a target for
optimisation. I'm only using serialised numeric values to sort on.

On a small (~55k docs, ~562MB) test corpus I get the following
(Keywords==number of search items):

Keywords   Hits     Time   Sort columns
-----------------------------------------
1          11,600   0.17   0
1          11,600   0.58   1  (~241% slower)
1          11,600   0.82   2
1          11,600   1.17   3

2          15,527   0.23   0
2          15,527   0.77   1  (~234% slower)
2          15,527   1.05   2
2          15,527   1.54   3

As you can see, performance drops _precipitously_ as soon as you start
sorting on your own fields and not only the internal score.

btw, of no importance, but if you use MultiValueSorter() with no args,
the process consumes 99% of CPU and doesn't return.

b) *Is* Xapian sorting through all 11-15k results above? With
performance an issue when sorting, I wonder: I seem to vaguely recall
an index search approach which roughly did the following: since the
user will only ever possibly view (say) 1000 results, why bother
grinding through all 1 million results (or 10-15k in my tests above)
to sort, etc? ie, only gather and collate those results (say, 1000)
with the highest scores (or those which have a particular 'field'
above a certain threshold), discarding the rest, but still returning a
"hit" total of X for display/informational purposes only... or is
Xapian already doing this?

Cheers
Henry
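P.S. For reference, the sort setup looks roughly like this (a minimal
sketch, not my actual script: the database path, the query term and the
value slot numbers are made up, and the Search::Xapian calls are from
memory, so check them against the module's documentation):

  use strict;
  use warnings;
  use Search::Xapian;

  my $db  = Search::Xapian::Database->new('/path/to/index');  # made-up path
  my $enq = Search::Xapian::Enquire->new($db);
  $enq->set_query(Search::Xapian::Query->new('keyword'));

  # Documents were indexed with serialised numbers in value slots, e.g.
  #   $doc->add_value(0, Search::Xapian::sortable_serialise($number));

  # Each add() below is one more "sort column" in the timings above.
  my $sorter = Search::Xapian::MultiValueSorter->new();
  $sorter->add(0);   # primary sort key
  $sorter->add(1);   # secondary sort key
  $sorter->add(2);   # tertiary sort key
  $enq->set_sort_by_key($sorter, 1);

  my @matches = $enq->matches(0, 1000);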
Henry
2008-Dec-06 16:21 UTC
[Xapian-discuss] Revision 11671 cursory observations wrt sort performance
Quoting myself:

> b) *Is* Xapian sorting through all 11-15k results above? With
> performance an issue when sorting, I wonder: I seem to vaguely recall
> an index search approach which roughly did the following: since the
> user will only ever possibly view (say) 1000 results, why bother
> grinding through all 1 million results (or 10-15k in my tests above)
> to sort, etc? ie, only gather and collate those results (say, 1000)
> with the highest scores (or those which have a particular 'field'
> above a certain threshold), discarding the rest, but still returning a
> "hit" total of X for display/informational purposes only... or is
> Xapian already doing this?

Apologies for answering myself: if I understand the docs correctly,
Enquire::get_mset() looks like the way to go. It seems to estimate
totals without considering all matches, but calling the decision
function mdecider() for every match sounds expensive (in Perl at
least). I'll see tomorrow whether this is more efficient to use.

Regards
Henry
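P.S. Concretely, something along these lines is what I have in mind (a
rough, untested sketch: the path and query term are made up, the window
size of 1000 is arbitrary, and the calls should be checked against the
Search::Xapian docs):

  use strict;
  use warnings;
  use Search::Xapian;

  my $db  = Search::Xapian::Database->new('/path/to/index');  # made-up path
  my $enq = Search::Xapian::Enquire->new($db);
  $enq->set_query(Search::Xapian::Query->new('keyword'));

  # Only ask for the first 1000 matches rather than everything.
  my $mset = $enq->get_mset(0, 1000);

  # An estimated hit total for display purposes, without the matcher
  # having to examine every single match.
  printf "about %d hits\n", $mset->get_matches_estimated();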
Olly Betts
2008-Dec-07 22:35 UTC
[Xapian-discuss] Revision 11671 cursory observations wrt sort performance
On Sat, Dec 06, 2008 at 05:42:58PM +0200, Henry wrote:
> I've been trying out revision 11671 with Perl Search::Xapian and noted
> the following interesting points:

You don't seem to say whether you're using flint or chert, which is a
pretty important distinction.

Benchmarking flint on this stuff probably won't reveal anything very
interesting as we know it's slow when using a lot of values. The issues
which make flint slow here have been addressed in chert. The value
streams still need to be plumbed more directly to the matcher to take
full advantage of this though. So relative comparisons are probably
informative, but absolute ones aren't yet.

> a) The more items you add to MultiValueSorter() the slower it
> performs. I presume this is to be expected, but this could be a
> target for optimisation. I'm only using serialised numeric values to
> sort on.

I think this is to be expected - if you sort on two values, that's
twice as much data to fetch (less true for flint, which has to fetch
all the value data for a document together anyway, but it still
doesn't have to decode it all).

> On a small (~55k docs, ~562MB) test corpus I get the following
> (Keywords==number of search items):
>
> Keywords   Hits     Time   Sort columns
> -----------------------------------------
> 1          11,600   0.17   0
> 1          11,600   0.58   1  (~241% slower)
> 1          11,600   0.82   2
> 1          11,600   1.17   3
>
> 2          15,527   0.23   0
> 2          15,527   0.77   1  (~234% slower)
> 2          15,527   1.05   2
> 2          15,527   1.54   3

What does "sort columns" == 0 mean? Sort by relevance?

> As you can see, performance drops _precipitously_ as soon as you start
> sorting on your own fields and not only the internal score.
>
> btw, of no importance, but if you use MultiValueSorter() with no args,
> the process consumes 99% of CPU and doesn't return.

That'll be a bug then! Thanks for spotting it.

> b) *Is* Xapian sorting through all 11-15k results above? With
> performance an issue when sorting, I wonder: I seem to vaguely recall
> an index search approach which roughly did the following: since the
> user will only ever possibly view (say) 1000 results, why bother
> grinding through all 1 million results (or 10-15k in my tests above)
> to sort, etc? ie, only gather and collate those results (say, 1000)
> with the highest scores (or those which have a particular 'field'
> above a certain threshold), discarding the rest, but still returning a
> "hit" total of X for display/informational purposes only... or is
> Xapian already doing this?

I'm not totally sure what you're describing here. We currently use
nth_element() (which is O(n)) and then only sort the top documents, so
the sorting part is O(m log m) where m is the size of the requested
M-set. The code is in matcher/multimatch.cc if you want to look for
yourself.

Cheers,
    Olly
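P.S. In case that's unclear, the principle is roughly the following (a
throwaway Perl illustration of the idea only, not Xapian's actual code
- the real implementation is the C++ in matcher/multimatch.cc, where
nth_element() does the selection in O(n)):

  use strict;
  use warnings;

  # Keep at most $m of the best scores while scanning all of @scores;
  # anything that can't make the top $m is discarded, never sorted.
  sub top_m {
      my ($m, @scores) = @_;
      my @kept;   # the best scores seen so far, best first
      for my $s (@scores) {
          next if @kept == $m && $s <= $kept[-1];  # can't beat current worst
          @kept = sort { $b <=> $a } @kept, $s;    # insert the new score
          pop @kept if @kept > $m;                 # drop the displaced worst
      }
      return @kept;
  }

  # e.g. the best 10 of 15,000 scores
  my @top = top_m(10, map { rand() } 1 .. 15_000);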