Henry
2008-Dec-06 15:42 UTC
[Xapian-discuss] Revision 11671 cursory observations wrt sort performance
Hello all,

I've been trying out revision 11671 with Perl Search::Xapian and noted
the following interesting points:

a) The more items you add to MultiValueSorter() the slower it performs.
I presume this is to be expected, but this could be a target for
optimisation. I'm only using serialised numeric values to sort on.

On a small (~55k docs, ~562MB) test corpus I get the following
(Keywords==number of search items):

Keywords   Hits     Time   Sort columns
-----------------------------------------
1          11,600   0.17   0
1          11,600   0.58   1  (~241% slower)
1          11,600   0.82   2
1          11,600   1.17   3

2          15,527   0.23   0
2          15,527   0.77   1  (~234% slower)
2          15,527   1.05   2
2          15,527   1.54   3

As you can see, performance drops _precipitously_ as soon as you start
sorting on your own fields and not only the internal score.

btw, of no importance, but if you use MultiValueSorter() with no args,
the process consumes 99% of CPU and doesn't return.

b) *Is* Xapian sorting through all 11-15k results above? With
performance an issue when sorting, I wonder: I seem to vaguely recall
an index search approach which roughly did the following: since the
user will only ever possibly view (say) 1000 results, why bother
grinding through all 1 million results (or 10-15k in my tests above)
to sort, etc? ie, only gather and collate those results (say, 1000)
with the highest scores (or those which have a particular 'field'
above a certain threshold), discarding the rest, but still returning a
"hit" total of X for display/informational purposes only... or is
Xapian already doing this?

Cheers
Henry
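P.S. For reference, the sort setup looks roughly like this (a minimal
sketch, not my actual script: the database path, the query term and the
value slot numbers are made up, and the Search::Xapian calls are from
memory, so check them against the module's documentation):

  use strict;
  use warnings;
  use Search::Xapian;

  my $db  = Search::Xapian::Database->new('/path/to/index');  # made-up path
  my $enq = Search::Xapian::Enquire->new($db);
  $enq->set_query(Search::Xapian::Query->new('keyword'));

  # Documents were indexed with serialised numbers in value slots, e.g.
  #   $doc->add_value(0, Search::Xapian::sortable_serialise($number));

  # Each add() below is one more "sort column" in the timings above.
  my $sorter = Search::Xapian::MultiValueSorter->new();
  $sorter->add(0);   # primary sort key
  $sorter->add(1);   # secondary sort key
  $sorter->add(2);   # tertiary sort key
  $enq->set_sort_by_key($sorter, 1);

  my @matches = $enq->matches(0, 1000);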
Henry
2008-Dec-06 16:21 UTC
[Xapian-discuss] Revision 11671 cursory observations wrt sort performance
Quoting myself:

> b) *Is* Xapian sorting through all 11-15k results above? With
> performance an issue when sorting, I wonder: I seem to vaguely recall
> an index search approach which roughly did the following: since the
> user will only ever possibly view (say) 1000 results, why bother
> grinding through all 1 million results (or 10-15k in my tests above)
> to sort, etc? ie, only gather and collate those results (say, 1000)
> with the highest scores (or those which have a particular 'field'
> above a certain threshold), discarding the rest, but still returning a
> "hit" total of X for display/informational purposes only... or is
> Xapian already doing this?

Apologies for answering myself: if I understand the docs correctly,
Enquire::get_mset() looks like the way to go. It seems to estimate
totals without considering all matches, but calling the decision
function mdecider() for every match sounds expensive (in Perl at
least). I'll see tomorrow whether this is more efficient to use.

Regards
Henry
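P.S. Concretely, something along these lines is what I have in mind (a
rough, untested sketch: the path and query term are made up, the window
size of 1000 is arbitrary, and the calls should be checked against the
Search::Xapian docs):

  use strict;
  use warnings;
  use Search::Xapian;

  my $db  = Search::Xapian::Database->new('/path/to/index');  # made-up path
  my $enq = Search::Xapian::Enquire->new($db);
  $enq->set_query(Search::Xapian::Query->new('keyword'));

  # Only ask for the first 1000 matches rather than everything.
  my $mset = $enq->get_mset(0, 1000);

  # An estimated hit total for display purposes, without the matcher
  # having to examine every single match.
  printf "about %d hits\n", $mset->get_matches_estimated();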
Olly Betts
2008-Dec-07 22:35 UTC
[Xapian-discuss] Revision 11671 cursory observations wrt sort performance
On Sat, Dec 06, 2008 at 05:42:58PM +0200, Henry wrote:
> I've been trying out revision 11671 with Perl Search::Xapian and noted
> the following interesting points:

You don't seem to say whether you're using flint or chert, which is a
pretty important distinction.

Benchmarking flint on this stuff probably won't reveal anything very
interesting as we know it's slow when using a lot of values. The issues
which make flint slow here have been addressed in chert. The value
streams still need to be plumbed more directly to the matcher to take
full advantage of this though. So relative comparisons are probably
informative, but absolute ones aren't yet.

> a) The more items you add to MultiValueSorter() the slower it
> performs. I presume this is to be expected, but this could be a
> target for optimisation. I'm only using serialised numeric values to
> sort on.

I think this is to be expected - if you sort on two values, that's
twice as much data to fetch (less true for flint, which has to fetch
all the value data for a document together anyway, but it still
doesn't have to decode it all).

> On a small (~55k docs, ~562MB) test corpus I get the following
> (Keywords==number of search items):
>
> Keywords   Hits     Time   Sort columns
> -----------------------------------------
> 1          11,600   0.17   0
> 1          11,600   0.58   1  (~241% slower)
> 1          11,600   0.82   2
> 1          11,600   1.17   3
>
> 2          15,527   0.23   0
> 2          15,527   0.77   1  (~234% slower)
> 2          15,527   1.05   2
> 2          15,527   1.54   3

What does "sort columns" == 0 mean? Sort by relevance?

> As you can see, performance drops _precipitously_ as soon as you start
> sorting on your own fields and not only the internal score.
>
> btw, of no importance, but if you use MultiValueSorter() with no args,
> the process consumes 99% of CPU and doesn't return.

That'll be a bug then! Thanks for spotting it.

> b) *Is* Xapian sorting through all 11-15k results above? With
> performance an issue when sorting, I wonder: I seem to vaguely recall
> an index search approach which roughly did the following: since the
> user will only ever possibly view (say) 1000 results, why bother
> grinding through all 1 million results (or 10-15k in my tests above)
> to sort, etc? ie, only gather and collate those results (say, 1000)
> with the highest scores (or those which have a particular 'field'
> above a certain threshold), discarding the rest, but still returning a
> "hit" total of X for display/informational purposes only... or is
> Xapian already doing this?

I'm not totally sure what you're describing here. We currently use
nth_element() (which is O(n)) and then only sort the top documents, so
the sorting part is O(m log m) where m is the size of the requested
M-set. The code is in matcher/multimatch.cc if you want to look for
yourself.

Cheers,
    Olly
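P.S. In case that's unclear, the principle is roughly the following (a
throwaway Perl illustration of the idea only, not Xapian's actual code
- the real implementation is the C++ in matcher/multimatch.cc, where
nth_element() does the selection in O(n)):

  use strict;
  use warnings;

  # Keep at most $m of the best scores while scanning all of @scores;
  # anything that can't make the top $m is discarded, never sorted.
  sub top_m {
      my ($m, @scores) = @_;
      my @kept;   # the best scores seen so far, best first
      for my $s (@scores) {
          next if @kept == $m && $s <= $kept[-1];  # can't beat current worst
          @kept = sort { $b <=> $a } @kept, $s;    # insert the new score
          pop @kept if @kept > $m;                 # drop the displaced worst
      }
      return @kept;
  }

  # e.g. the best 10 of 15,000 scores
  my @top = top_m(10, map { rand() } 1 .. 15_000);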