Hello, is there a way to optimize sorting by certain values
for queries which return a huge amount of results?

For example, I just want a simple query that gives me the 200
most recent emails out of millions.

The elapsed time for get_mset increases as the number of
documents ($n * 2000) increases.  I suppose I could store a
pre-sorted set using SQLite or similar.

Thanks in advance for any advice/help you can provide.

-----------8<--------
#!/usr/bin/perl -w
use strict;
use warnings;
use Search::Xapian::Document;
use Search::Xapian qw/:standard/;
use Search::Xapian::WritableDatabase;
use File::Temp qw(tempdir);
use Time::HiRes qw(clock_gettime CLOCK_MONOTONIC);

my $tmp = tempdir('xapian-test-XXXXXXX', CLEANUP => 1, TMPDIR => 1);
my $flag = Search::Xapian::DB_CREATE_OR_OPEN;
my $xdb = Search::Xapian::WritableDatabase->new($tmp, $flag);
my $n = shift || 100;

for my $i (0..$n) {
	$xdb->begin_transaction;
	for my $j (0..2000) {
		my $doc = Search::Xapian::Document->new;
		my $num = Search::Xapian::sortable_serialise(($i * 1000) + $j);
		$doc->add_value(0, $num);
		$doc->set_data("$i $j");
		$doc->add_boolean_term('T' . 'mail');
		$xdb->add_document($doc);

		$doc = Search::Xapian::Document->new;
		$doc->add_value(0, $num);
		$doc->set_data("$i $j");
		$doc->add_boolean_term('T' . 'ghost');
		$xdb->add_document($doc);
	}
	$xdb->commit_transaction;
}

my $enquire = Search::Xapian::Enquire->new($xdb);
my $mail_query = Search::Xapian::Query->new('T' . 'mail');
$enquire->set_query($mail_query);
$enquire->set_sort_by_value_then_relevance(0, 1);
my $offset = 0;
my $limit = 200;
my $t0 = clock_gettime(CLOCK_MONOTONIC);
my $mset = $enquire->get_mset($offset, $limit);
my $t1 = clock_gettime(CLOCK_MONOTONIC);
my $elapsed = $t1 - $t0;
$xdb = undef;
$tmp = undef;
print $elapsed, "\n";
__END__

On Fri, Mar 30, 2018 at 05:21:43PM +0000, Eric Wong wrote:
> Hello, is there a way to optimize sorting by certain values
> for queries which return a huge amount of results?
[...]
>   $enquire->set_sort_by_value_then_relevance(0, 1);

If you're just wanting the 200 newest, it'll be faster not to calculate
weights, so:

    $enquire->set_sort_by_value(0, 1);
    $enquire->set_weighting_scheme(Search::Xapian::BoolWeight->new);

For me, this drops the time from ~0.075 seconds to ~0.067 seconds (with
xapian-core 1.4.5).

If I use xapian git master (still using the glass backend) then it's
~0.051 seconds with weights and ~0.045 seconds without.  If I use the
new (but still in development) honey backend it's ~0.049 and ~0.044
seconds.

But even 0.075 seconds doesn't really seem "slow" to me.  What times
are you seeing?  If it's much slower, I'd make sure you're at least
using the latest 1.4.x release.

If you do want faster, the simplest solution is to arrange that the
document id order matches the document age order, and then you can
specify to just sort by that:

    $enquire->set_weighting_scheme(Search::Xapian::BoolWeight->new);
    $enquire->set_docid_order(Search::Xapian::ENQ_DESCENDING);

That's more like 0.053 seconds for 1.4.5 and 0.021 seconds for git
master with glass.

The reverse order (ENQ_ASCENDING) is really fast - about 0.0001 seconds.
This is because in that case we can just stop once we've found 200
matches.

Cheers,
    Olly

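The difference between the two cases can be shown with a toy model. The sketch below is not Xapian's matcher, and the function names are made up for illustration. It only assumes that a posting list yields docids in ascending order, so the first 200 matches in docid order can be taken directly, while the 200 largest values can only be known after every match has been examined:

```python
import heapq


def first_k_in_docid_order(postings, k):
    """Take the first k matches from an ascending posting list, then stop.

    Returns (docids, number_of_postings_examined).
    """
    examined = 0
    out = []
    for docid in postings:
        examined += 1
        out.append(docid)
        if len(out) == k:
            break  # early termination: later postings cannot rank higher
    return out, examined


def top_k_by_value(postings, value_of, k):
    """Keep the k matches with the largest value using a bounded min-heap.

    Every posting must be examined, because any later document could
    carry a larger value than those seen so far.
    Returns (docids sorted by descending value, number_of_postings_examined).
    """
    examined = 0
    heap = []  # min-heap of (value, docid); heap[0] is the weakest candidate
    for docid in postings:
        examined += 1
        item = (value_of(docid), docid)
        if len(heap) < k:
            heapq.heappush(heap, item)
        elif item > heap[0]:
            heapq.heapreplace(heap, item)
    ranked = sorted(heap, reverse=True)
    return [docid for _, docid in ranked], examined
```

In this model the first function's work depends only on k, while the second grows linearly with the number of matching documents, which matches the shape of the timings reported in the thread.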
Olly Betts <olly at survex.com> wrote:
> On Fri, Mar 30, 2018 at 05:21:43PM +0000, Eric Wong wrote:
> > Hello, is there a way to optimize sorting by certain values
> > for queries which return a huge amount of results?
> [...]
> >   $enquire->set_sort_by_value_then_relevance(0, 1);
>
> If you're just wanting the 200 newest, it'll be faster not to calculate
> weights, so:
>
>     $enquire->set_sort_by_value(0, 1);
>     $enquire->set_weighting_scheme(Search::Xapian::BoolWeight->new);
>
> For me, this drops the time from ~0.075 seconds to ~0.067 seconds (with
> xapian-core 1.4.5).

Thanks, I can see how that helps.

> But even 0.075 seconds doesn't really seem "slow" to me.  What times
> are you seeing?  If it's much slower, I'd make sure you're at least
> using the latest 1.4.x release.

Roughly what you saw with $n = 100 (the default in my sample
script).  The problem is time increases with DB size.  Setting
$n to 1000 makes it roughly 0.750s.

> If you do want faster, the simplest solution is to arrange that the
> document id order matches the document age order, and then you can
> specify to just sort by that:
>
>     $enquire->set_weighting_scheme(Search::Xapian::BoolWeight->new);
>     $enquire->set_docid_order(Search::Xapian::ENQ_DESCENDING);

That would be tricky with emails being delivered out-of-order;
not to mention old archives being imported + indexed.

> That's more like 0.053 seconds for 1.4.5 and 0.021 seconds for git
> master with glass.
>
> The reverse order (ENQ_ASCENDING) is really fast - about 0.0001 seconds.
> This is because in that case we can just stop once we've found 200
> matches.

So that sounds like it's O(1) and independent of how many
documents are in the mset?

Would it be possible to teach Xapian to optimize its storage for
certain queries so it can stop once it's found 200 matches?

From what I recall, SQL implementations are pretty good at that.
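
For comparison, here is a minimal sketch of the SQLite approach mentioned at the start of the thread, using Python's sqlite3 module. The table name `msg`, its columns, and the index name `msg_kind_ts` are hypothetical and chosen only to mirror the test script (a 'mail' or 'ghost' kind plus a sortable timestamp). It assumes a composite index on (kind, ts), which lets SQLite read the newest rows straight from the index instead of sorting all matches:

```python
import random
import sqlite3

QUERY = "SELECT docid, ts FROM msg WHERE kind = ? ORDER BY ts DESC LIMIT ?"


def build_db(n_docs):
    """Create an in-memory table with n_docs 'mail' and n_docs 'ghost' rows."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE msg ("
        " docid INTEGER PRIMARY KEY,"
        " kind TEXT NOT NULL,"
        " ts INTEGER NOT NULL)"
    )
    # Composite index: equality on kind, then ordered by ts.
    conn.execute("CREATE INDEX msg_kind_ts ON msg (kind, ts)")
    # Insert timestamps in shuffled order, so docid order does not
    # match age order (as with out-of-order delivery or archive imports).
    ts_values = list(range(n_docs))
    random.Random(0).shuffle(ts_values)
    rows = [(kind, ts) for ts in ts_values for kind in ("mail", "ghost")]
    conn.executemany("INSERT INTO msg (kind, ts) VALUES (?, ?)", rows)
    conn.commit()
    return conn


def newest(conn, kind, limit=200):
    """Return the `limit` newest (docid, ts) rows of the given kind."""
    return conn.execute(QUERY, (kind, limit)).fetchall()


def query_plan(conn, kind, limit=200):
    """Return the detail strings of EXPLAIN QUERY PLAN for the query."""
    cur = conn.execute("EXPLAIN QUERY PLAN " + QUERY, (kind, limit))
    return [row[-1] for row in cur]
```

Printing `query_plan(conn, "mail")` is a quick way to check that the index is being used and that no separate sort step appears. Whether keeping such a table in sync with the Xapian index is worth the extra bookkeeping is a separate question.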