Olly Betts <olly at survex.com> wrote:> On Fri, Mar 30, 2018 at 05:21:43PM +0000, Eric Wong wrote: > > Hello, is there a way to optimize sorting by certain values > > for queries which return a huge amount of results? > [...] > > $enquire->set_sort_by_value_then_relevance(0, 1); > > If you're just wanting the 200 newest, it'll be faster not to calculate > weights, so: > > $enquire->set_sort_by_value(0, 1); > $enquire->set_weighting_scheme(new Xapian::BoolWeight()); > > For me, this drops the time from ~0.075 seconds to ~0.067 seconds (with > xapian-core 1.4.5).Thanks, I can see how that helps.> But even 0.075 seconds doesn't really seem "slow" to me. What times > are you seeing? If it's much slower, I'd make sure you're at least > using the latest 1.4.x release.Roughly what you saw with $n = 100 (the default in my sample script). The problem is time increases with DB size. Setting $n to 1000 makes it roughly 0.750s.> If you do want faster, the simplest solution is to arrange that the > document id order matches the document age order, and then you can > specify to just sort by that: > > $enquire->set_weighting_scheme(new Xapian::BoolWeight()); > $enquire->set_docid_order(Search::Xapian::ENQ_DESCENDING);That would be tricky with emails being delivered out-of-order; not to mention old archives being imported + indexed.> That's more like 0.053 seconds for 1.4.5 and 0.021 seconds for git > master with glass. > > The reverse order (ENQ_ASCENDING) is really fast - about 0.0001 seconds. > This is because in that case we can just stop once we've found 200 > matches.So that sounds like it's O(1) and independent of how many documents are in the mset? Would it be possible to teach Xapian to optimize its storage for certain queries so it can stop once it's found 200 matches?>From what I recall, SQL implementations are pretty good at that.
On Sat, Mar 31, 2018 at 12:58:19AM +0000, Eric Wong wrote:> Olly Betts <olly at survex.com> wrote: > > If you're just wanting the 200 newest, it'll be faster not to calculate > > weights, so: > > > > $enquire->set_sort_by_value(0, 1); > > $enquire->set_weighting_scheme(new Xapian::BoolWeight()); > > > > For me, this drops the time from ~0.075 seconds to ~0.067 seconds (with > > xapian-core 1.4.5). > > Thanks, I can see how that helps. > > > But even 0.075 seconds doesn't really seem "slow" to me. What times > > are you seeing? If it's much slower, I'd make sure you're at least > > using the latest 1.4.x release. > > Roughly what you saw with $n = 100 (the default in my sample > script). The problem is time increases with DB size. Setting > $n to 1000 makes it roughly 0.750s.That's what I'd expect - currently sorting by value will tend to a linear limit as the total number of matches increases.> > If you do want faster, the simplest solution is to arrange that the > > document id order matches the document age order, and then you can > > specify to just sort by that: > > > > $enquire->set_weighting_scheme(new Xapian::BoolWeight()); > > $enquire->set_docid_order(Search::Xapian::ENQ_DESCENDING); > > That would be tricky with emails being delivered out-of-order; > not to mention old archives being imported + indexed.This was the trick we used with search.gmane.org. We mostly ignored the archive import issue there, which wasn't ideal but a periodic reindex cleaned it up. You can probably assume that out-of-order delivery won't be more than a certain amount out of order (and that small infelicities in date order aren't a big deal, especially as "Date:" headers may come from skewed clocks and threads give you ordering information that can contradict "Date:" header order).> > That's more like 0.053 seconds for 1.4.5 and 0.021 seconds for git > > master with glass. > > > > The reverse order (ENQ_ASCENDING) is really fast - about 0.0001 seconds. > > This is because in that case we can just stop once we've found 200 > > matches. > > So that sounds like it's O(1) and independent of how many > documents are in the mset?It won't be independent of the number of documents in the MSet - something needs to be done for each document added to the MSet, so clearly it'll be at least O(n) in that, but if n is fixed at 200 then the O() complexity isn't relevant. It'll be fairly independent of the total number of documents that match (and so of $n in your script).> Would it be possible to teach Xapian to optimize its storage for > certain queries so it can stop once it's found 200 matches? > From what I recall, SQL implementations are pretty good at that.Probably - e.g. tracking some sort of buckets for values would help here as well as for some other uses of values. Cheers, Olly
> > Olly Betts <olly at survex.com> wrote: > > > > > > The reverse order (ENQ_ASCENDING) is really fast - about 0.0001 seconds. > > > This is because in that case we can just stop once we've found 200 > > > matches.With a few million documents, that ENQ_ASCENDING sounds promising :) So, it looks like if I had ideal ordering, I could do something along the lines of: my $doc_id = $db->get_metadata('last_doc_id') || 0xffffffff; $db->replace_document($doc_id--, $_) foreach (@doc); $db->set_metadata('last_doc_id', $doc_id); And get killer performance. Olly Betts <olly at survex.com> wrote:> On Sat, Mar 31, 2018 at 12:58:19AM +0000, Eric Wong wrote: > > Would it be possible to teach Xapian to optimize its storage for > > certain queries so it can stop once it's found 200 matches? > > From what I recall, SQL implementations are pretty good at that. > > Probably - e.g. tracking some sort of buckets for values would help here > as well as for some other uses of values.Alright. I'll keep an eye out for that in coming years. I've moved some queries (OVER/XOVER) to SQLite which is already a dependency of public-inbox (sometimes it needs to sort by article number, sometimes it sorts by timestamp).