On Fri, Aug 18, 2023 at 10:41:52AM +0000, Eric Wong wrote:
> Olly Betts <olly at survex.com> wrote:
> > While the match is running, get_mset(2000, 1000) needs to track
> > 3000 entries so this won't reduce your heap usage (at least not
> > peak usage).
> >
> > Is the heap usage problematic?
>
> Yes, roughly ~1.3GB (in a Perl process) for ~17 million (and
> growing) docs in the worst case of a search returning everything.
> Those numbers appear in line with the 88 bytes w/ 64-bit libstdc++
> you noted.
I suppose for an mbox export you may not be too bothered about order
(or are happy to have the raw order be that in which messages were
added), in which case we only need to track the docid, so that could
be just 4 bytes per result - for ~17 million documents that's ~65MB.
Incidentally, if you don't mind the export order and only have single
term queries, you can just use a PostingIterator to get a stream of
document ids matching a particular term (in the order documents were
added), which should use at most ~80KB (per shard if you're using a
sharded database).
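
A minimal C++ sketch of that (the database path and term name here
are invented - use whatever term your documents are actually indexed
under):

    #include <xapian.h>
    #include <iostream>

    int main() {
        Xapian::Database db("/path/to/db");
        const std::string term = "XFOLDERinbox"; // made-up term
        // Stream matching docids in the order documents were added;
        // no MSet is built, so memory use stays small however many
        // documents match.
        for (Xapian::PostingIterator it = db.postlist_begin(term);
             it != db.postlist_end(term); ++it) {
            std::cout << *it << '\n';
        }
    }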
> > If this structure was dynamically sized it could be as little as just
> > 4 bytes per entry for a boolean search, or 12 for a search without
> > collapsing or sorting on a key (though at least x86-64 wants to align
> > a double on an 8 byte boundary which means 4 bytes of padding per
> > entry - that could be avoided by splitting into separate arrays).
>
> Yeah, it seems separate arrays would be appropriate since collapse
> isn't commonly used AFAIK.
I think that would probably be a git-master-only change.  I've
reimplemented the matcher since 1.4 and it's simpler to do now because
those 88 byte structures don't move during the match - we used to have
a heap of them, but now we keep them still and have a heap of offsets
instead (which is only created once we need to start evicting
candidates, so it never gets built if your MSet contains all matching
results).
We do sort at the end of the match to give the correct MSet order, so
that sort would need to handle reordering the elements of each of the
split arrays in step.
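
To sketch the shape of that idea (purely illustrative - this is not
Xapian's actual matcher code, just the split-arrays-plus-lazy-heap
structure described above):

    #include <algorithm>
    #include <cstdint>
    #include <vector>

    class Candidates {
        std::vector<uint32_t> docids;   // 4 bytes per entry
        std::vector<double> weights;    // 8 bytes per entry
        std::vector<uint32_t> heap;     // offsets; worst on top
        size_t limit;

        bool worse(uint32_t a, uint32_t b) const {
            // "Greater" in heap order means lower weight, so the
            // heap's front is always the next candidate to evict.
            return weights[a] > weights[b];
        }

    public:
        explicit Candidates(size_t limit_) : limit(limit_) {}

        void add(uint32_t docid, double weight) {
            auto cmp = [this](uint32_t a, uint32_t b) {
                return worse(a, b);
            };
            if (docids.size() < limit) {
                docids.push_back(docid);
                weights.push_back(weight);
                return;
            }
            if (heap.empty()) {
                // First eviction: build the heap of offsets lazily.
                for (uint32_t i = 0; i < docids.size(); ++i)
                    heap.push_back(i);
                std::make_heap(heap.begin(), heap.end(), cmp);
            }
            uint32_t worst = heap.front();
            if (weight <= weights[worst]) return;
            std::pop_heap(heap.begin(), heap.end(), cmp);
            // Overwrite the evicted candidate in place; the parallel
            // arrays themselves never get reordered during the match.
            docids[worst] = docid;
            weights[worst] = weight;
            heap.back() = worst;
            std::push_heap(heap.begin(), heap.end(), cmp);
        }
    };

The extra work for the final sort is then permuting docids and
weights together, e.g. by sorting an array of offsets and applying
that permutation to each array.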
> Also, could weight be a 32-bit float instead of 64-bit double?
This is the structure used during the match, so that would really
require changing to 32-bit float basically everywhere or else you'll get
inconsistencies from the rounding. Not impossible, but it seems a
rather invasive change.
> > Generally the robust way to handle paging across a potentially
> > changing dataset is to specify the page start based on the data
> > which determines the order rather than saying "from 1000 results
> > in", but I don't think we offer a way to use this approach
> > currently.
> >
> > You'd probably need to be able to tell Enquire the relevance
> > weight and document id for the last entry you got, and the search
> > results would start at the next document with a relevance weight
> > <= that (and if equal, with document id > that).  That works even
> > if that document has been deleted in the meantime.  When sorting
> > by key you'd need to specify that too.
>
> So like ->set_cutoff but in the opposite direction?
Yes, but with the addition of a docid check for when we're exactly on
the weight threshold.
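
In code terms the resume test would be something like this
(hypothetical - after_cursor is an invented name and Xapian has no
such API today):

    #include <xapian.h>

    // last_wt/last_did describe the final entry of the page the
    // client already has.
    static bool after_cursor(double wt, Xapian::docid did,
                             double last_wt, Xapian::docid last_did) {
        if (wt != last_wt)
            return wt < last_wt; // strictly lower weight: next page
        return did > last_did;   // same weight: break ties by docid
    }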
Cheers,
Olly