Olly Betts <olly at survex.com> wrote:
> Typically you'd respond to
> that by calling reopen() on the database and retrying the search from
> the start, or at least from a point from which you want consistency.
Thanks, OK, so nothing new here.
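Something like this is what I have in mind for the retry (a minimal
sketch; the error class assumes bindings which throw Xapian::Error
objects, e.g. Xapian::DatabaseModifiedError):
my $mset = eval { $enq->get_mset(0, 1000) };
if ($@) { # DB revision was replaced while we were reading
    die $@ unless ref($@) && $@->isa('Xapian::DatabaseModifiedError');
    $xdb->reopen; # pick up the latest committed revision
    $mset = $enq->get_mset(0, 1000); # retry from the start
}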
> On Thu, Aug 17, 2023 at 09:28:26PM +0000, Eric Wong wrote:
> > I do some expensive processing on each mset window, so I always
> > limit the results to limit heap usage even if I'm planning on
> > going through a big chunk of the DB:
> >
> > $mset = $enq->get_mset(0, 1000);
> > do_something_slow_with_mset($mset);
> > $mset = $enq->get_mset(1000, 1000);
> > do_something_slow_with_mset($mset);
> > $mset = $enq->get_mset(2000, 1000);
> > do_something_slow_with_mset($mset);
>
> While the match is running, get_mset(2000, 1000) needs to track
> 3000 entries so this won't reduce your heap usage (at least not
> peak usage).
>
> Is the heap usage problematic?
Yes, roughly ~1.3GB (in a Perl process) for ~17 million (and
growing) docs in the worst case of a search returning everything.
Those numbers appear in line with the 88 bytes w/ 64-bit libstdc++
you noted (17M * 88 bytes is roughly 1.4GB).
I realize the offset can cause consistency problems, but when
dumping mboxes over HTTP(S) it's fairly common for clients to
abort prematurely while the offset is still low and the limit
small, so the windowed approach saves work there.
So a compromise for large streaming downloads could be:
$mset = $enq->get_mset(0, 1000);
do_something_slow_with_mset($mset);
# is reader still alive? get the rest:
$mset = $enq->get_mset(1000, $xdb->get_doccount);
do_something_slow_with_mset($mset);
Batch/admin jobs can just use:
$mset = $enq->get_mset(0, $xdb->get_doccount);
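Wrapped up, that compromise might look like (a sketch; the
stream_results name and callback convention are hypothetical, and
as you note the second get_mset still has to track 1000 +
get_doccount entries while the match runs):
# small first window so an early client abort costs little,
# then the remainder in one big window
sub stream_results {
    my ($enq, $xdb, $cb) = @_;
    my $mset = $enq->get_mset(0, 1000);
    $cb->($mset) or return; # $cb returns false once the reader is gone
    $mset = $enq->get_mset(1000, $xdb->get_doccount);
    $cb->($mset);
}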
> Looking at the code, for git master each entry is currently:
>
>     double weight;
>     Xapian::docid did;
>     Xapian::doccount collapse_count;
>     std::string collapse_key;
>     std::string sort_key;
<snip>
> If you're using libstdc++ on a 64-bit architecture, std::string is 32
> bytes so that's 80 bytes (1.4.x currently has equivalent fields but in
> a different order which incurs 8 bytes of padding to give 88 bytes -
> I'll adjust that as this is an internal structure so we can reorder it
> without affecting the public ABI). With libc++ on a 64-bit
> architecture, std::string is 24 bytes so it'll be 64 bytes total (or
> 72 currently for 1.4).
<snip>
> If this structure was dynamically sized it could be as little as just
> 4 bytes per entry for a boolean search, or 12 for a search without
> collapsing or sorting on a key (though at least x86-64 wants to align
> a double on an 8 byte boundary which means 4 bytes of padding per
> entry - that could be avoided by splitting into separate arrays).
Yeah, it seems separate arrays would be appropriate, since
collapsing isn't commonly used AFAIK.
Also, could weight be a 32-bit float instead of a 64-bit double?
Sidenote: thanks for noting libc++, I didn't know about it.
> > I'm not reusing Xapian::Enquire objects right now since the
> > original code was made for HTML pagination and there's no
> > guarantee subsequent pages would even hit the same HTTP process.
> >
> > Now with local batch reports and streaming dumps, reusing the
> > Xapian::Enquire object might make sense if duplicates (or skips)
> > can be avoided on DBs where another process is writing to it.
>
> Generally the robust way to handle paging across a potentially changing
> dataset is to specify the page start based on the data which determines
> the order rather than saying "from 1000 results in", but I don't think
> we offer a way to use this approach currently.
>
> You'd probably need to be able to tell Enquire the relevance weight and
> document id for the last entry you got, and the search results would
> start at the next document with a relevance weight <= that (and if
> equal, with document id > that). That works even if that document
> has been deleted in the meantime. When sorting by key you'd need to
> specify that too.
So like ->set_cutoff but in the opposite direction?
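Until something like that exists, I suppose it could be
approximated client-side after a reopen by rerunning the match and
skipping already-seen entries (a sketch; resume_after is a
hypothetical helper, iterator usage per the Search::Xapian-style
bindings):
# keep only entries below the last (weight, docid) already emitted,
# per your description: weight <= last, and if equal, docid > last
sub resume_after {
    my ($mset, $last_wt, $last_did) = @_;
    my @docids;
    for (my $it = $mset->begin; $it != $mset->end; $it++) {
        my ($wt, $did) = ($it->get_weight, $it->get_docid);
        next if $wt > $last_wt || ($wt == $last_wt && $did <= $last_did);
        push @docids, $did;
    }
    \@docids;
}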
Thanks.