On Thu, Aug 17, 2023 at 09:28:26PM +0000, Eric Wong wrote:
> In other words, is it possible to avoid duplicates if new
> documents are inserted into the DB by another process in-between
> ->get_mset calls when reusing Xapian::Enquire objects?

The Database object itself effectively does (it works in a snapshot of
the state of the database when you open it, or last called reopen(),
which updates that snapshot to what's currently committed).

However, we don't currently have any locking of the snapshots that
readers are using, so changes made to the database will eventually
invalidate the snapshot - when that happens you'll get a
Xapian::DatabaseModifiedError exception.  Typically you'd respond to
that by calling reopen() on the database and retrying the search from
the start, or at least from a point from which you want consistency.

> I do some expensive processing on each mset window, so I always
> limit the results to limit heap usage even if I'm planning on
> going through a big chunk of the DB:
>
>   $mset = $enq->get_mset(0, 1000);
>   do_something_slow_with_mset($mset);
>   $mset = $enq->get_mset(1000, 1000);
>   do_something_slow_with_mset($mset);
>   $mset = $enq->get_mset(2000, 1000);
>   do_something_slow_with_mset($mset);

While the match is running, get_mset(2000, 1000) needs to track
3000 entries, so this won't reduce your heap usage (at least not
peak usage).

Is the heap usage problematic?

Looking at the code, for git master each entry is currently:

    double weight;
    Xapian::docid did;
    Xapian::doccount collapse_count;
    std::string collapse_key;
    std::string sort_key;

We're always going to need the docid, but the other fields aren't
always needed, and this could be slimmed down depending on what options
are in use if the size is causing problems.  It is as it is just for
simplicity, really.
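The reopen-and-retry pattern described above can be sketched as follows.  This is illustration only, not the real Xapian bindings: `DatabaseModifiedError` and `reopen()` mirror the real Xapian names, but `FakeDatabase`, its `search()` method, and `search_with_retry()` are made-up stand-ins.

```python
class DatabaseModifiedError(Exception):
    """Raised when a writer has invalidated our read snapshot."""

class FakeDatabase:
    """Stand-in for a database handle: stale at first, fresh after reopen()."""
    def __init__(self):
        self.stale = True          # pretend a writer invalidated our snapshot

    def reopen(self):
        self.stale = False         # refresh snapshot to what's now committed

    def search(self, query, first, maxitems):
        if self.stale:
            raise DatabaseModifiedError()
        return [f"{query}:{i}" for i in range(first, first + maxitems)]

def search_with_retry(db, query, first, maxitems, max_attempts=3):
    """On DatabaseModifiedError, reopen() and retry from the start."""
    for _ in range(max_attempts):
        try:
            return db.search(query, first, maxitems)
        except DatabaseModifiedError:
            db.reopen()            # refresh the snapshot, then retry
    raise RuntimeError("database kept changing under us")
```

With these stand-ins, `search_with_retry(FakeDatabase(), "q", 0, 3)` hits the stale snapshot once, reopens, and then returns the three results.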
If you're using libstdc++ on a 64-bit architecture, std::string is 32
bytes, so that's 80 bytes per entry (1.4.x currently has equivalent
fields but in a different order which incurs 8 bytes of padding to give
88 bytes - I'll adjust that, as this is an internal structure so we can
reorder it without affecting the public ABI).  With libc++ on a 64-bit
architecture, std::string is 24 bytes, so it'll be 64 bytes total (or
72 currently for 1.4).

For 32-bit architectures, std::string is 24 bytes for libstdc++ or 12
bytes for libc++, so the total size is 64 or 40 bytes (probably without
any padding overhead in 1.4.x, but I can't trivially check that).

If this structure was dynamically sized it could be as little as just
4 bytes per entry for a boolean search, or 12 for a search without
collapsing or sorting on a key (though at least x86-64 wants to align
a double on an 8 byte boundary, which means 4 bytes of padding per
entry - that could be avoided by splitting into separate arrays).

> I'm not reusing Xapian::Enquire objects right now since the
> original code was made for HTML pagination and there's no
> guarantee subsequent pages would even hit the same HTTP process.
>
> Now with local batch reports and streaming dumps, reusing the
> Xapian::Enquire object might make sense if duplicates (or skips)
> can be avoided on DBs where another process is writing to it.

Generally the robust way to handle paging across a potentially changing
dataset is to specify the page start based on the data which determines
the order, rather than saying "from 1000 results in", but I don't think
we offer a way to use this approach currently.

You'd probably need to be able to tell Enquire the relevance weight and
document id for the last entry you got, and the search results would
start at the next document with a relevance weight <= that (and if
equal, with document id > that).  That works even if that document
has been deleted in the meantime.  When sorting by key you'd need to
specify that too.
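The resume-after-(weight, docid) idea can be sketched with plain tuples; nothing here is real Xapian API, and `ranked()`/`resume_after()` are hypothetical names.  Results are ordered by weight descending then docid ascending, and the next page starts at documents with weight below the anchor, or equal weight and higher docid:

```python
def ranked(results):
    """Sort (docid, weight) pairs by weight descending, docid ascending."""
    return sorted(results, key=lambda dw: (-dw[1], dw[0]))

def resume_after(results, last_docid, last_weight):
    """Return ranked results that come strictly after the anchor entry.

    Works even if the anchor document has since been deleted, since we
    only compare against its (weight, docid), not look it up."""
    return [(did, w) for did, w in ranked(results)
            if w < last_weight or (w == last_weight and did > last_docid)]

# Page 1 over the original snapshot:
docs = [(1, 2.0), (2, 3.0), (3, 2.0), (4, 1.0)]
page1 = ranked(docs)[:2]                 # docs 2 and 1; anchor is doc 1
anchor_did, anchor_w = page1[-1]

# A writer deletes doc 1 and inserts doc 5 between page fetches:
docs2 = [(2, 3.0), (3, 2.0), (4, 1.0), (5, 2.5)]
page2 = resume_after(docs2, anchor_did, anchor_w)
# page2 continues from doc 3: no duplicates, and the newly inserted
# doc 5 (which ranks above the anchor) is consistently skipped.
```

The key property is that the continuation point is defined by the ordering data itself, so concurrent inserts and deletes can't cause repeats.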
This would also have the advantage that it would only need to track the
number of entries it's actually been asked to return (not sure about
when collapsing - I'd need to think more about how that would be
implemented in this case).

> Neither query parsing nor setting up the Enquire object seems
> to take a measurable amount of time compared to the work that
> needs to be done with the $mset.

Yeah, Enquire objects are very cheap to construct.

Cheers,
    Olly
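The memory advantage mentioned above can be illustrated with a bounded heap (made-up code, not Xapian internals): an offset-based match must track `first + maxitems` candidates, while resuming from a `(weight, docid)` anchor only ever tracks `maxitems`, because earlier-ranked documents are filtered out before they enter the candidate heap.

```python
import heapq

def top_k(results, k, anchor=None):
    """Select the top k of (docid, weight) pairs using at most k slots.

    anchor=(last_weight, last_docid) excludes everything ranked at or
    before the anchor, so a resumed page needs only k tracked entries."""
    heap = []                          # min-heap of (sort_key, docid)
    for did, w in results:
        if anchor is not None:
            lw, ld = anchor
            if w > lw or (w == lw and did <= ld):
                continue               # ranked before/at the anchor: skip
        key = (w, -did)                # higher weight first, lower docid first
        if len(heap) < k:
            heapq.heappush(heap, (key, did))
        else:
            heapq.heappushpop(heap, (key, did))   # heap never exceeds k
    return [did for key, did in sorted(heap, reverse=True)]
```

For example, `top_k(results, 1000, anchor=...)` holds at most 1000 entries however deep into the result set the anchor is, whereas emulating `get_mset(2000, 1000)` needs 3000.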
Olly Betts <olly at survex.com> wrote:
> Typically you'd respond to
> that by calling reopen() on the database and retrying the search from
> the start, or at least from a point from which you want consistency.

Thanks, OK, so nothing new here.

> On Thu, Aug 17, 2023 at 09:28:26PM +0000, Eric Wong wrote:
> > I do some expensive processing on each mset window, so I always
> > limit the results to limit heap usage even if I'm planning on
> > going through a big chunk of the DB:
> >
> >   $mset = $enq->get_mset(0, 1000);
> >   do_something_slow_with_mset($mset);
> >   $mset = $enq->get_mset(1000, 1000);
> >   do_something_slow_with_mset($mset);
> >   $mset = $enq->get_mset(2000, 1000);
> >   do_something_slow_with_mset($mset);
>
> While the match is running, get_mset(2000, 1000) needs to track
> 3000 entries so this won't reduce your heap usage (at least not
> peak usage).
>
> Is the heap usage problematic?

Yes, roughly ~1.3GB (in a Perl process) for ~17 million (and growing)
docs in the worst case of a search returning everything.  Those numbers
appear in line with the 88 bytes w/ 64-bit libstdc++ you noted.

I realize the offset can cause problems, but premature aborts are
fairly common when dumping mboxes over HTTP(S) while the offset is
still low and the limit small.  So a compromise for large streaming
downloads could be:

  $mset = $enq->get_mset(0, 1000);
  do_something_slow_with_mset($mset);

  # is reader still alive?
  # get the rest:
  $mset = $enq->get_mset(1000, $xdb->get_doccount);
  do_something_slow_with_mset($mset);

Batch/admin jobs can just use:

  $mset = $enq->get_mset(0, $xdb->get_doccount);

> Looking at the code, for git master each entry is currently:
>
>     double weight;
>     Xapian::docid did;
>     Xapian::doccount collapse_count;
>     std::string collapse_key;
>     std::string sort_key;

<snip>

> If you're using libstdc++ on a 64-bit architecture, std::string is 32
> bytes so that's 80 bytes (1.4.x currently has equivalent fields but in
> a different order which incurs 8 bytes of padding to give 88 bytes -
> I'll adjust that as this is an internal structure so we can reorder it
> without affecting the public ABI).  With libc++ on a 64-bit
> architecture, std::string is 24 bytes so it'll be 64 bytes total (or
> 72 currently for 1.4).

<snip>

> If this structure was dynamically sized it could be as little as just
> 4 bytes per entry for a boolean search, or 12 for a search without
> collapsing or sorting on a key (though at least x86-64 wants to align
> a double on an 8 byte boundary which means 4 bytes of padding per
> entry - that could be avoided by splitting into separate arrays).

Yeah, it seems separate arrays would be appropriate, since collapse
isn't commonly used AFAIK.  Also, could weight be a 32-bit float
instead of a 64-bit double?

sidenote: thanks for noting libc++, I didn't know about it
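The sizes discussed in the thread can be reproduced with ctypes, modelling `std::string` as a 32-byte pointer-aligned blob (its size with 64-bit libstdc++; 24 bytes with libc++).  The field names mirror the quoted Xapian struct, but the exact 1.4.x ordering below is a guess chosen to exhibit the 8 bytes of padding; the float32 check at the end speaks to Eric's question.  Assumes a 64-bit platform.

```python
import ctypes
import struct

class Str(ctypes.Structure):
    """Stand-in for libstdc++'s 32-byte, 8-byte-aligned std::string."""
    _fields_ = [("ptr", ctypes.c_void_p), ("buf", ctypes.c_char * 24)]

class EntryMaster(ctypes.Structure):        # git master order: no padding
    _fields_ = [("weight", ctypes.c_double),          # offset 0
                ("did", ctypes.c_uint32),             # offset 8
                ("collapse_count", ctypes.c_uint32),  # offset 12
                ("collapse_key", Str),                # offset 16
                ("sort_key", Str)]                    # offset 48 -> 80 total

class EntryOld(ctypes.Structure):           # 1.4.x-like order (guessed)
    _fields_ = [("did", ctypes.c_uint32),             # 4-byte hole follows
                ("weight", ctypes.c_double),          # needs 8-byte alignment
                ("collapse_key", Str),
                ("collapse_count", ctypes.c_uint32),  # another 4-byte hole
                ("sort_key", Str)]                    # -> 88 total

assert ctypes.sizeof(EntryMaster) == 80
assert ctypes.sizeof(EntryOld) == 88        # field order costs 8 bytes

# A 32-bit float halves the weight's footprint, but only carries about
# 7 significant decimal digits, so nearby weights (made-up values here)
# can become indistinguishable, turning close rankings into ties:
def as_f32(x):
    return struct.unpack('f', struct.pack('f', x))[0]

assert struct.calcsize('f') == 4 and struct.calcsize('d') == 8
assert 7.12345678 != 7.12345679             # distinct as doubles
assert as_f32(7.12345678) == as_f32(7.12345679)   # equal as float32
```

So a float weight would save 4 bytes per entry (plus possibly padding), at the cost of more weight ties, which the docid tiebreak would then have to resolve.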
> Generally the robust way to handle paging across a potentially changing
> dataset is to specify the page start based on the data which determines
> the order rather than saying "from 1000 results in", but I don't think
> we offer a way to use this approach currently.
>
> You'd probably need to be able to tell Enquire the relevance weight and
> document id for the last entry you got, and the search results would
> start at the next document with a relevance weight <= that (and if
> equal, with document id > that).  That works even if that document
> has been deleted in the meantime.  When sorting by key you'd need to
> specify that too.

So like ->set_cutoff but in the opposite direction?  Thanks.
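The "opposite direction" comparison can be made concrete with a toy filter over an already-ranked weight list (plain Python, not real Xapian API): a weight cutoff as set by `set_cutoff` discards everything scoring below a bound, while the proposed resume point keeps exactly what a cutoff at the anchor's weight would discard.

```python
ranked_weights = [5.0, 4.0, 3.0, 2.0, 1.0]   # already in descending order

def apply_cutoff(weights, bound):
    """Like a weight cutoff: drop results scoring below the bound."""
    return [w for w in weights if w >= bound]

def resume_below(weights, bound):
    """The opposite direction: continue from below the bound.
    (Exclusive here; in the real proposal, ties at the bound would be
    broken by docid rather than dropped.)"""
    return [w for w in weights if w < bound]

assert apply_cutoff(ranked_weights, 3.0) == [5.0, 4.0, 3.0]
assert resume_below(ranked_weights, 3.0) == [2.0, 1.0]
```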