On Thu, Aug 17, 2023 at 09:28:26PM +0000, Eric Wong wrote:
> In other words, is it possible to avoid duplicates if new
> documents are inserted into the DB by another process in-between
> ->get_mset calls when reusing Xapian::Enquire objects?
The Database object itself effectively does this (it works on a snapshot of
the state of the database from when you opened it, or from when you last
called reopen(), which updates that snapshot to what's currently committed).
However we don't currently have any locking of the snapshots that
readers are using so changes made to the database will eventually
invalidate the snapshot - when that happens you'll get a
Xapian::DatabaseModifiedError exception. Typically you'd respond to
that by calling reopen() on the database and retrying the search from
the start, or at least from a point from which you want consistency.
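As a rough sketch of that reopen-and-retry pattern: the helper below is
illustrative only and not part of Xapian. The name search_with_reopen and the
max_attempts cap are assumptions made for the example, and ModifiedError is a
template parameter so the sketch compiles without Xapian; in real code you
would pass Xapian::DatabaseModifiedError.

```cpp
#include <cassert>

// Hypothetical helper, not a Xapian API.
// ModifiedError: the exception type signalling an invalidated snapshot
//                (Xapian::DatabaseModifiedError in real code).
// reopen:        callable that refreshes the snapshot, e.g. [&] { db.reopen(); }
// search:        callable that runs the search from the point you want
//                consistency from and returns its result.
template <typename ModifiedError, typename Reopen, typename Search>
auto search_with_reopen(Reopen reopen, Search search, int max_attempts)
    -> decltype(search())
{
    assert(max_attempts > 0);
    for (int attempt = 1; ; ++attempt) {
        try {
            return search();
        } catch (const ModifiedError&) {
            // Give up once the attempt budget is spent, otherwise
            // refresh the snapshot and run the search again.
            if (attempt >= max_attempts) throw;
            reopen();
        }
    }
}
```

Capping the number of attempts is a design choice for the sketch: under a
heavy write load an unbounded loop could retry for a long time.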
> I do some expensive processing on each mset window, so I always
> limit the results to limit heap usage even if I'm planning on
> going through a big chunk of the DB:
>
> $mset = $enq->get_mset(0, 1000);
> do_something_slow_with_mset($mset);
> $mset = $enq->get_mset(1000, 1000);
> do_something_slow_with_mset($mset);
> $mset = $enq->get_mset(2000, 1000);
> do_something_slow_with_mset($mset);
While the match is running, get_mset(2000, 1000) needs to track
3000 entries so this won't reduce your heap usage (at least not
peak usage).
Is the heap usage problematic?
Looking at the code, for git master each entry is currently:
double weight;
Xapian::docid did;
Xapian::doccount collapse_count;
std::string collapse_key;
std::string sort_key;
We're always going to need the docid, but the other fields aren't
always needed and this could be slimmed down depending on what
options are in use if the size is causing problems. It is as it
is just for simplicity really.
If you're using libstdc++ on a 64-bit architecture, std::string is 32
bytes so that's 80 bytes (1.4.x currently has equivalent fields but in a
different order which incurs 8 bytes of padding to give 88 bytes - I'll
adjust that as this is an internal structure so we can reorder it
without affecting the public ABI). With libc++ on a 64-bit
architecture, std::string is 24 bytes so it'll be 64 bytes total (or 72
currently for 1.4).
For 32-bit architectures, std::string is 24 bytes for libstdc++ or 12 bytes
for libc++, so the total size is 64 or 40 bytes (probably without a
padding overhead in 1.4.x but I can't trivially check that).
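To check these numbers on a particular platform, something like the following
can be compiled there. It's only a sketch: docid and doccount are stand-ins
for Xapian::docid and Xapian::doccount (assumed to be 32-bit unsigned, which
matches the arithmetic above), EntryPacked uses the field order listed
earlier, and EntryPadded is a made-up less favourable order to show how
padding creeps in. It is not necessarily the order 1.4.x uses.

```cpp
#include <cstddef>
#include <string>

// Stand-ins for the Xapian types (assumption: both are 32-bit unsigned).
using docid = unsigned;
using doccount = unsigned;

// Field order as listed above: the double comes first and the two
// 4-byte integers share the following 8 bytes.
struct EntryPacked {
    double weight;
    docid did;
    doccount collapse_count;
    std::string collapse_key;
    std::string sort_key;
};

// Hypothetical order for illustration: each 4-byte integer sits alone
// next to an 8-byte-aligned neighbour, so padding may be inserted.
struct EntryPadded {
    docid did;
    double weight;
    std::string collapse_key;
    std::string sort_key;
    doccount collapse_count;
};

constexpr std::size_t packed_size = sizeof(EntryPacked);
constexpr std::size_t padded_size = sizeof(EntryPadded);

static_assert(packed_size <= padded_size,
              "the reordered layout should never be larger");
```

With libstdc++ on x86-64 these should come out as 80 and 88 bytes. On
platforms which only align double to 4 bytes the two sizes can be equal.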
If this structure was dynamically sized it could be as little as just
4 bytes per entry for a boolean search, or 12 for a search without
collapsing or sorting on a key (though at least x86-64 wants to align
a double on an 8 byte boundary which means 4 bytes of padding per
entry - that could be avoided by splitting into separate arrays).
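A minimal sketch of the separate arrays idea, for the case of a weighted
search without collapsing or sorting on a key. WeightedEntry and
WeightedResults are names invented for this example and don't correspond to
anything in the Xapian source.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Array-of-structs entry: on x86-64 this is padded from 12 to 16 bytes
// so that the double stays 8-byte aligned in an array.
struct WeightedEntry {
    double weight;
    std::uint32_t did;
};

// Struct-of-arrays alternative: weights and docids live in separate
// vectors, so no per-entry padding is needed.
struct WeightedResults {
    std::vector<double> weights;
    std::vector<std::uint32_t> dids;

    void add(double weight, std::uint32_t did) {
        weights.push_back(weight);
        dids.push_back(did);
    }

    std::size_t size() const { return dids.size(); }

    // Bytes of entry data actually in use (ignores spare vector capacity).
    std::size_t payload_bytes() const {
        return weights.size() * sizeof(double) +
               dids.size() * sizeof(std::uint32_t);
    }
};
```

The trade-off is that any sort or heap operation has to keep the two vectors
in step, which is part of why a single struct is simpler.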
> I'm not reusing Xapian::Enquire objects right now since the
> original code was made for HTML pagination and there's no
> guarantee subsequent pages would even hit the same HTTP process.
>
> Now with local batch reports and streaming dumps, reusing the
> Xapian::Enquire object might make sense if duplicates (or skips)
> can be avoided on DBs where another process is writing to it.
Generally the robust way to handle paging across a potentially changing
dataset is to specify the page start based on the data which determines
the order rather than saying "from 1000 results in", but I don't think
we offer a way to use this approach currently.
You'd probably need to be able to tell Enquire the relevance weight and
document id for the last entry you got, and the search results would
start at the next document with a relevance weight <= that (and if
equal, with document id > that). That works even if that document
has been deleted in the meantime. When sorting by key you'd need to
specify that too.
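To make the ordering rule concrete, here's a small self-contained sketch over
an in-memory list of hits. None of this is Xapian API: Hit, ranks_before and
page_after are hypothetical names, and it only covers relevance order
(descending weight, ties broken by ascending document id) with no sort key
and no collapsing.

```cpp
#include <algorithm>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical result entry: relevance weight plus document id.
struct Hit {
    double weight;
    std::uint32_t did;
};

// Result order: higher weight first, equal weights by ascending docid.
inline bool ranks_before(const Hit& a, const Hit& b) {
    if (a.weight != b.weight) return a.weight > b.weight;
    return a.did < b.did;
}

// Return up to n hits which rank strictly after `last`, the final entry
// of the previous page. `last` need not still be among the candidates.
std::vector<Hit> page_after(std::vector<Hit> candidates,
                            const Hit& last, std::size_t n) {
    // Drop everything at or before the cursor position.
    candidates.erase(
        std::remove_if(candidates.begin(), candidates.end(),
                       [&](const Hit& h) { return !ranks_before(last, h); }),
        candidates.end());
    n = std::min(n, candidates.size());
    // Only the first n entries need to end up in order.
    std::partial_sort(candidates.begin(), candidates.begin() + n,
                      candidates.end(), ranks_before);
    candidates.erase(candidates.begin() + n, candidates.end());
    return candidates;
}
```

Because the cursor is a (weight, docid) pair rather than an offset, documents
added or removed ahead of the cursor between calls don't shift the next page,
so they can't cause duplicates or skips.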
This would also have the advantage that it would only need to track the
number of entries it's actually been asked to return (not sure about
when collapsing - I'd need to think more about how that would be
implemented in this case).
> Neither query parsing nor setting up the Enquire object seems
> to take a measurable amount of time compared to the work that
> needs to be done with the $mset.
Yeah, Enquire objects are very cheap to construct.
Cheers,
Olly