Olly Betts <olly at survex.com> wrote:
> Typically you'd respond to
> that by calling reopen() on the database and retrying the search from
> the start, or at least from a point from which you want consistency.
Thanks, OK, so nothing new here.
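Something like this is what I have in mind for the retry (a minimal
sketch; the error class assumes bindings which throw Xapian::Error
objects, e.g. Xapian::DatabaseModifiedError):
my $mset = eval { $enq->get_mset(0, 1000) };
if ($@) { # DB revision was replaced while we were reading
    die $@ unless ref($@) && $@->isa('Xapian::DatabaseModifiedError');
    $xdb->reopen; # pick up the latest committed revision
    $mset = $enq->get_mset(0, 1000); # retry from the start
}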
> On Thu, Aug 17, 2023 at 09:28:26PM +0000, Eric Wong wrote:
> > I do some expensive processing on each mset window, so I always
> > limit the results to limit heap usage even if I'm planning on
> > going through a big chunk of the DB:
> >
> > $mset = $enq->get_mset(0, 1000);
> > do_something_slow_with_mset($mset);
> > $mset = $enq->get_mset(1000, 1000);
> > do_something_slow_with_mset($mset);
> > $mset = $enq->get_mset(2000, 1000);
> > do_something_slow_with_mset($mset);
>
> While the match is running, get_mset(2000, 1000) needs to track
> 3000 entries so this won't reduce your heap usage (at least not
> peak usage).
>
> Is the heap usage problematic?
Yes, roughly ~1.3GB (in a Perl process) for ~17 million (and
growing) docs in the worst case of a search returning everything.
Those numbers appear in line with the 88 bytes w/ 64-bit libstdc++
you noted (17M * 88 bytes is roughly 1.4GB).
I realize the offset can cause consistency problems, but when
dumping mboxes over HTTP(S) it's fairly common for clients to
abort prematurely while the offset is still low and the limit
small, so the windowed approach saves work there.
So a compromise for large streaming downloads could be:
$mset = $enq->get_mset(0, 1000);
do_something_slow_with_mset($mset);
# is reader still alive? get the rest:
$mset = $enq->get_mset(1000, $xdb->get_doccount);
do_something_slow_with_mset($mset);
Batch/admin jobs can just use:
$mset = $enq->get_mset(0, $xdb->get_doccount);
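Wrapped up, that compromise might look like (a sketch; the
stream_results name and callback convention are hypothetical, and
as you note the second get_mset still has to track 1000 +
get_doccount entries while the match runs):
# small first window so an early client abort costs little,
# then the remainder in one big window
sub stream_results {
    my ($enq, $xdb, $cb) = @_;
    my $mset = $enq->get_mset(0, 1000);
    $cb->($mset) or return; # $cb returns false once the reader is gone
    $mset = $enq->get_mset(1000, $xdb->get_doccount);
    $cb->($mset);
}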
> Looking at the code, for git master each entry is currently:
>
>     double weight;
>     Xapian::docid did;
>     Xapian::doccount collapse_count;
>     std::string collapse_key;
>     std::string sort_key;
<snip>
> If you're using libstdc++ on a 64-bit architecture, std::string is 32
> bytes so that's 80 bytes (1.4.x currently has equivalent fields but in
> a different order which incurs 8 bytes of padding to give 88 bytes -
> I'll adjust that as this is an internal structure so we can reorder it
> without affecting the public ABI). With libc++ on a 64-bit
> architecture, std::string is 24 bytes so it'll be 64 bytes total (or
> 72 currently for 1.4).
<snip>
> If this structure was dynamically sized it could be as little as just
> 4 bytes per entry for a boolean search, or 12 for a search without
> collapsing or sorting on a key (though at least x86-64 wants to align
> a double on an 8 byte boundary which means 4 bytes of padding per
> entry - that could be avoided by splitting into separate arrays).
Yeah, it seems separate arrays would be appropriate, since
collapsing isn't commonly used AFAIK.
Also, could weight be a 32-bit float instead of a 64-bit double?
Sidenote: thanks for noting libc++, I didn't know about it.
> > I'm not reusing Xapian::Enquire objects right now since the
> > original code was made for HTML pagination and there's no
> > guarantee subsequent pages would even hit the same HTTP process.
> >
> > Now with local batch reports and streaming dumps, reusing the
> > Xapian::Enquire object might make sense if duplicates (or skips)
> > can be avoided on DBs where another process is writing to it.
>
> Generally the robust way to handle paging across a potentially changing
> dataset is to specify the page start based on the data which determines
> the order rather than saying "from 1000 results in", but I don't think
> we offer a way to use this approach currently.
>
> You'd probably need to be able to tell Enquire the relevance weight and
> document id for the last entry you got, and the search results would
> start at the next document with a relevance weight <= that (and if
> equal, with document id > that). That works even if that document
> has been deleted in the meantime. When sorting by key you'd need to
> specify that too.
So like ->set_cutoff but in the opposite direction?
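Until something like that exists, I suppose it could be
approximated client-side after a reopen by rerunning the match and
skipping already-seen entries (a sketch; resume_after is a
hypothetical helper, iterator usage per the Search::Xapian-style
bindings):
# keep only entries below the last (weight, docid) already emitted,
# per your description: weight <= last, and if equal, docid > last
sub resume_after {
    my ($mset, $last_wt, $last_did) = @_;
    my @docids;
    for (my $it = $mset->begin; $it != $mset->end; $it++) {
        my ($wt, $did) = ($it->get_weight, $it->get_docid);
        next if $wt > $last_wt || ($wt == $last_wt && $did <= $last_did);
        push @docids, $did;
    }
    \@docids;
}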
Thanks.