On Thu, Aug 17, 2023 at 09:28:26PM +0000, Eric Wong wrote:
> In other words, is it possible to avoid duplicates if new
> documents are inserted into the DB by another process in-between
> ->get_mset calls when reusing Xapian::Enquire objects?

The Database object itself effectively does (it works in a snapshot of
the state of the database when you open it, or last called reopen(),
which updates that snapshot to what's currently committed).

However, we don't currently have any locking of the snapshots that
readers are using, so changes made to the database will eventually
invalidate the snapshot - when that happens you'll get a
Xapian::DatabaseModifiedError exception.  Typically you'd respond to
that by calling reopen() on the database and retrying the search from
the start, or at least from a point from which you want consistency.

> I do some expensive processing on each mset window, so I always
> limit the results to limit heap usage even if I'm planning on
> going through a big chunk of the DB:
>
>   $mset = $enq->get_mset(0, 1000);
>   do_something_slow_with_mset($mset);
>   $mset = $enq->get_mset(1000, 1000);
>   do_something_slow_with_mset($mset);
>   $mset = $enq->get_mset(2000, 1000);
>   do_something_slow_with_mset($mset);

While the match is running, get_mset(2000, 1000) needs to track
3000 entries, so this won't reduce your heap usage (at least not
peak usage).

Is the heap usage problematic?

Looking at the code, for git master each entry is currently:

    double weight;
    Xapian::docid did;
    Xapian::doccount collapse_count;
    std::string collapse_key;
    std::string sort_key;

We're always going to need the docid, but the other fields aren't
always needed, and this could be slimmed down depending on what options
are in use if the size is causing problems.  It is as it is just for
simplicity, really.
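The reopen-and-retry pattern described above can be sketched as follows.  This is illustration only, not the real Xapian bindings: `DatabaseModifiedError` and `reopen()` mirror the real Xapian names, but `FakeDatabase`, its `search()` method, and `search_with_retry()` are made-up stand-ins.

```python
class DatabaseModifiedError(Exception):
    """Raised when a writer has invalidated our read snapshot."""

class FakeDatabase:
    """Stand-in for a database handle: stale at first, fresh after reopen()."""
    def __init__(self):
        self.stale = True          # pretend a writer invalidated our snapshot

    def reopen(self):
        self.stale = False         # refresh snapshot to what's now committed

    def search(self, query, first, maxitems):
        if self.stale:
            raise DatabaseModifiedError()
        return [f"{query}:{i}" for i in range(first, first + maxitems)]

def search_with_retry(db, query, first, maxitems, max_attempts=3):
    """On DatabaseModifiedError, reopen() and retry from the start."""
    for _ in range(max_attempts):
        try:
            return db.search(query, first, maxitems)
        except DatabaseModifiedError:
            db.reopen()            # refresh the snapshot, then retry
    raise RuntimeError("database kept changing under us")
```

With these stand-ins, `search_with_retry(FakeDatabase(), "q", 0, 3)` hits the stale snapshot once, reopens, and then returns the three results.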
If you're using libstdc++ on a 64-bit architecture, std::string is 32
bytes, so that's 80 bytes per entry (1.4.x currently has equivalent
fields but in a different order which incurs 8 bytes of padding to give
88 bytes - I'll adjust that, as this is an internal structure so we can
reorder it without affecting the public ABI).  With libc++ on a 64-bit
architecture, std::string is 24 bytes, so it'll be 64 bytes total (or
72 currently for 1.4).

For 32-bit architectures, std::string is 24 bytes for libstdc++ or 12
bytes for libc++, so the total size is 64 or 40 bytes (probably without
any padding overhead in 1.4.x, but I can't trivially check that).

If this structure was dynamically sized it could be as little as just
4 bytes per entry for a boolean search, or 12 for a search without
collapsing or sorting on a key (though at least x86-64 wants to align
a double on an 8 byte boundary, which means 4 bytes of padding per
entry - that could be avoided by splitting into separate arrays).

> I'm not reusing Xapian::Enquire objects right now since the
> original code was made for HTML pagination and there's no
> guarantee subsequent pages would even hit the same HTTP process.
>
> Now with local batch reports and streaming dumps, reusing the
> Xapian::Enquire object might make sense if duplicates (or skips)
> can be avoided on DBs where another process is writing to it.

Generally the robust way to handle paging across a potentially changing
dataset is to specify the page start based on the data which determines
the order, rather than saying "from 1000 results in", but I don't think
we offer a way to use this approach currently.

You'd probably need to be able to tell Enquire the relevance weight and
document id for the last entry you got, and the search results would
start at the next document with a relevance weight <= that (and if
equal, with document id > that).  That works even if that document
has been deleted in the meantime.  When sorting by key you'd need to
specify that too.
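The resume-after-(weight, docid) idea can be sketched with plain tuples; nothing here is real Xapian API, and `ranked()`/`resume_after()` are hypothetical names.  Results are ordered by weight descending then docid ascending, and the next page starts at documents with weight below the anchor, or equal weight and higher docid:

```python
def ranked(results):
    """Sort (docid, weight) pairs by weight descending, docid ascending."""
    return sorted(results, key=lambda dw: (-dw[1], dw[0]))

def resume_after(results, last_docid, last_weight):
    """Return ranked results that come strictly after the anchor entry.

    Works even if the anchor document has since been deleted, since we
    only compare against its (weight, docid), not look it up."""
    return [(did, w) for did, w in ranked(results)
            if w < last_weight or (w == last_weight and did > last_docid)]

# Page 1 over the original snapshot:
docs = [(1, 2.0), (2, 3.0), (3, 2.0), (4, 1.0)]
page1 = ranked(docs)[:2]                 # docs 2 and 1; anchor is doc 1
anchor_did, anchor_w = page1[-1]

# A writer deletes doc 1 and inserts doc 5 between page fetches:
docs2 = [(2, 3.0), (3, 2.0), (4, 1.0), (5, 2.5)]
page2 = resume_after(docs2, anchor_did, anchor_w)
# page2 continues from doc 3: no duplicates, and the newly inserted
# doc 5 (which ranks above the anchor) is consistently skipped.
```

The key property is that the continuation point is defined by the ordering data itself, so concurrent inserts and deletes can't cause repeats.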
This would also have the advantage that it would only need to track the
number of entries it's actually been asked to return (not sure about
when collapsing - I'd need to think more about how that would be
implemented in this case).

> Neither query parsing nor setting up the Enquire object seems
> to take a measurable amount of time compared to the work that
> needs to be done with the $mset.

Yeah, Enquire objects are very cheap to construct.

Cheers,
    Olly
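The memory advantage mentioned above can be illustrated with a bounded heap (made-up code, not Xapian internals): an offset-based match must track `first + maxitems` candidates, while resuming from a `(weight, docid)` anchor only ever tracks `maxitems`, because earlier-ranked documents are filtered out before they enter the candidate heap.

```python
import heapq

def top_k(results, k, anchor=None):
    """Select the top k of (docid, weight) pairs using at most k slots.

    anchor=(last_weight, last_docid) excludes everything ranked at or
    before the anchor, so a resumed page needs only k tracked entries."""
    heap = []                          # min-heap of (sort_key, docid)
    for did, w in results:
        if anchor is not None:
            lw, ld = anchor
            if w > lw or (w == lw and did <= ld):
                continue               # ranked before/at the anchor: skip
        key = (w, -did)                # higher weight first, lower docid first
        if len(heap) < k:
            heapq.heappush(heap, (key, did))
        else:
            heapq.heappushpop(heap, (key, did))   # heap never exceeds k
    return [did for key, did in sorted(heap, reverse=True)]
```

For example, `top_k(results, 1000, anchor=...)` holds at most 1000 entries however deep into the result set the anchor is, whereas emulating `get_mset(2000, 1000)` needs 3000.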
Olly Betts <olly at survex.com> wrote:
> Typically you'd respond to
> that by calling reopen() on the database and retrying the search from
> the start, or at least from a point from which you want consistency.

Thanks, OK, so nothing new here.

> On Thu, Aug 17, 2023 at 09:28:26PM +0000, Eric Wong wrote:
> > I do some expensive processing on each mset window, so I always
> > limit the results to limit heap usage even if I'm planning on
> > going through a big chunk of the DB:
> >
> >   $mset = $enq->get_mset(0, 1000);
> >   do_something_slow_with_mset($mset);
> >   $mset = $enq->get_mset(1000, 1000);
> >   do_something_slow_with_mset($mset);
> >   $mset = $enq->get_mset(2000, 1000);
> >   do_something_slow_with_mset($mset);
>
> While the match is running, get_mset(2000, 1000) needs to track
> 3000 entries so this won't reduce your heap usage (at least not
> peak usage).
>
> Is the heap usage problematic?

Yes, roughly ~1.3GB (in a Perl process) for ~17 million (and growing)
docs in the worst case of a search returning everything.  Those numbers
appear in line with the 88 bytes w/ 64-bit libstdc++ you noted.

I realize the offset can cause problems, but premature aborts are
fairly common when dumping mboxes over HTTP(S) while the offset is
still low and the limit small.  So a compromise for large streaming
downloads could be:

  $mset = $enq->get_mset(0, 1000);
  do_something_slow_with_mset($mset);

  # is reader still alive?
  # get the rest:
  $mset = $enq->get_mset(1000, $xdb->get_doccount);
  do_something_slow_with_mset($mset);

Batch/admin jobs can just use:

  $mset = $enq->get_mset(0, $xdb->get_doccount);

> Looking at the code, for git master each entry is currently:
>
>     double weight;
>     Xapian::docid did;
>     Xapian::doccount collapse_count;
>     std::string collapse_key;
>     std::string sort_key;

<snip>

> If you're using libstdc++ on a 64-bit architecture, std::string is 32
> bytes so that's 80 bytes (1.4.x currently has equivalent fields but in
> a different order which incurs 8 bytes of padding to give 88 bytes -
> I'll adjust that as this is an internal structure so we can reorder it
> without affecting the public ABI).  With libc++ on a 64-bit
> architecture, std::string is 24 bytes so it'll be 64 bytes total (or
> 72 currently for 1.4).

<snip>

> If this structure was dynamically sized it could be as little as just
> 4 bytes per entry for a boolean search, or 12 for a search without
> collapsing or sorting on a key (though at least x86-64 wants to align
> a double on an 8 byte boundary which means 4 bytes of padding per
> entry - that could be avoided by splitting into separate arrays).

Yeah, it seems separate arrays would be appropriate, since collapse
isn't commonly used AFAIK.  Also, could weight be a 32-bit float
instead of a 64-bit double?

sidenote: thanks for noting libc++, I didn't know about it
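The sizes discussed in the thread can be reproduced with ctypes, modelling `std::string` as a 32-byte pointer-aligned blob (its size with 64-bit libstdc++; 24 bytes with libc++).  The field names mirror the quoted Xapian struct, but the exact 1.4.x ordering below is a guess chosen to exhibit the 8 bytes of padding; the float32 check at the end speaks to Eric's question.  Assumes a 64-bit platform.

```python
import ctypes
import struct

class Str(ctypes.Structure):
    """Stand-in for libstdc++'s 32-byte, 8-byte-aligned std::string."""
    _fields_ = [("ptr", ctypes.c_void_p), ("buf", ctypes.c_char * 24)]

class EntryMaster(ctypes.Structure):        # git master order: no padding
    _fields_ = [("weight", ctypes.c_double),          # offset 0
                ("did", ctypes.c_uint32),             # offset 8
                ("collapse_count", ctypes.c_uint32),  # offset 12
                ("collapse_key", Str),                # offset 16
                ("sort_key", Str)]                    # offset 48 -> 80 total

class EntryOld(ctypes.Structure):           # 1.4.x-like order (guessed)
    _fields_ = [("did", ctypes.c_uint32),             # 4-byte hole follows
                ("weight", ctypes.c_double),          # needs 8-byte alignment
                ("collapse_key", Str),
                ("collapse_count", ctypes.c_uint32),  # another 4-byte hole
                ("sort_key", Str)]                    # -> 88 total

assert ctypes.sizeof(EntryMaster) == 80
assert ctypes.sizeof(EntryOld) == 88        # field order costs 8 bytes

# A 32-bit float halves the weight's footprint, but only carries about
# 7 significant decimal digits, so nearby weights (made-up values here)
# can become indistinguishable, turning close rankings into ties:
def as_f32(x):
    return struct.unpack('f', struct.pack('f', x))[0]

assert struct.calcsize('f') == 4 and struct.calcsize('d') == 8
assert 7.12345678 != 7.12345679             # distinct as doubles
assert as_f32(7.12345678) == as_f32(7.12345679)   # equal as float32
```

So a float weight would save 4 bytes per entry (plus possibly padding), at the cost of more weight ties, which the docid tiebreak would then have to resolve.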
> Generally the robust way to handle paging across a potentially changing
> dataset is to specify the page start based on the data which determines
> the order rather than saying "from 1000 results in", but I don't think
> we offer a way to use this approach currently.
>
> You'd probably need to be able to tell Enquire the relevance weight and
> document id for the last entry you got, and the search results would
> start at the next document with a relevance weight <= that (and if
> equal, with document id > that).  That works even if that document
> has been deleted in the meantime.  When sorting by key you'd need to
> specify that too.

So like ->set_cutoff but in the opposite direction?  Thanks.
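The "opposite direction" comparison can be made concrete with a toy filter over an already-ranked weight list (plain Python, not real Xapian API): a weight cutoff as set by `set_cutoff` discards everything scoring below a bound, while the proposed resume point keeps exactly what a cutoff at the anchor's weight would discard.

```python
ranked_weights = [5.0, 4.0, 3.0, 2.0, 1.0]   # already in descending order

def apply_cutoff(weights, bound):
    """Like a weight cutoff: drop results scoring below the bound."""
    return [w for w in weights if w >= bound]

def resume_below(weights, bound):
    """The opposite direction: continue from below the bound.
    (Exclusive here; in the real proposal, ties at the bound would be
    broken by docid rather than dropped.)"""
    return [w for w in weights if w < bound]

assert apply_cutoff(ranked_weights, 3.0) == [5.0, 4.0, 3.0]
assert resume_below(ranked_weights, 3.0) == [2.0, 1.0]
```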