Bron Gondwana
2013-Jun-19 14:07 UTC
[Xapian-discuss] Compact databases and removing stale records at the same time
On Wed, Jun 19, 2013, at 03:49 PM, Olly Betts wrote:
> On Wed, Jun 19, 2013 at 01:29:16PM +1000, Bron Gondwana wrote:
> > The advantage of compact - it runs approximately 8 times as fast (we
> > are CPU limited in each case - writing to tmpfs first, then rsyncing
> > to the destination) and it takes approximately 75% of the space of a
> > fresh database with maximum compaction.
> >
> > The downside of compact - can't delete things (or at least I can't see
> > how).
>
> A lot of the reason why compact is fast is because it pretty much just
> treats the contents of each posting list chunk as opaque data (if it
> renumbers, it has to adjust the header of the first chunk from each
> postlist, if I remember correctly).

Yeah, fair enough!

> In order to be able to delete documents as it went, it would have to
> modify any postlist chunks which contained those documents.  That's
> possible, but adds complexity to the compaction code, and will probably
> lose most of the speed advantages.

I figured the bigger problem was actually garbage collecting the terms
which didn't have references any more - from my quick glance through the
code.  I admit I don't understand how it all works quite as well as I'd
like.

> The destination of a document-by-document copy should be close to
> compact for most of the tables.  If changes were flushed during the
> copy, the postlist table may still benefit from compaction (if there
> was only one batch, then the postlist table should be compact too).

Well, I've switched to a single pass without all the transactional foo
(see pasted below).  It still compacts a lot better with compact:

[brong at imap14 brong]$ du -s *
1198332 xapian.57
[brong at imap14 brong]$ time sudo -u cyrus /usr/cyrus/bin/squatter -C /etc/cyrus/imapd-sloti14d5p4.conf -v -u brong -z data -t data -T /tmpfs/xap.tmp
compressing data:57 to data:58 for user.brong (active temp:92,archive:3,meta:0,data:57)
compacting databases
building cyrus.indexed.db
copying from tempdir to destination
renaming tempdir into place
finished compact of user.brong (active temp:92,archive:3,meta:0,data:58)

real    1m23.956s
user    0m32.604s
sys     0m5.948s
[brong at imap14 brong]$ du -s *
759992  xapian.58

That's about 63% of the uncompacted size.

> > catch (const Xapian::Error &err) {
> >     syslog(LOG_ERR, "IOERROR: Xapian: caught exception: %s: %s",
> >            err.get_context().c_str(), err.get_description().c_str());
>
> If err has a context, err.get_description() will actually include it.

Heh.  That's code I inherited and hadn't even looked at.  I don't think
I've ever actually seen it called.  I'll simplify it.

> > /* copy all matching documents to the new DB */
> > for (Xapian::MSetIterator i = matches.begin() ; i != matches.end() ; ++i) {
> >     Xapian::Document doc = i.get_document();
>
> This requires creating an in-memory structure of size get_doccount(), so
> won't scale well to really big databases.

My test DB is about 90k documents.  Lots of terms though, particularly
some of the emails which contain thousands of lines of syslog output.

[brong at imap14 brong]$ delve -1 -a xapian.58 | wc -l
6370721
[brong at imap14 brong]$ delve -1 -V0 xapian.58 | wc -l
89419

> But there's no need to run a match just to be able to iterate all the
> [...]

> There's no need to use transactions to do this - outside of
> [...]

v2:

    try {
        /* set up a cursor to read from all the source databases */
        Xapian::Database srcdb = Xapian::Database();
        while (*sources) {
            srcdb.add_database(Xapian::Database(*sources++));
        }

        /* create a destination database */
        Xapian::WritableDatabase destdb = Xapian::WritableDatabase(dest, Xapian::DB_CREATE);

        /* copy all matching documents to the new DB */
        Xapian::PostingIterator it;
        for (it = srcdb.postlist_begin(""); it != srcdb.postlist_end(""); it++) {
            Xapian::docid did = *it;
            Xapian::Document doc = srcdb.get_document(did);
            std::string cyrusid = doc.get_value(SLOT_CYRUSID);
            if (cb(cyrusid.c_str(), rock)) {
                destdb.add_document(doc);
            }
        }

        /* commit all changes explicitly */
        destdb.commit();
    }
    catch (const Xapian::Error &err) {
        /* get_description() already includes the context */
        syslog(LOG_ERR, "IOERROR: Xapian: caught exception: %s",
               err.get_description().c_str());
    }

FYI: SLOT_CYRUSID is just 0.

Thanks heaps for your help on this.  Honestly, it's not a deal-breaker
for us to use this much CPU.  It's a pain, but it's still heaps cheaper
than re-indexing everything, and our servers are IO bound more than CPU
bound, so eating a bit more CPU is survivable.

Bron.
--
  Bron Gondwana
  brong at fastmail.fm
Olly Betts
2013-Jun-20 00:24 UTC
[Xapian-discuss] Compact databases and removing stale records at the same time
On Thu, Jun 20, 2013 at 12:07:19AM +1000, Bron Gondwana wrote:
> On Wed, Jun 19, 2013, at 03:49 PM, Olly Betts wrote:
> > In order to be able to delete documents as it went, it would have to
> > modify any postlist chunks which contained those documents.  That's
> > possible, but adds complexity to the compaction code, and will probably
> > lose most of the speed advantages.
>
> I figured the bigger problem was actually garbage collecting the terms
> which didn't have references any more - in my quick glance through the
> code.  I admit I don't understand how it all works quite as well as I'd
> like.

Each term has a chunked list of postings (which are (docid, wdf) pairs),
so there's not really much to the "garbage collecting" part - if that
list is empty, the term is no longer present in the database.

> > The destination of a document-by-document copy should be close to
> > compact for most of the tables.  If changes were flushed during the
> > copy, the postlist table may still benefit from compaction (if there
> > was only one batch, then the postlist table should be compact too).
>
> Well, I've switched to a single pass without all the transactional foo
> (see pasted below)
>
> It still compacts a lot better with compact:
>
> [brong at imap14 brong]$ du -s *
> 1198332 xapian.57
> [brong at imap14 brong]$ time sudo -u cyrus /usr/cyrus/bin/squatter -C /etc/cyrus/imapd-sloti14d5p4.conf -v -u brong -z data -t data -T /tmpfs/xap.tmp
[...]
> [brong at imap14 brong]$ du -s *
> 759992 xapian.58

How does that break down by table though?  Looking at the sizes of the
corresponding .DB files before and after will give you most of this info
(the base files are much smaller, and essentially proportional in size).

> Xapian::Database srcdb = Xapian::Database();
> while (*sources) {
>     srcdb.add_database(Xapian::Database(*sources++));
> }
>
> /* create a destination database */
> Xapian::WritableDatabase destdb = Xapian::WritableDatabase(dest, Xapian::DB_CREATE);
>
> /* copy all matching documents to the new DB */
> Xapian::PostingIterator it;
> for (it = srcdb.postlist_begin(""); it != srcdb.postlist_end(""); it++) {
>     Xapian::docid did = *it;
>     Xapian::Document doc = srcdb.get_document(did);
>     std::string cyrusid = doc.get_value(SLOT_CYRUSID);
>     if (cb(cyrusid.c_str(), rock)) {
>         destdb.add_document(doc);
>     }
> }

With multiple databases as above, the docids are interleaved, so it
might be worth trying to open each source and copy its documents to
destdb in turn for better locality of reference, and so better cache
use.  That's assuming the raw docid order doesn't matter to you.

Is the CYRUSID value always non-empty?  If it is, you can actually
iterate that stream of values directly - something like:

    Xapian::ValueIterator it;
    for (it = srcdb.valuestream_begin(SLOT_CYRUSID); it != srcdb.valuestream_end(SLOT_CYRUSID); it++) {
        if (cb((*it).c_str(), rock)) {
            Xapian::docid did = it->get_docid();
            Xapian::Document doc = srcdb.get_document(did);
            destdb.add_document(doc);
        }
    }

This will omit any documents with an empty value in SLOT_CYRUSID though
(there's no distinction between an empty and unset value).

I suspect the document copying actually takes most of the time here,
unless you're discarding a lot of them.

Cheers,
    Olly
Bron Gondwana
2013-Jun-20 11:57 UTC
[Xapian-discuss] Compact databases and removing stale records at the same time
On Thu, Jun 20, 2013, at 10:24 AM, Olly Betts wrote:
> On Thu, Jun 20, 2013 at 12:07:19AM +1000, Bron Gondwana wrote:
> > On Wed, Jun 19, 2013, at 03:49 PM, Olly Betts wrote:
> > > In order to be able to delete documents as it went, it would have to
> > > modify any postlist chunks which contained those documents.  That's
> > > possible, but adds complexity to the compaction code, and will probably
> > > lose most of the speed advantages.
> >
> > I figured the bigger problem was actually garbage collecting the terms
> > which didn't have references any more - in my quick glance through the
> > code.  I admit I don't understand how it all works quite as well as I'd
> > like.
>
> Each term has a chunked list of postings (which are (docid, wdf) pairs),
> so there's not really much to the "garbage collecting" part - if that
> list is empty, the term is no longer present in the database.

Sure - it's more a matter of knowing which postings matter (since I'd
filter by a callback based on value[0]) at compact time.

> > It still compacts a lot better with compact:
> >
> > [brong at imap14 brong]$ du -s *
> > 1198332 xapian.57
> > [brong at imap14 brong]$ time sudo -u cyrus /usr/cyrus/bin/squatter -C /etc/cyrus/imapd-sloti14d5p4.conf -v -u brong -z data -t data -T /tmpfs/xap.tmp
> [...]
> > [brong at imap14 brong]$ du -s *
> > 759992 xapian.58
>
> How does that break down by table though?  Looking at the sizes of the
> corresponding .DB files before and after will give you most of this info
> (the base files are much smaller, and essentially proportional in size).

The DB is slightly larger now (another week's data indexed), but it
should be fine.

Compact result (ignore cyrus.indexed.db - that's our internal format to
track which records need to be indexed):

[brong at imap14 brong]$ du -s xapian.60/*
8       xapian.60/cyrus.indexed.db
4       xapian.60/iamchert
4       xapian.60/position.baseA
8       xapian.60/position.baseB
496332  xapian.60/position.DB
4       xapian.60/postlist.baseA
4       xapian.60/postlist.baseB
214840  xapian.60/postlist.DB
4       xapian.60/record.baseA
4       xapian.60/record.baseB
1072    xapian.60/record.DB
4       xapian.60/termlist.baseA
4       xapian.60/termlist.baseB
67100   xapian.60/termlist.DB

And the direct copy version - looks like most of the difference is the
postlist:

[brong at imap14 brong]$ du -s xapian.61/*
8       xapian.61/cyrus.indexed.db
0       xapian.61/flintlock
4       xapian.61/iamchert
8       xapian.61/position.baseA
8       xapian.61/position.baseB
500224  xapian.61/position.DB
12      xapian.61/postlist.baseA
12      xapian.61/postlist.baseB
619196  xapian.61/postlist.DB
4       xapian.61/record.baseA
4       xapian.61/record.baseB
1088    xapian.61/record.DB
4       xapian.61/termlist.baseA
4       xapian.61/termlist.baseB
93680   xapian.61/termlist.DB

> With multiple databases as above, the docids are interleaved, so it
> might be worth trying to open each source and copy its documents to
> destdb in turn for better locality of reference, and so better cache
> use.

Sounds sane.  I'll try that.

> That's assuming the raw docid order doesn't matter to you.

Not at all.  I really don't care about docids at all.

> Is the CYRUSID value always non-empty?  If it is, you can actually
> iterate that stream of values directly - something like:

It sure should be - I've had a couple of cases where a message wound up
without a CyrusID... only discovered because it triggered assertion
failures on read.  They should always have one.

> Xapian::ValueIterator it;
> for (it = srcdb.valuestream_begin(SLOT_CYRUSID); it != srcdb.valuestream_end(SLOT_CYRUSID); it++) {
>     if (cb((*it).c_str(), rock)) {
>         Xapian::docid did = it->get_docid();
>         Xapian::Document doc = srcdb.get_document(did);
>         destdb.add_document(doc);
>     }
> }

Going to give that a go, with separate document reads.  Thanks.

> I suspect the document copying actually takes most of the time here,
> unless you're discarding a lot of them.

Yeah, I think so too.

Anyway - I'll keep working on this code.  We need something that does
what it does.

Thanks again,

Bron.
--
  Bron Gondwana
  brong at fastmail.fm