Bron Gondwana
2013-Jun-19 03:29 UTC
[Xapian-discuss] Compact databases and removing stale records at the same time
I'm trying to compact (or at least merge) multiple databases, while stripping search records which are no longer required.

Backstory: I've inherited the Cyrus IMAPd xapian-based search code from Greg Banks when he left Opera. One of the unfinished parts was removing expunged emails from the search database.

We moved from having a single search database to supporting multiple databases. In our operational environment, we actually run four separate "tiers" of search database. The active tier is stored on tmpfs, meaning we don't pay any IO cost. If we lose that due to a server crash, we just have to check every folder for unindexed messages. Once per day, we compact that to "meta", which is stored on SSD. Once per week, we compact to "data" - merging with the existing "data" database.

They get a new name each time, so for example my current databases are:

  temp:91
  archive:3
  data:54

If I was to compress all those, I would first create a new database temp:92 and then compress the contents of those three (which are then read-only) into archive:4. Once that's complete, I would rewrite the active file as "temp:92 archive:4".

I'd like to clean out stale records at the same time - but this doesn't seem possible via the compact API. So I have two different functions, one that iterates, and one that uses compact.

The advantage of compact - it runs approximately 8 times as fast (we are CPU limited in each case - writing to tmpfs first, then rsyncing to the destination) and it takes approximately 75% of the space of a fresh database with maximum compaction.

The downside of compact - can't delete things (or at least I can't see how).

Does anyone have any suggestions for a better way to do this? I'll paste the code for the two different functions below (Cyrus is written in C - hence the C-compatible API interface).

I would prefer not to write to the source databases at all - the idea is that all except the "temp" database are read-only for all callers.

Thanks,

Bron.
----

int xapian_compact_dbs(const char *dest, const char **sources)
{
    int r = 0;

    try {
        Xapian::Compactor *c = new Xapian::Compactor;

        while (*sources) {
            c->add_source(*sources++);
        }

        c->set_destdir(dest);

        /* we never write to compression targets again */
        c->set_compaction_level(Xapian::Compactor::FULLER);
        c->set_multipass(true);

        c->compact();
    }
    catch (const Xapian::Error &err) {
        syslog(LOG_ERR, "IOERROR: Xapian: caught exception: %s: %s",
               err.get_context().c_str(), err.get_description().c_str());
        r = IMAP_IOERROR;
    }

    return r;
}

/* cb returns true if document should be copied, false if not */
int xapian_filter(const char *dest, const char **sources,
                  int (*cb)(const char *cyrusid, void *rock),
                  void *rock)
{
    int r = 0;
    int count = 0;

    try {
        /* set up a cursor to read from all the source databases */
        Xapian::Database *srcdb = new Xapian::Database();
        while (*sources) {
            srcdb->add_database(Xapian::Database(*sources++));
        }
        Xapian::Enquire enquire(*srcdb);
        enquire.set_query(Xapian::Query::MatchAll);
        Xapian::MSet matches = enquire.get_mset(0, srcdb->get_doccount());

        /* create a destination database */
        Xapian::WritableDatabase *destdb =
            new Xapian::WritableDatabase(dest, Xapian::DB_CREATE_OR_OPEN);
        destdb->begin_transaction();

        /* copy all matching documents to the new DB */
        for (Xapian::MSetIterator i = matches.begin() ; i != matches.end() ; ++i) {
            Xapian::Document doc = i.get_document();
            std::string cyrusid = doc.get_value(SLOT_CYRUSID);
            if (cb(cyrusid.c_str(), rock)) {
                destdb->add_document(doc);
                count++;
                /* commit occasionally */
                if (count % 1024 == 0) {
                    destdb->commit_transaction();
                    destdb->begin_transaction();
                }
            }
        }

        /* commit the remaining transaction */
        destdb->commit_transaction();

        delete destdb;
        delete srcdb;
    }
    catch (const Xapian::Error &err) {
        syslog(LOG_ERR, "IOERROR: Xapian: caught exception: %s: %s",
               err.get_context().c_str(), err.get_description().c_str());
        r = IMAP_IOERROR;
    }

    return r;
}

--
Bron Gondwana
brong at fastmail.fm
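For illustration, here is one way the cb hook passed to xapian_filter might look on the calling side. This is only a sketch: filter_rock, keep_if_valid, the id strings and the database paths are hypothetical stand-ins, not actual Cyrus code, which would consult the mailbox's expunge state rather than a static list.

#include <string.h>

/* hypothetical rock: a NULL-terminated list of cyrusids that are still live */
struct filter_rock {
    const char **valid_ids;
};

/* return non-zero to copy the document, zero to drop it */
static int keep_if_valid(const char *cyrusid, void *rock)
{
    const struct filter_rock *fr = (const struct filter_rock *)rock;
    const char **p;
    for (p = fr->valid_ids; *p; p++) {
        if (strcmp(cyrusid, *p) == 0)
            return 1;   /* still referenced - keep it */
    }
    return 0;           /* expunged - filter it out of the new database */
}

/* illustrative call site */
static int filter_example(void)
{
    const char *ids[] = { "example-cyrusid-1", "example-cyrusid-2", NULL };
    const char *sources[] = { "/path/to/archive:3", "/path/to/data:54", NULL };
    struct filter_rock fr = { ids };
    return xapian_filter("/path/to/archive:4", sources, keep_if_valid, &fr);
}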
Olly Betts
2013-Jun-19 05:49 UTC
[Xapian-discuss] Compact databases and removing stale records at the same time
On Wed, Jun 19, 2013 at 01:29:16PM +1000, Bron Gondwana wrote:
> The advantage of compact - it runs approximately 8 times as fast (we
> are CPU limited in each case - writing to tmpfs first, then rsyncing
> to the destination) and it takes approximately 75% of the space of a
> fresh database with maximum compaction.
>
> The downside of compact - can't delete things (or at least I can't see
> how).

A lot of the reason why compact is fast is because it pretty much just treats the contents of each posting list chunk as opaque data (if it renumbers, it has to adjust the header of the first chunk from each postlist, if I remember correctly).

In order to be able to delete documents as it went, it would have to modify any postlist chunks which contained those documents. That's possible, but adds complexity to the compaction code, and will probably lose most of the speed advantages.

The destination of a document-by-document copy should be close to compact for most of the tables. If changes were flushed during the copy, the postlist table may still benefit from compaction (if there was only one batch, then the postlist table should be compact too).

I've thought before that being able to compact tables independently might be useful.

> Does anyone have any suggestions for a better way to do this? I'll
> paste the code for the two different functions below (Cyrus is written
> in C - hence the C-compatible API interface).

[...]

> catch (const Xapian::Error &err) {
>     syslog(LOG_ERR, "IOERROR: Xapian: caught exception: %s: %s",
>            err.get_context().c_str(), err.get_description().c_str());

If err has a context, err.get_description() will actually include it.

> Xapian::Enquire enquire(*srcdb);
> enquire.set_query(Xapian::Query::MatchAll);
> Xapian::MSet matches = enquire.get_mset(0, srcdb->get_doccount());

[...]

> /* copy all matching documents to the new DB */
> for (Xapian::MSetIterator i = matches.begin() ; i != matches.end() ; ++i) {
>     Xapian::Document doc = i.get_document();

This requires creating an in-memory structure of size get_doccount(), so won't scale well to really big databases.

But there's no need to run a match just to be able to iterate all the documents in the database - you can just iterate the postlist for the empty term (via Xapian::Database::postlist_begin("")). I'd expect that would be a fair bit faster if you're CPU limited. See the copydatabase example for code which uses this approach to do a document-by-document copy.

> if (count % 1024 == 0) {
>     destdb->commit_transaction();
>     destdb->begin_transaction();
> }

There's no need to use transactions to do this - outside of transactions, you'll get an automatic commit periodically anyway (if you want to force a commit, you can just call destdb->commit()). There's not currently much difference between the two approaches, but the auto-commit is likely to get smarter with time (currently it is just based on number of documents changed, but it should probably take memory used to store changes as the primary factor). Using transactions is telling Xapian that you want those exact chunks of changes committed atomically, which gives little room to be smarter.

Cheers,
    Olly
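To make the get_description() point concrete, a minimal sketch of the simplified logging it implies - the helper name here is illustrative, and the "IOERROR" prefix is carried over from the Cyrus code above:

#include <syslog.h>
#include <xapian.h>

/* get_description() already embeds the error type, message and the
 * context (when there is one), so it can be logged on its own */
static void log_xapian_error(const Xapian::Error &err)
{
    syslog(LOG_ERR, "IOERROR: Xapian: caught exception: %s",
           err.get_description().c_str());
}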
Bron Gondwana
2013-Jun-19 14:07 UTC
[Xapian-discuss] Compact databases and removing stale records at the same time
On Wed, Jun 19, 2013, at 03:49 PM, Olly Betts wrote:
> On Wed, Jun 19, 2013 at 01:29:16PM +1000, Bron Gondwana wrote:
> > The advantage of compact - it runs approximately 8 times as fast (we
> > are CPU limited in each case - writing to tmpfs first, then rsyncing
> > to the destination) and it takes approximately 75% of the space of a
> > fresh database with maximum compaction.
> >
> > The downside of compact - can't delete things (or at least I can't see
> > how).
>
> A lot of the reason why compact is fast is because it pretty much just
> treats the contents of each posting list chunk as opaque data (if it
> renumbers, it has to adjust the header of the first chunk from each
> postlist, if I remember correctly).

Yeah, fair enough!

> In order to be able to delete documents as it went, it would have to
> modify any postlist chunks which contained those documents. That's
> possible, but adds complexity to the compaction code, and will probably
> lose most of the speed advantages.

I figured the bigger problem was actually garbage collecting the terms which didn't have references any more - in my quick glance through the code. I admit I don't understand how it all works quite as well as I'd like.

> The destination of a document-by-document copy should be close to
> compact for most of the tables. If changes were flushed during the
> copy, the postlist table may still benefit from compaction (if there
> was only one batch, then the postlist table should be compact too).

Well, I've switched to a single pass without all the transactional foo (see pasted below). It still compacts a lot better with compact:

[brong at imap14 brong]$ du -s *
1198332 xapian.57

[brong at imap14 brong]$ time sudo -u cyrus /usr/cyrus/bin/squatter -C /etc/cyrus/imapd-sloti14d5p4.conf -v -u brong -z data -t data -T /tmpfs/xap.tmp
compressing data:57 to data:58 for user.brong (active temp:92,archive:3,meta:0,data:57)
compacting databases
building cyrus.indexed.db
copying from tempdir to destination
renaming tempdir into place
finished compact of user.brong (active temp:92,archive:3,meta:0,data:58)

real    1m23.956s
user    0m32.604s
sys     0m5.948s

[brong at imap14 brong]$ du -s *
759992  xapian.58

That's about 63% of the uncompacted size.

> > catch (const Xapian::Error &err) {
> >     syslog(LOG_ERR, "IOERROR: Xapian: caught exception: %s: %s",
> >            err.get_context().c_str(), err.get_description().c_str());
>
> If err has a context, err.get_description() will actually include it.

Heh. That's code I inherited and hadn't even looked at. I don't think I've ever actually seen it called. I'll simplify it.

> > /* copy all matching documents to the new DB */
> > for (Xapian::MSetIterator i = matches.begin() ; i != matches.end() ; ++i) {
> >     Xapian::Document doc = i.get_document();
>
> This requires creating an in-memory structure of size get_doccount(), so
> won't scale well to really big databases.

My test DB is about 90k documents. Lots of terms though, particularly some of the emails which contain thousands of lines of syslog output.

[brong at imap14 brong]$ delve -1 -a xapian.58 | wc -l
6370721
[brong at imap14 brong]$ delve -1 -V0 xapian.58 | wc -l
89419

> But there's no need to run a match just to be able to iterate all the
> [...]
> There's no need to use transactions to do this - outside of
> [...]

v2:

try {
    /* set up a cursor to read from all the source databases */
    Xapian::Database srcdb = Xapian::Database();
    while (*sources) {
        srcdb.add_database(Xapian::Database(*sources++));
    }

    /* create a destination database */
    Xapian::WritableDatabase destdb =
        Xapian::WritableDatabase(dest, Xapian::DB_CREATE);

    /* copy all matching documents to the new DB */
    Xapian::PostingIterator it;
    for (it = srcdb.postlist_begin(""); it != srcdb.postlist_end(""); it++) {
        Xapian::docid did = *it;
        Xapian::Document doc = srcdb.get_document(did);
        std::string cyrusid = doc.get_value(SLOT_CYRUSID);
        if (cb(cyrusid.c_str(), rock)) {
            destdb.add_document(doc);
        }
    }

    /* commit all changes explicitly */
    destdb.commit();
}

FYI: SLOT_CYRUSID is just 0.

Thanks heaps for your help on this. Honestly, it's not a deal-breaker for us to use this much CPU. It's a pain, but it's still heaps cheaper than re-indexing everything, and our servers are IO bound more than CPU bound, so eating a bit more CPU is survivable.

Bron.

--
Bron Gondwana
brong at fastmail.fm