Bron Gondwana
2013-Jun-19 03:29 UTC
[Xapian-discuss] Compact databases and removing stale records at the same time
I'm trying to compact (or at least merge) multiple databases, while stripping search records which are no longer required.

Backstory: I've inherited the Cyrus IMAPd xapian-based search code from Greg Banks when he left Opera. One of the unfinished parts was removing expunged emails from the search database.

We moved from having a single search database to supporting multiple databases. In our operational environment, we actually run four separate "tiers" of search database. The active tier is stored on tmpfs, meaning we don't pay any IO cost. If we lose that due to a server crash, we just have to check every folder for unindexed messages. Once per day, we compact that to "meta", which is stored on SSD. Once per week, we compact to "data" - merging with the existing "data" database.

They get a new name each time, so for example my current databases are:

  temp:91
  archive:3
  data:54

If I was to compress all those, I would first create a new database temp:92 and then compress the contents of those three (which are then read-only) into archive:4. Once that's complete, I would rewrite the active file as "temp:92 archive:4".

I'd like to clean out stale records at the same time - but this doesn't seem possible via the compact API. So I have two different functions, one that iterates, and one that uses compact.

The advantage of compact - it runs approximately 8 times as fast (we are CPU limited in each case - writing to tmpfs first, then rsyncing to the destination) and it takes approximately 75% of the space of a fresh database with maximum compaction.

The downside of compact - can't delete things (or at least I can't see how).

Does anyone have any suggestions for a better way to do this? I'll paste the code for the two different functions below (Cyrus is written in C - hence the C-compatible API interface).

I would prefer not to write to the source databases at all - the idea is that all except the "temp" database are read-only for all callers.

Thanks,

Bron.
----

int xapian_compact_dbs(const char *dest, const char **sources)
{
    int r = 0;

    try {
        Xapian::Compactor *c = new Xapian::Compactor;

        while (*sources) {
            c->add_source(*sources++);
        }

        c->set_destdir(dest);

        /* we never write to compression targets again */
        c->set_compaction_level(Xapian::Compactor::FULLER);
        c->set_multipass(true);

        c->compact();
    }
    catch (const Xapian::Error &err) {
        syslog(LOG_ERR, "IOERROR: Xapian: caught exception: %s: %s",
               err.get_context().c_str(), err.get_description().c_str());
        r = IMAP_IOERROR;
    }

    return r;
}

/* cb returns true if document should be copied, false if not */
int xapian_filter(const char *dest, const char **sources,
                  int (*cb)(const char *cyrusid, void *rock),
                  void *rock)
{
    int r = 0;
    int count = 0;

    try {
        /* set up a cursor to read from all the source databases */
        Xapian::Database *srcdb = new Xapian::Database();
        while (*sources) {
            srcdb->add_database(Xapian::Database(*sources++));
        }
        Xapian::Enquire enquire(*srcdb);
        enquire.set_query(Xapian::Query::MatchAll);
        Xapian::MSet matches = enquire.get_mset(0, srcdb->get_doccount());

        /* create a destination database */
        Xapian::WritableDatabase *destdb =
            new Xapian::WritableDatabase(dest, Xapian::DB_CREATE_OR_OPEN);
        destdb->begin_transaction();

        /* copy all matching documents to the new DB */
        for (Xapian::MSetIterator i = matches.begin() ; i != matches.end() ; ++i) {
            Xapian::Document doc = i.get_document();
            std::string cyrusid = doc.get_value(SLOT_CYRUSID);
            if (cb(cyrusid.c_str(), rock)) {
                destdb->add_document(doc);
                count++;
                /* commit occasionally */
                if (count % 1024 == 0) {
                    destdb->commit_transaction();
                    destdb->begin_transaction();
                }
            }
        }

        /* commit the remaining transaction */
        destdb->commit_transaction();

        delete destdb;
        delete srcdb;
    }
    catch (const Xapian::Error &err) {
        syslog(LOG_ERR, "IOERROR: Xapian: caught exception: %s: %s",
               err.get_context().c_str(), err.get_description().c_str());
        r = IMAP_IOERROR;
    }

    return r;
}

--
Bron Gondwana
brong at fastmail.fm
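For illustration, here is one way the cb hook passed to xapian_filter might look on the calling side. This is only a sketch: filter_rock, keep_if_valid, the id strings and the database paths are hypothetical stand-ins, not actual Cyrus code, which would consult the mailbox's expunge state rather than a static list.

#include <string.h>

/* hypothetical rock: a NULL-terminated list of cyrusids that are still live */
struct filter_rock {
    const char **valid_ids;
};

/* return non-zero to copy the document, zero to drop it */
static int keep_if_valid(const char *cyrusid, void *rock)
{
    const struct filter_rock *fr = (const struct filter_rock *)rock;
    const char **p;
    for (p = fr->valid_ids; *p; p++) {
        if (strcmp(cyrusid, *p) == 0)
            return 1;   /* still referenced - keep it */
    }
    return 0;           /* expunged - filter it out of the new database */
}

/* illustrative call site */
static int filter_example(void)
{
    const char *ids[] = { "example-cyrusid-1", "example-cyrusid-2", NULL };
    const char *sources[] = { "/path/to/archive:3", "/path/to/data:54", NULL };
    struct filter_rock fr = { ids };
    return xapian_filter("/path/to/archive:4", sources, keep_if_valid, &fr);
}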
Olly Betts
2013-Jun-19 05:49 UTC
[Xapian-discuss] Compact databases and removing stale records at the same time
On Wed, Jun 19, 2013 at 01:29:16PM +1000, Bron Gondwana wrote:
> The advantage of compact - it runs approximately 8 times as fast (we
> are CPU limited in each case - writing to tmpfs first, then rsyncing
> to the destination) and it takes approximately 75% of the space of a
> fresh database with maximum compaction.
>
> The downside of compact - can't delete things (or at least I can't see
> how).

A lot of the reason why compact is fast is because it pretty much just treats the contents of each posting list chunk as opaque data (if it renumbers, it has to adjust the header of the first chunk from each postlist, if I remember correctly).

In order to be able to delete documents as it went, it would have to modify any postlist chunks which contained those documents. That's possible, but adds complexity to the compaction code, and will probably lose most of the speed advantages.

The destination of a document-by-document copy should be close to compact for most of the tables. If changes were flushed during the copy, the postlist table may still benefit from compaction (if there was only one batch, then the postlist table should be compact too).

I've thought before that being able to compact tables independently might be useful.

> Does anyone have any suggestions for a better way to do this? I'll
> paste the code for the two different functions below (Cyrus is written
> in C - hence the C-compatible API interface).

[...]

> catch (const Xapian::Error &err) {
>     syslog(LOG_ERR, "IOERROR: Xapian: caught exception: %s: %s",
>            err.get_context().c_str(), err.get_description().c_str());

If err has a context, err.get_description() will actually include it.

> Xapian::Enquire enquire(*srcdb);
> enquire.set_query(Xapian::Query::MatchAll);
> Xapian::MSet matches = enquire.get_mset(0, srcdb->get_doccount());

[...]

> /* copy all matching documents to the new DB */
> for (Xapian::MSetIterator i = matches.begin() ; i != matches.end() ; ++i) {
>     Xapian::Document doc = i.get_document();

This requires creating an in-memory structure of size get_doccount(), so won't scale well to really big databases.

But there's no need to run a match just to be able to iterate all the documents in the database - you can just iterate the postlist for the empty term (via Xapian::Database::postlist_begin("")). I'd expect that would be a fair bit faster if you're CPU limited. See the copydatabase example for code which uses this approach to do a document-by-document copy.

> if (count % 1024 == 0) {
>     destdb->commit_transaction();
>     destdb->begin_transaction();
> }

There's no need to use transactions to do this - outside of transactions, you'll get an automatic commit periodically anyway (if you want to force a commit, you can just call destdb->commit()). There's not currently much difference between the two approaches, but the auto-commit is likely to get smarter with time (currently it is just based on number of documents changed, but it should probably take memory used to store changes as the primary factor). Using transactions is telling Xapian that you want those exact chunks of changes committed atomically, which gives little room to be smarter.

Cheers,
    Olly
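To make the get_description() point concrete, a minimal sketch of the simplified logging it implies - the helper name here is illustrative, and the "IOERROR" prefix is carried over from the Cyrus code above:

#include <syslog.h>
#include <xapian.h>

/* get_description() already embeds the error type, message and the
 * context (when there is one), so it can be logged on its own */
static void log_xapian_error(const Xapian::Error &err)
{
    syslog(LOG_ERR, "IOERROR: Xapian: caught exception: %s",
           err.get_description().c_str());
}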
Bron Gondwana
2013-Jun-19 14:07 UTC
[Xapian-discuss] Compact databases and removing stale records at the same time
On Wed, Jun 19, 2013, at 03:49 PM, Olly Betts wrote:
> On Wed, Jun 19, 2013 at 01:29:16PM +1000, Bron Gondwana wrote:
> > The advantage of compact - it runs approximately 8 times as fast (we
> > are CPU limited in each case - writing to tmpfs first, then rsyncing
> > to the destination) and it takes approximately 75% of the space of a
> > fresh database with maximum compaction.
> >
> > The downside of compact - can't delete things (or at least I can't see
> > how).
>
> A lot of the reason why compact is fast is because it pretty much just
> treats the contents of each posting list chunk as opaque data (if it
> renumbers, it has to adjust the header of the first chunk from each
> postlist, if I remember correctly).

Yeah, fair enough!

> In order to be able to delete documents as it went, it would have to
> modify any postlist chunks which contained those documents. That's
> possible, but adds complexity to the compaction code, and will probably
> lose most of the speed advantages.

I figured the bigger problem was actually garbage collecting the terms which didn't have references any more - in my quick glance through the code. I admit I don't understand how it all works quite as well as I'd like.

> The destination of a document-by-document copy should be close to
> compact for most of the tables. If changes were flushed during the
> copy, the postlist table may still benefit from compaction (if there
> was only one batch, then the postlist table should be compact too).

Well, I've switched to a single pass without all the transactional foo (see pasted below). It still compacts a lot better with compact:

[brong at imap14 brong]$ du -s *
1198332 xapian.57

[brong at imap14 brong]$ time sudo -u cyrus /usr/cyrus/bin/squatter -C /etc/cyrus/imapd-sloti14d5p4.conf -v -u brong -z data -t data -T /tmpfs/xap.tmp
compressing data:57 to data:58 for user.brong (active temp:92,archive:3,meta:0,data:57)
compacting databases
building cyrus.indexed.db
copying from tempdir to destination
renaming tempdir into place
finished compact of user.brong (active temp:92,archive:3,meta:0,data:58)

real    1m23.956s
user    0m32.604s
sys     0m5.948s

[brong at imap14 brong]$ du -s *
759992  xapian.58

That's about 63% of the uncompacted size.

> > catch (const Xapian::Error &err) {
> >     syslog(LOG_ERR, "IOERROR: Xapian: caught exception: %s: %s",
> >            err.get_context().c_str(), err.get_description().c_str());
>
> If err has a context, err.get_description() will actually include it.

Heh. That's code I inherited and hadn't even looked at. I don't think I've ever actually seen it called. I'll simplify it.

> > /* copy all matching documents to the new DB */
> > for (Xapian::MSetIterator i = matches.begin() ; i != matches.end() ; ++i) {
> >     Xapian::Document doc = i.get_document();
>
> This requires creating an in-memory structure of size get_doccount(), so
> won't scale well to really big databases.

My test DB is about 90k documents. Lots of terms though, particularly some of the emails which contain thousands of lines of syslog output.

[brong at imap14 brong]$ delve -1 -a xapian.58 | wc -l
6370721
[brong at imap14 brong]$ delve -1 -V0 xapian.58 | wc -l
89419

> But there's no need to run a match just to be able to iterate all the
> [...]
> There's no need to use transactions to do this - outside of
> [...]

v2:

try {
    /* set up a cursor to read from all the source databases */
    Xapian::Database srcdb = Xapian::Database();
    while (*sources) {
        srcdb.add_database(Xapian::Database(*sources++));
    }

    /* create a destination database */
    Xapian::WritableDatabase destdb =
        Xapian::WritableDatabase(dest, Xapian::DB_CREATE);

    /* copy all matching documents to the new DB */
    Xapian::PostingIterator it;
    for (it = srcdb.postlist_begin(""); it != srcdb.postlist_end(""); it++) {
        Xapian::docid did = *it;
        Xapian::Document doc = srcdb.get_document(did);
        std::string cyrusid = doc.get_value(SLOT_CYRUSID);
        if (cb(cyrusid.c_str(), rock)) {
            destdb.add_document(doc);
        }
    }

    /* commit all changes explicitly */
    destdb.commit();
}

FYI: SLOT_CYRUSID is just 0.

Thanks heaps for your help on this. Honestly, it's not a deal-breaker for us to use this much CPU. It's a pain, but it's still heaps cheaper than re-indexing everything, and our servers are IO bound more than CPU bound, so eating a bit more CPU is survivable.

Bron.

--
Bron Gondwana
brong at fastmail.fm