Bron Gondwana
2013-Jun-19 14:07 UTC
[Xapian-discuss] Compact databases and removing stale records at the same time
On Wed, Jun 19, 2013, at 03:49 PM, Olly Betts wrote:
> On Wed, Jun 19, 2013 at 01:29:16PM +1000, Bron Gondwana wrote:
> > The advantage of compact - it runs approximately 8 times as fast (we
> > are CPU limited in each case - writing to tmpfs first, then rsyncing
> > to the destination) and it takes approximately 75% of the space of a
> > fresh database with maximum compaction.
> >
> > The downside of compact - can't delete things (or at least I can't see
> > how).
>
> A lot of the reason why compact is fast is because it pretty much just
> treats the contents of each posting list chunk as opaque data (if it
> renumbers, it has to adjust the header of the first chunk from each
> postlist, if I remember correctly).

Yeah, fair enough!

> In order to be able to delete documents as it went, it would have to
> modify any postlist chunks which contained those documents.  That's
> possible, but adds complexity to the compaction code, and will probably
> lose most of the speed advantages.

I figured the bigger problem was actually garbage collecting the terms
which didn't have references any more - from my quick glance through the
code.  I admit I don't understand how it all works quite as well as I'd
like.

> The destination of a document-by-document copy should be close to
> compact for most of the tables.  If changes were flushed during the
> copy, the postlist table may still benefit from compaction (if there
> was only one batch, then the postlist table should be compact too).

Well, I've switched to a single pass without all the transactional foo
(see pasted below).  It still compacts a lot better with compact:

[brong at imap14 brong]$ du -s *
1198332 xapian.57
[brong at imap14 brong]$ time sudo -u cyrus /usr/cyrus/bin/squatter -C /etc/cyrus/imapd-sloti14d5p4.conf -v -u brong -z data -t data -T /tmpfs/xap.tmp
compressing data:57 to data:58 for user.brong (active temp:92,archive:3,meta:0,data:57)
compacting databases
building cyrus.indexed.db
copying from tempdir to destination
renaming tempdir into place
finished compact of user.brong (active temp:92,archive:3,meta:0,data:58)

real    1m23.956s
user    0m32.604s
sys     0m5.948s
[brong at imap14 brong]$ du -s *
759992  xapian.58

That's about 63% of the uncompacted size.

> > catch (const Xapian::Error &err) {
> >     syslog(LOG_ERR, "IOERROR: Xapian: caught exception: %s: %s",
> >            err.get_context().c_str(), err.get_description().c_str());
>
> If err has a context, err.get_description() will actually include it.

Heh.  That's code I inherited and hadn't even looked at.  I don't think
I've ever actually seen it called.  I'll simplify it.

> > /* copy all matching documents to the new DB */
> > for (Xapian::MSetIterator i = matches.begin() ; i != matches.end() ; ++i) {
> >     Xapian::Document doc = i.get_document();
>
> This requires creating an in-memory structure of size get_doccount(), so
> won't scale well to really big databases.

My test DB is about 90k documents.  Lots of terms though, particularly
some of the emails which contain thousands of lines of syslog output.

[brong at imap14 brong]$ delve -1 -a xapian.58 | wc -l
6370721
[brong at imap14 brong]$ delve -1 -V0 xapian.58 | wc -l
89419

> But there's no need to run a match just to be able to iterate all the
> [...]

> There's no need to use transactions to do this - outside of
> [...]

v2:

    try {
        /* set up a cursor to read from all the source databases */
        Xapian::Database srcdb = Xapian::Database();
        while (*sources) {
            srcdb.add_database(Xapian::Database(*sources++));
        }

        /* create a destination database */
        Xapian::WritableDatabase destdb = Xapian::WritableDatabase(dest, Xapian::DB_CREATE);

        /* copy all matching documents to the new DB */
        Xapian::PostingIterator it;
        for (it = srcdb.postlist_begin(""); it != srcdb.postlist_end(""); it++) {
            Xapian::docid did = *it;
            Xapian::Document doc = srcdb.get_document(did);
            std::string cyrusid = doc.get_value(SLOT_CYRUSID);
            if (cb(cyrusid.c_str(), rock)) {
                destdb.add_document(doc);
            }
        }

        /* commit all changes explicitly */
        destdb.commit();
    }
    catch (const Xapian::Error &err) {
        /* get_description() already includes the context */
        syslog(LOG_ERR, "IOERROR: Xapian: caught exception: %s",
               err.get_description().c_str());
    }

FYI: SLOT_CYRUSID is just 0.

Thanks heaps for your help on this.  Honestly, it's not a deal-breaker
for us to use this much CPU.  It's a pain, but it's still heaps cheaper
than re-indexing everything, and our servers are IO bound more than CPU
bound, so eating a bit more CPU is survivable.

Bron.
--
  Bron Gondwana
  brong at fastmail.fm
Olly Betts
2013-Jun-20 00:24 UTC
[Xapian-discuss] Compact databases and removing stale records at the same time
On Thu, Jun 20, 2013 at 12:07:19AM +1000, Bron Gondwana wrote:
> On Wed, Jun 19, 2013, at 03:49 PM, Olly Betts wrote:
> > In order to be able to delete documents as it went, it would have to
> > modify any postlist chunks which contained those documents.  That's
> > possible, but adds complexity to the compaction code, and will probably
> > lose most of the speed advantages.
>
> I figured the bigger problem was actually garbage collecting the terms
> which didn't have references any more - in my quick glance through the
> code.  I admit I don't understand how it all works quite as well as I'd
> like.

Each term has a chunked list of postings (which are (docid, wdf) pairs),
so there's not really much to the "garbage collecting" part - if that
list is empty, the term is no longer present in the database.

> > The destination of a document-by-document copy should be close to
> > compact for most of the tables.  If changes were flushed during the
> > copy, the postlist table may still benefit from compaction (if there
> > was only one batch, then the postlist table should be compact too).
>
> Well, I've switched to a single pass without all the transactional foo
> (see pasted below)
>
> It still compacts a lot better with compact:
>
> [brong at imap14 brong]$ du -s *
> 1198332 xapian.57
> [brong at imap14 brong]$ time sudo -u cyrus /usr/cyrus/bin/squatter -C /etc/cyrus/imapd-sloti14d5p4.conf -v -u brong -z data -t data -T /tmpfs/xap.tmp
[...]
> [brong at imap14 brong]$ du -s *
> 759992 xapian.58

How does that break down by table though?  Looking at the sizes of the
corresponding .DB files before and after will give you most of this info
(the base files are much smaller, and essentially proportional in size).

> Xapian::Database srcdb = Xapian::Database();
> while (*sources) {
>     srcdb.add_database(Xapian::Database(*sources++));
> }
>
> /* create a destination database */
> Xapian::WritableDatabase destdb = Xapian::WritableDatabase(dest, Xapian::DB_CREATE);
>
> /* copy all matching documents to the new DB */
> Xapian::PostingIterator it;
> for (it = srcdb.postlist_begin(""); it != srcdb.postlist_end(""); it++) {
>     Xapian::docid did = *it;
>     Xapian::Document doc = srcdb.get_document(did);
>     std::string cyrusid = doc.get_value(SLOT_CYRUSID);
>     if (cb(cyrusid.c_str(), rock)) {
>         destdb.add_document(doc);
>     }
> }

With multiple databases as above, the docids are interleaved, so it
might be worth trying to open each source and copy its documents to
destdb in turn for better locality of reference, and so better cache
use.  That's assuming the raw docid order doesn't matter to you.

Is the CYRUSID value always non-empty?  If it is, you can actually
iterate that stream of values directly - something like:

    Xapian::ValueIterator it;
    for (it = srcdb.valuestream_begin(SLOT_CYRUSID); it != srcdb.valuestream_end(SLOT_CYRUSID); it++) {
        if (cb((*it).c_str(), rock)) {
            Xapian::docid did = it->get_docid();
            Xapian::Document doc = srcdb.get_document(did);
            destdb.add_document(doc);
        }
    }

This will omit any documents with an empty value in SLOT_CYRUSID though
(there's no distinction between an empty and unset value).

I suspect the document copying actually takes most of the time here,
unless you're discarding a lot of them.

Cheers,
    Olly
Bron Gondwana
2013-Jun-20 11:57 UTC
[Xapian-discuss] Compact databases and removing stale records at the same time
On Thu, Jun 20, 2013, at 10:24 AM, Olly Betts wrote:
> On Thu, Jun 20, 2013 at 12:07:19AM +1000, Bron Gondwana wrote:
> > On Wed, Jun 19, 2013, at 03:49 PM, Olly Betts wrote:
> > > In order to be able to delete documents as it went, it would have to
> > > modify any postlist chunks which contained those documents.  That's
> > > possible, but adds complexity to the compaction code, and will probably
> > > lose most of the speed advantages.
> >
> > I figured the bigger problem was actually garbage collecting the terms
> > which didn't have references any more - in my quick glance through the
> > code.  I admit I don't understand how it all works quite as well as I'd
> > like.
>
> Each term has a chunked list of postings (which are (docid, wdf) pairs),
> so there's not really much to the "garbage collecting" part - if that
> list is empty, the term is no longer present in the database.

Sure - it's more a matter of knowing which postings matter (since I'd
filter by a callback based on value[0]) at compact time.

> > It still compacts a lot better with compact:
> >
> > [brong at imap14 brong]$ du -s *
> > 1198332 xapian.57
> > [brong at imap14 brong]$ time sudo -u cyrus /usr/cyrus/bin/squatter -C /etc/cyrus/imapd-sloti14d5p4.conf -v -u brong -z data -t data -T /tmpfs/xap.tmp
> [...]
> > [brong at imap14 brong]$ du -s *
> > 759992 xapian.58
>
> How does that break down by table though?  Looking at the sizes of the
> corresponding .DB files before and after will give you most of this info
> (the base files are much smaller, and essentially proportional in size).

The DB is slightly larger now (another week's data indexed), but it
should be fine.

Compact result (ignore cyrus.indexed.db - that's our internal format to
track which records need to be indexed):

[brong at imap14 brong]$ du -s xapian.60/*
8       xapian.60/cyrus.indexed.db
4       xapian.60/iamchert
4       xapian.60/position.baseA
8       xapian.60/position.baseB
496332  xapian.60/position.DB
4       xapian.60/postlist.baseA
4       xapian.60/postlist.baseB
214840  xapian.60/postlist.DB
4       xapian.60/record.baseA
4       xapian.60/record.baseB
1072    xapian.60/record.DB
4       xapian.60/termlist.baseA
4       xapian.60/termlist.baseB
67100   xapian.60/termlist.DB

And the direct copy version - looks like most of the difference is the
postlist:

[brong at imap14 brong]$ du -s xapian.61/*
8       xapian.61/cyrus.indexed.db
0       xapian.61/flintlock
4       xapian.61/iamchert
8       xapian.61/position.baseA
8       xapian.61/position.baseB
500224  xapian.61/position.DB
12      xapian.61/postlist.baseA
12      xapian.61/postlist.baseB
619196  xapian.61/postlist.DB
4       xapian.61/record.baseA
4       xapian.61/record.baseB
1088    xapian.61/record.DB
4       xapian.61/termlist.baseA
4       xapian.61/termlist.baseB
93680   xapian.61/termlist.DB

> With multiple databases as above, the docids are interleaved, so it
> might be worth trying to open each source and copy its documents to
> destdb in turn for better locality of reference, and so better cache
> use.

Sounds sane.  I'll try that.

> That's assuming the raw docid order doesn't matter to you.

Not at all.  I really don't care about docids at all.

> Is the CYRUSID value always non-empty?  If it is, you can actually
> iterate that stream of values directly - something like:

It sure should be - I've had a couple of cases where a message wound up
without a CyrusID... only discovered because it triggered assertion
failures on read.  They should always have one.

> Xapian::ValueIterator it;
> for (it = srcdb.valuestream_begin(SLOT_CYRUSID); it != srcdb.valuestream_end(SLOT_CYRUSID); it++) {
>     if (cb((*it).c_str(), rock)) {
>         Xapian::docid did = it->get_docid();
>         Xapian::Document doc = srcdb.get_document(did);
>         destdb.add_document(doc);
>     }
> }

Going to give that a go, with separate document reads.  Thanks.

> I suspect the document copying actually takes most of the time here,
> unless you're discarding a lot of them.

Yeah, I think so too.

Anyway - I'll keep working on this code.  We need something that does
what it does.

Thanks again,

Bron.
--
  Bron Gondwana
  brong at fastmail.fm