On Fri, Jun 11, 2010 at 11:57:10AM +0200, Henry C. wrote:
> I've had xapian-compact (without -F) sessions running for several days now
> on 10 'merge' machines and I've noticed that the average compaction
> average can swing wildly:
>
> 18% 76% 10% 19% 39% 13% 69% 43% 19% 42%
>
> The average so far is about 35% (ie, 65% reduction in target index sizes,
> which is unexpected and pleasingly welcomed).
>
> I'm curious about the large variance in those numbers -- simply the wildly
> varying nature of the data being merged (typical website data), or some
> other factor I'm missing?

There are two types of unused space: each block in the Btree table may
not be 100% full, and some previously used blocks may now be marked as
free. Considering the latter case first:

Blocks inside the Btree tables are marked as free when entries are deleted,
but the file size isn't (currently at least) ever reduced in normal
operation (we could only release blocks from the end of the file anyway).
Free blocks are just reused the next time a new block is needed.

So if you delete a lot of documents and then compact, the size will drop
by the space those documents used to use.

Also, when you make changes to a database, the existing version of the
documents, etc. is kept intact and changes are written into new blocks
(which may be reused blocks marked as free), and then when the changes
are committed, the old versions are marked as free. So if you add a
large batch of changes and then commit, you'll have a lot of blocks
now marked as free.
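
To make the free-block effect concrete, here is a minimal sketch of a file that only ever grows and reuses freed blocks. It is a toy model, not Xapian's actual code: the class `ToyBlockFile` and its methods are hypothetical names made up for illustration, and the block counts are arbitrary.

```python
class ToyBlockFile:
    """Toy model of a block file: freed blocks are reused, the file never shrinks."""

    def __init__(self):
        self.total_blocks = 0  # file size in blocks; only ever grows
        self.free = []         # block numbers currently marked as free

    def allocate(self):
        # Reuse a free block if there is one, otherwise extend the file.
        if self.free:
            return self.free.pop()
        self.total_blocks += 1
        return self.total_blocks - 1

    def release(self, block):
        # Mark the block as free; the file size is left unchanged.
        self.free.append(block)

    def used_blocks(self):
        return self.total_blocks - len(self.free)

    def compacted_fraction(self):
        # Size after dropping free blocks, relative to the current file size
        # (ignores how full each remaining block is).
        return self.used_blocks() / self.total_blocks


f = ToyBlockFile()
blocks = [f.allocate() for _ in range(100)]  # write 100 blocks
for b in blocks[:60]:                        # delete most of the data
    f.release(b)

print(f.total_blocks, f.used_blocks(), f.compacted_fraction())

for _ in range(10):                          # new writes reuse free blocks
    f.allocate()
print(f.total_blocks, f.used_blocks())
```

In this model the file stays at 100 blocks however many are released, so the gain from compacting depends on how many free blocks happen to be outstanding at the time, which is one plausible source of the variation between machines.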
And then there's how full each block is:

In normal operation, a block splits when a new entry won't fit in it,
so blocks usually vary between 50% and 100% full, and you'd expect the
average to be about 75%. But if you append documents to a database in
batches, then the Btree tables will shift into "sequential addition mode",
which will result in new blocks being mostly full, so compacting a database
you have updated in this way won't save much space.

In fact what xapian-compact essentially does is to create a fresh table
and do a sequential copy of all the entries from the old table so that
they all get added in this mode.
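
The difference between the two insertion patterns can be sketched with a toy simulation. This is only an illustration under simplifying assumptions, not how Xapian's Btree is implemented: it models leaf blocks alone, every entry has the same size, `CAPACITY` and the key count are arbitrary, and `average_fill` is a hypothetical helper. A full block is split in half, except when a key is appended past the end of the last block, in which case a new block is started and the old one is left full.

```python
import bisect
import random

CAPACITY = 64  # hypothetical number of entries per block


def average_fill(keys, capacity=CAPACITY):
    """Insert distinct keys into toy leaf blocks and return the mean fill (0..1)."""
    blocks = [[]]            # each block is a sorted list of keys
    lows = [float("-inf")]   # lowest key each block is responsible for
    for key in keys:
        i = bisect.bisect_right(lows, key) - 1
        block = blocks[i]
        if len(block) == capacity:
            if i == len(blocks) - 1 and key > block[-1]:
                # Appending at the very end: start a new block and leave
                # the old one full (the "sequential addition" case).
                blocks.append([key])
                lows.append(key)
                continue
            # Otherwise split the full block into two half-full blocks.
            mid = capacity // 2
            right = block[mid:]
            del block[mid:]
            blocks.insert(i + 1, right)
            lows.insert(i + 1, right[0])
            if key > right[0]:
                block = right
        bisect.insort(block, key)
    return sum(len(b) for b in blocks) / (capacity * len(blocks))


rng = random.Random(2010)
keys = list(range(20000))
rng.shuffle(keys)

random_fill = average_fill(keys)             # updates arriving in random order
compacted_fill = average_fill(sorted(keys))  # sequential copy, as in compaction

print(f"random-order fill: {random_fill:.0%}")
print(f"sequential fill:   {compacted_fill:.0%}")
print(f"size after copy:   {random_fill / compacted_fill:.0%} of original")
```

In this toy model random-order insertion tends to come out somewhat below the 75% midpoint, while the sorted copy leaves nearly every block full, so the sequential copy alone accounts for a noticeable size reduction even with no free blocks at all.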

So perhaps interesting, depending on what interests you!

Cheers,
Olly