Hi Markus,
I have no answers for most of your questions, but can confirm your
savings are quite large. Our largest database contains 1.276.595
documents and the total database size is about 25GB. The total
plain-text size of the corpus is about 15-16GB. Our compacted database
is about 16GB in size.
Our database gets only one update run per day, however, and the number
of documents added and replaced per day varies between 150-300 and
300-600 respectively. We normally have no or close to zero deletes.
So we have far fewer changes than you, but our index is now 6 months old.
Our strategy is to keep a working copy for 'fast' updates and a
compacted version for faster retrieval. We haven't actually done any
recent benchmarks, but it's supposed to be slightly faster this way both
in terms of indexing and retrieval.
Our compaction ratio is much less dramatic than yours; only our
postlist (and the much smaller value table) sees large savings:
table        before  after  saved
position.DB  11G     11G     3,92%
postlist.DB  9.7G    4.5G   54,17%
record.DB    229M    220M    3,78%
termlist.DB  3.2G    3.0G    8,32%
value.DB     91M     46M    50,19%
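For comparing figures like these with what xapian-compact prints, the
saving is simply (before - after) / before. Here is a minimal sketch in
Python, using the per-table sizes from your xapian-compact output quoted
below. The function name is just something I made up for illustration.

```python
def reduction_pct(before_k: int, after_k: int) -> float:
    """Percentage of space saved by compaction, given sizes in K."""
    if before_k <= 0:
        raise ValueError("before_k must be positive")
    return 100.0 * (before_k - after_k) / before_k


# (before, after) sizes in K, copied from the quoted xapian-compact output.
tables = {
    "postlist": (3179480, 734840),
    "record": (2208432, 762080),
    "termlist": (1650760, 540448),
    "position": (4137032, 1794872),
    "value": (488840, 91576),
}

for name, (before, after) in tables.items():
    print(f"{name:9s} {reduction_pct(before, after):7.3f}%")

# Overall saving across all five tables.
total_before = sum(before for before, _ in tables.values())
total_after = sum(after for _, after in tables.values())
print(f"{'total':9s} {reduction_pct(total_before, total_after):7.3f}%")
```

Summed over all tables this comes out at roughly two thirds, which is in
line with the 12 GB to 3.8 GB you reported.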
I don't really understand your objection to running xapian-compact. The
main disadvantage of compaction is that your first few update runs will
have to do a relatively large number of (extra) block splits. But then
again, you may actually gain indexing performance in your case, for the
same reason retrieval performance should increase quite a bit.
For retrieval speed, most gains come from reducing the amount of I/O.
This is done in two ways: by being smart about which blocks to read, and
by simply having fewer blocks to read in total.
In your case you can dramatically increase the file system's cache-hit
ratio if you have less than 16GB of memory in your server. And even if
you have 16GB or more, there are simply far fewer blocks to be read for
every single query, so you should still win.
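To put a rough number on the cache argument, here is a back-of-the-envelope
sketch. It only computes an upper bound on how much of the index could sit
in the file system cache. The 8 GB of cache is a purely hypothetical figure
(I don't know your server), and the index sizes are the 12 GB and 3.8 GB
from your mail.

```python
def max_cached_fraction(index_gb: float, cache_gb: float) -> float:
    """Upper bound on the fraction of the index that fits in the cache."""
    if index_gb <= 0:
        raise ValueError("index_gb must be positive")
    return min(1.0, cache_gb / index_gb)


CACHE_GB = 8.0  # hypothetical memory available for the file system cache

for label, size_gb in (("before compaction", 12.0), ("after compaction", 3.8)):
    fraction = max_cached_fraction(size_gb, CACHE_GB)
    print(f"{label}: at most {fraction:.0%} of the index cached")
```

With those assumed numbers the uncompacted index can never be fully
cached, while the compacted one fits entirely.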
If you have to offer a continuously updating database to your users, I'd
only run xapian-compact after you've made some major change to the
database. And if by 'all documents got rebuilt about 5 times' you mean
that you could also have started from scratch, built a new index with
all the documents, and thrown away the old one once you were done, I
would do that too.
If you only update periodically, you could try to keep two copies of
your database, one 'working version' and one 'retrieval version' which
is just a compacted version of the working copy. And in your case you
could also decide to replace the working copy with the retrieval copy
once in a while.
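As a rough illustration of the two-copy idea, the sketch below compacts
the working copy into a fresh directory and then repoints a symlink that
the search side opens. This is not how our own setup is scripted; the
directory layout, the symlink trick and all function names are assumptions
made for the example. The only Xapian-specific part is the plain
"xapian-compact SOURCE DESTINATION" invocation, and it assumes no update
run is in progress while it runs.

```python
import os
import subprocess
import uuid


def run_xapian_compact(source, destination):
    """Compact `source` into `destination` with the xapian-compact CLI."""
    subprocess.run(["xapian-compact", source, destination], check=True)


def publish_compacted(working_db, serve_root, link_name="current",
                      compact=run_xapian_compact):
    """Compact the working copy and repoint the retrieval symlink at it.

    Returns the path of the newly published retrieval copy. The layout
    (serve_root, link_name) is hypothetical, not something Xapian requires.
    """
    os.makedirs(serve_root, exist_ok=True)

    # Fresh, not-yet-existing directory name for this compaction run.
    new_name = "db-" + uuid.uuid4().hex
    new_dir = os.path.join(serve_root, new_name)
    compact(working_db, new_dir)

    # Build the new symlink next to the old one, then rename it into place.
    # On POSIX the rename is atomic, so readers opening the link see either
    # the old copy or the new one, never a half-written directory.
    link = os.path.join(serve_root, link_name)
    tmp_link = link + ".tmp"
    if os.path.lexists(tmp_link):
        os.remove(tmp_link)
    os.symlink(new_name, tmp_link)  # relative target, resolved in serve_root
    os.replace(tmp_link, link)
    return new_dir
```

After each update run you would call something like
publish_compacted("/path/to/working", "/path/to/retrieval") and have the
search side open "/path/to/retrieval/current". Older "db-..." directories
are left in place here; removing them once no reader uses them any more
is up to you.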
Best regards,
Arjen
On 5-9-2008 20:14 Markus Wörle wrote:
> Hi,
>
> I just ran xapian-compact on an index which consumes about 12 GB of
> disk space, containing 858.383 documents with an average doclength of
> 169.018, and was surprised by a huge compaction factor which I hadn't
> expected. After compaction, the index needed only 3.8 GB on disk.
>
> My expectation was that it would only shrink by about 25% or so,
> because I expected the average fill of the B-tree blocks to be about 75%.
>
> This is what xapian-compact said:
>
> postlist: Reduced by 76.888% 2444640K (3179480K -> 734840K)
> record: Reduced by 65.4923% 1446352K (2208432K -> 762080K)
> termlist: Reduced by 67.2607% 1110312K (1650760K -> 540448K)
> position: Reduced by 56.6145% 2342160K (4137032K -> 1794872K)
> value: Reduced by 81.2667% 397264K (488840K -> 91576K)
> spelling: Size unchanged (0K)
> synonym: Size unchanged (0K)
>
> My index's brief history:
>
> The index was once built from scratch with add_document(), and got
> updated by a large number of replace_document_by_term(),
> add_document(), and delete_document_by_term() calls over a longer
> period (about 2 months or so). Some numbers: about 1 million
> modifications per day, of which about 4000 are document adds and 3000
> are removes. Additionally, in this 2-month period, all documents got
> rebuilt about 5 times by using replace_document_by_term() on a unique
> term for each document.
>
> So my question is: Is this reasonable? Or rather, do you have any
> idea why my B-trees are so empty? Does Xapian merge sparsely populated
> blocks again?
>
> I am currently planning to stop indexing once a day to run
> xapian-compact, but I am uncertain whether this would "denaturate" the
> system. I have many modifications, and although "best indexing
> performance" is not really a concern in my use case, I feel somehow
> bad about manipulating a naturally balanced B-tree in a non-changing
> environment. What do you suggest?
>
> Thanks,
> mrks
> _______________________________________________
> Xapian-discuss mailing list
> Xapian-discuss at lists.xapian.org
> http://lists.xapian.org/mailman/listinfo/xapian-discuss
>