I've extended quartzcompact to allow it to read more than one quartz
database and merge their contents to produce a single output database.
It's handy if you have millions of documents to index - you can create a
number of more modestly sized databases, then merge these to produce a
single large database.
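
As a rough sketch of driving such a merge from a script (Python, purely for illustration): this assumes the source databases are listed first and the destination last, and the database names are made up. Check the usage message of your quartzcompact before relying on the argument order.

```python
import subprocess

def merge_command(split_dbs, merged_db):
    """Build the argument list: every source database, then the destination.

    The argument order is an assumption, and the paths are hypothetical.
    """
    if not split_dbs:
        raise ValueError("need at least one source database")
    return ["quartzcompact", *split_dbs, merged_db]

cmd = merge_command(["split-00", "split-01", "split-02"], "merged")

# To actually run the merge (needs the updated quartzcompact on your PATH):
# subprocess.run(cmd, check=True)
```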

Using this, I recently managed to index 22 million mail messages from
gmane in just over 4 days. The indexing took a bit under 36 hours, and
the rest of the time was spent merging. I just guessed how big to make
the split databases, so with tuning you can probably do rather better.
But for the record, I set the environment variable
XAPIAN_FLUSH_THRESHOLD to 50,000, and put just under 500,000 documents
in each split database (the threshold was 500,000 including spam
messages, etc., which don't get indexed). Having watched the process size
in top, I suspect 100,000 might be better than 50,000 for
XAPIAN_FLUSH_THRESHOLD.
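
To make the numbers above concrete, here is a minimal sketch in Python. The names are made up for illustration and this is not the script used for the gmane run. It only shows the arithmetic of the split threshold and how the environment variable might be passed to an indexer process, written without the thousands separator.

```python
import math
import os

TOTAL_INDEXED = 22_000_000   # messages indexed, from the figures above
SPLIT_THRESHOLD = 500_000    # messages seen per split, spam etc. included
FLUSH_THRESHOLD = 50_000     # value used for XAPIAN_FLUSH_THRESHOLD

def split_number(messages_seen):
    """Which split database the next message goes into (0-based)."""
    return messages_seen // SPLIT_THRESHOLD

# Skipped messages count towards the threshold too, so the real number of
# split databases is at least this many.
min_splits = math.ceil(TOTAL_INDEXED / SPLIT_THRESHOLD)

# Environment for a (hypothetical) indexer child process.
indexer_env = dict(os.environ, XAPIAN_FLUSH_THRESHOLD=str(FLUSH_THRESHOLD))
```

With these figures min_splits comes out at 44, so the merge step has at least that many inputs to combine.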

At present, quartzcompact doesn't produce quite the same output from
merging as it would when compacting a single file. The issue is that
the keys in 3 tables don't exactly sort in docid order, so the merging
used doesn't write the keys in totally sorted order. I'm just testing
to see if this adversely affects the database size. If it does, I can
fix it, at the potential cost of a slightly slower merge.
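
A toy illustration of the kind of effect described above, in Python. The key encoding here is made up (a little-endian base-128 varint) and is NOT quartz's actual key format, and the "appending" strategy is only an assumed stand-in for the current merge. It just shows how keys that don't sort in docid order can come out unsorted when tables are written one after another, while a true key-ordered merge keeps them sorted at the cost of comparing keys.

```python
import heapq

def toy_key(docid):
    """Encode docid as a little-endian base-128 varint (made-up format)."""
    out = bytearray()
    while True:
        byte = docid & 0x7F
        docid >>= 7
        if docid:
            out.append(byte | 0x80)   # more bytes follow
        else:
            out.append(byte)
            return bytes(out)

# Two hypothetical split tables; the second holds the docids that follow
# the first. Each table is sorted by key, as it would be on disk.
table_a = sorted(toy_key(d) for d in range(1, 201))
table_b = sorted(toy_key(d) for d in range(201, 401))

# Writing one table after the other (fine if keys sorted in docid order).
appended = table_a + table_b

# A real merge by key: always sorted, but every key gets compared.
merged = list(heapq.merge(table_a, table_b))

appended_is_sorted = appended == sorted(appended)
merged_is_sorted = merged == sorted(merged)
```

Both outputs hold exactly the same keys. Only the order in which they are written differs.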

You can get the updated quartzcompact.cc from here:
http://cvs.xapian.org/*checkout*/xapian/xapian-core/bin/quartzcompact.cc
You should be able to just drop it into a 0.8.5 xapian-core source tree.
If you try it, please report how you get on!

Cheers,
Olly