thr3ads.net - Xapian discuss - [Xapian-discuss] How to index a lot of documents quickly [Mar 2005]

Olly Betts

2005-Mar-03 00:05 UTC

[Xapian-discuss] How to index a lot of documents quickly

I've extended quartzcompact to allow it to read more than one quartz
database and merge their contents to produce a single output database.

It's handy if you have millions of documents to index - you can create a
number of more modestly sized databases, then merge these to produce a
single large database.

Using this, I recently managed to index 22 million mail messages from
gmane in just over 4 days.  The indexing took a bit under 36 hours, and
the rest of the time was spent merging.  I just guessed how big to make
the split databases, so with tuning you can probably do rather better.

But for the record, I set the environment variable
XAPIAN_FLUSH_THRESHOLD to 50,000, and put just under 500,000 documents
in each split database (the threshold was 500,000 including spam
messages, etc which don't get indexed).  Having watched the process size
in top, I suspect 100,000 might be better than 50,000 for
XAPIAN_FLUSH_THRESHOLD.
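
For anyone wanting to script the same setup, here is a rough sketch of the
bookkeeping only. It assumes the indexer runs in (or is launched from) the
same process, so that it inherits the environment variable. The directory
naming scheme and the helper names split_path and plan_splits are made up
for illustration and are not part of Xapian.

```python
import os

# Value used in the run described above. Set it before the indexer opens
# a database (assumption: the indexer inherits this process's environment).
os.environ["XAPIAN_FLUSH_THRESHOLD"] = "50000"

# Threshold per split database. It counts every message seen, including
# spam that is skipped, so each split holds just under this many documents.
MESSAGES_PER_SPLIT = 500_000


def split_path(message_number, prefix="split"):
    """Return a (hypothetical) directory name for the split database
    that the message with this zero-based sequence number belongs to."""
    return f"{prefix}-{message_number // MESSAGES_PER_SPLIT:03d}"


def plan_splits(total_messages):
    """Return how many split databases total_messages messages need."""
    return -(-total_messages // MESSAGES_PER_SPLIT)  # ceiling division
```

With these numbers, plan_splits(22_000_000) gives 44, so the gmane run would
have needed at least that many split databases (more once skipped spam is
counted against the threshold).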

At present, quartzcompact doesn't produce quite the same output from
merging as it would when compacting a single file.  The issue is that
the keys in 3 tables don't exactly sort in docid order, so the merging
used doesn't write the keys in totally sorted order.  I'm just testing
to see if this adversely affects the database size.  If it does, I can
fix it, at the potential cost of a slightly slower merge.
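
To illustrate the kind of ordering problem meant here, the toy sketch below
uses a deliberately simple, hypothetical key encoding (decimal text). This
is NOT quartz's actual key format, and concat_merge is only an assumed
stand-in for a merge that walks the inputs in docid order. It shows how
output can end up not fully sorted when byte order and docid order differ,
and how a true sorted merge (which has to compare keys) avoids that.

```python
import heapq


def hypothetical_key(docid):
    # Illustration only: decimal text, NOT quartz's real encoding.
    # Byte order of these keys differs from numeric docid order.
    return str(docid).encode("ascii")


def concat_merge(*tables):
    """Append each input's keys in turn. The result is sorted only if
    every key in a later input sorts after every key in earlier ones."""
    out = []
    for table in tables:
        out.extend(table)
    return out


# Two small "tables", each sorted by key as a B-tree would store them.
db1 = sorted(hypothetical_key(d) for d in range(1, 13))   # docids 1..12
db2 = sorted(hypothetical_key(d) for d in range(13, 25))  # docids 13..24

merged = concat_merge(db1, db2)
is_sorted = merged == sorted(merged)

# A k-way merge by key yields fully sorted output, at the cost of a
# key comparison per entry.
fully_sorted = list(heapq.merge(db1, db2))
```

In this toy case is_sorted is False, because b'9' from the first input is
followed by b'13' from the second.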

You can get the updated quartzcompact.cc from here:

http://cvs.xapian.org/*checkout*/xapian/xapian-core/bin/quartzcompact.cc

You should be able to just drop it into a 0.8.5 xapian-core source tree.

If you try it, please report how you get on!

Cheers,
    Olly

Olly Betts

2005-Mar-03 13:51 UTC

[Xapian-discuss] How to index a lot of documents quickly

On Thu, Mar 03, 2005 at 12:05:15AM +0000, Olly Betts wrote:
> At present, quartzcompact doesn't produce quite the same output from
> merging as it would when compacting a single file.  The issue is that
> the keys in 3 tables don't exactly sort in docid order, so the merging
> used doesn't write the keys in totally sorted order.  I'm just testing
> to see if this adversely affects the database size.  If it does, I can
> fix it, at the potential cost of a slightly slower merge.

I'm currently running the output of the gmane merge through
quartzcompact.  It's reduced the size of the record table by 43% (!), so
I think I need to address this (the postlist table is unsurprisingly
unchanged, and the value table isn't used so that will be unchanged
too - it's still working on the other 2).

Cheers,
    Olly
