thr3ads.net - Xapian discuss - [Xapian-discuss] Optimal usage of xapian-compact for merging [Feb 2010]

Henry C.

2010-Feb-02 12:49 UTC

[Xapian-discuss] Optimal usage of xapian-compact for merging

Greets,

I've been wondering, what's the sane/optimal use of xapian-compact when
merging many indexes with a view to maximum merging performance?

The obvious:
- only use -F on the final db.
- use -m since I'm merging more than 3 dbs.

Best strategy?
a)  loop:  merge batches (of say 50, where the individual db's are small)
into a temp index, then merge the (larger) temp into the final product...
end-loop

b)  loop:  merge batches (of say 50, where the individual db's are small)
into many temp indexes... end-loop
Then merge those (larger) temps into the final product.

Finally, presumably it's best to use the same blocksize (-b) as the
underlying filesystem?  I see the default is 8K, but the default blocksize
on (eg) ext3 is 4k...  or am I way off here?

Thanks
Henry

Olly Betts

2010-Feb-03 05:40 UTC

On Tue, Feb 02, 2010 at 02:49:46PM +0200, Henry C. wrote:
> I've been wondering, what's the sane/optimal use of xapian-compact when
> merging many indexes with a view to maximum merging performance?
> 
> The obvious:
> - only use -F on the final db.
That's not totally obvious, but is unlikely to make much difference either
way.
> - use -m since I'm merging more than 3 dbs.
Someone reported -m was slower for them, but it was certainly a win for me.
It does do more work, but without it, the postlist table is built by a single
N-way merge, which scatters reads a lot.  So -m is essentially an attempt to
avoid being so I/O bound.
> Best strategy?
> a)  loop:  merge batches (of say 50, where the individual db's are small)
> into a temp index, then merge the (larger) temp into the final product...
> end-loop
> 
> b)  loop:  merge batches (of say 50, where the individual db's are small)
> into many temp indexes... end-loop
> Then merge those (larger) temps into the final product.
Or just merge all the databases in a single invocation.

I don't have figures to compare these, and it may vary according to your
data, OS, FS, and/or hardware, so all I can really suggest is to try the
different approaches and see.  Do report if you find anything interesting.
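
For anyone wanting to run that comparison, here is a rough Python sketch that only builds the command lines for the single-invocation approach and for strategy (b); it does not execute anything. The helper names, the batch size, and the `tmp_merge_` prefix are made up for illustration. The only xapian-compact options used are the -m, -F and -b flags discussed in this thread.

```python
from typing import List, Optional


def compact_cmd(sources: List[str], dest: str, multipass: bool = False,
                fuller: bool = False, blocksize: Optional[str] = None) -> List[str]:
    """Build the argv list for one xapian-compact run (nothing is executed)."""
    cmd = ["xapian-compact"]
    if multipass:
        cmd.append("-m")
    if fuller:
        cmd.append("-F")
    if blocksize:
        cmd += ["-b", blocksize]
    # Source databases come first, the destination is the last argument.
    return cmd + list(sources) + [dest]


def single_plan(sources: List[str], dest: str) -> List[List[str]]:
    """Merge everything in a single invocation."""
    return [compact_cmd(sources, dest, multipass=True, fuller=True)]


def batched_plan(sources: List[str], dest: str, batch_size: int = 50,
                 tmp_prefix: str = "tmp_merge_") -> List[List[str]]:
    """Strategy (b): merge batches into temp databases, then merge the temps.

    -F is only applied to the final merge, as suggested in the question.
    The temp database names are hypothetical.
    """
    plan, temps = [], []
    for i in range(0, len(sources), batch_size):
        tmp = f"{tmp_prefix}{i // batch_size}"
        plan.append(compact_cmd(sources[i:i + batch_size], tmp, multipass=True))
        temps.append(tmp)
    plan.append(compact_cmd(temps, dest, multipass=True, fuller=True))
    return plan
```

Each command in a plan could then be passed to `subprocess.run(cmd, check=True)` with a timer around the whole plan, ideally on a cold cache so that the I/O cost is visible.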

Currently the grouping under -m is fairly crude - postlists are just merged
in pairs (plus one group of three if there is an odd number), and then the
merged lists are remerged in the same way until we have just one, but that
may be reasonable even for mismatched sizes.
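
To make the grouping concrete, here is a small Python simulation of that pairing scheme. It is not Xapian's code; it just follows the description above. Which inputs end up in the group of three is not stated, so putting the three at the end is an assumption, and the function names are invented.

```python
def group_once(items):
    """Split items into pairs, with one trailing group of three if the count is odd.

    Assumption: the odd one out joins the last pair; the description above
    does not say where the group of three goes.
    """
    items = list(items)
    groups = []
    while len(items) > 3 or len(items) == 2:
        groups.append(items[:2])
        items = items[2:]
    if items:  # one or three items left over
        groups.append(items)
    return groups


def merge_passes(names):
    """Return the grouping used at each pass until a single list remains."""
    level = list(names)
    passes = []
    while len(level) > 1:
        groups = group_once(level)
        passes.append(groups)
        # Each merged group becomes one input for the next pass.
        level = ["(" + "+".join(g) + ")" for g in groups]
    return passes


if __name__ == "__main__":
    for n, groups in enumerate(merge_passes([f"db{i}" for i in range(7)]), 1):
        print(f"pass {n}: {groups}")
```

Under these assumptions, 50 input databases take five passes (25, 12, 6, 3 and then 1 merged list), so each posting is rewritten several times, which is the extra work -m trades for more sequential reads.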

It would probably be significantly faster not to use a Btree for the
intermediate stages, but just serialise it to a flat file - we will end up
rereading it in order.  That would only make a difference when merging more
than 3 databases though.

I should file a ticket for it - it would make a fairly self-contained project
for someone wanting to hack on Xapian without needing to understand much of the
internals.
> Finally, presumably it's best to use the same blocksize (-b) as the
> underlying filesystem?  I see the default is 8K, but the default blocksize
> on (eg) ext3 is 4k...  or am I way off here?
It should certainly not be smaller than the hardware blocksize (or else you
need to read the existing disk-block in order to write a Xapian-block).  A
multiple is fine though, and larger blocks are a bit more efficient.  I did
some tests a year or so ago which suggested 16KB might be slightly better than
8KB, but it is sufficiently close that it didn't seem to justify changing the
default.
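
As a sanity check for a candidate -b value, something like the following Python sketch could be used. It treats the filesystem block size reported by `os.statvfs` as a stand-in for the hardware block size, which is an assumption, and the 2K to 64K power-of-two limits are what I understand xapian-compact to accept, so check `xapian-compact --help` on your version. The function names are invented.

```python
import os


def fs_block_size(path="."):
    """Filesystem block size as reported by statvfs (POSIX only)."""
    return os.statvfs(path).f_bsize


def blocksize_ok(xapian_bs, fs_bs):
    """True if xapian_bs is a plausible -b value for a filesystem with block size fs_bs.

    Assumed limits: a power of two between 2K and 64K.
    Per the advice above: not smaller than the disk block, and a multiple of it.
    """
    power_of_two = xapian_bs > 0 and (xapian_bs & (xapian_bs - 1)) == 0
    in_range = 2048 <= xapian_bs <= 65536
    return (power_of_two and in_range
            and xapian_bs >= fs_bs and xapian_bs % fs_bs == 0)
```

For example, `blocksize_ok(8192, 4096)` and `blocksize_ok(16384, 4096)` both hold, while `blocksize_ok(2048, 4096)` does not, since each write would then touch only part of a disk block.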

Cheers,
    Olly
