Henry C.
2010-Feb-02 12:49 UTC
[Xapian-discuss] Optimal usage of xapian-compact for merging
Greets,

I've been wondering, what's the sane/optimal use of xapian-compact when
merging many indexes with a view to maximum merging performance?

The obvious:
- only use -F on the final db.
- use -m since I'm merging more than 3 dbs.

Best strategy?
a) loop: merge batches (of say 50, where the individual db's are small)
   into a temp index, then merge the (larger) temp into the final product...
   end-loop

b) loop: merge batches (of say 50, where the individual db's are small)
   into many temp indexes... end-loop
   Then merge those (larger) temps into the final product.

Finally, presumably it's best to use the same blocksize (-b) as the
underlying filesystem?  I see the default is 8K, but the default blocksize
on (eg) ext3 is 4k... or am I way off here?

Thanks
Henry
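For illustration, strategy (b) can be sketched as a small planner that only builds the xapian-compact argument lists and runs nothing. This is a rough sketch under assumptions: the function names, the `tmp-merge-` prefix and the default batch size are hypothetical, and the rule "add -m only when more than 3 inputs" is just the rule of thumb from the message above. The only options used are -m and -F, as named in this thread.

```python
from typing import List, Sequence


def _compact_cmd(inputs: Sequence[str], dest: str, fuller: bool) -> List[str]:
    """Build one xapian-compact argument list (hypothetical helper)."""
    cmd = ["xapian-compact"]
    if len(inputs) > 3:
        # Rule of thumb from the thread: multipass only for more than 3 inputs.
        cmd.append("-m")
    if fuller:
        # -F is only applied to the final database.
        cmd.append("-F")
    return cmd + list(inputs) + [dest]


def plan_batched_compact(sources: Sequence[str], final_db: str,
                         batch_size: int = 50,
                         tmp_prefix: str = "tmp-merge-") -> List[List[str]]:
    """Plan strategy (b): many temp indexes first, then one final merge."""
    if not sources:
        raise ValueError("need at least one source database")
    if batch_size < 1:
        raise ValueError("batch_size must be positive")
    if len(sources) <= batch_size:
        # Everything fits in one batch: a single invocation is enough.
        return [_compact_cmd(sources, final_db, fuller=True)]

    commands: List[List[str]] = []
    temps: List[str] = []
    for start in range(0, len(sources), batch_size):
        batch = sources[start:start + batch_size]
        tmp = f"{tmp_prefix}{start // batch_size}"
        commands.append(_compact_cmd(batch, tmp, fuller=False))
        temps.append(tmp)
    # Final stage: merge the (larger) temps into the final product.
    commands.append(_compact_cmd(temps, final_db, fuller=True))
    return commands
```

Each returned list could be handed to `subprocess.run(cmd, check=True)` and timed, which makes it easy to compare this plan against a single invocation over all the databases.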
Olly Betts
2010-Feb-03 05:40 UTC
[Xapian-discuss] Optimal usage of xapian-compact for merging
On Tue, Feb 02, 2010 at 02:49:46PM +0200, Henry C. wrote:
> I've been wondering, what's the sane/optimal use of xapian-compact when
> merging many indexes with a view to maximum merging performance?
>
> The obvious:
> - only use -F on the final db.

That's not totally obvious, but is unlikely to make much difference either
way.

> - use -m since I'm merging more than 3 dbs.

Someone reported -m was slower for them, but it was certainly a win for me.
It does do more work, but without it, the postlist table is an N-way merge,
which scatters reads a lot.  So it's essentially an attempt to avoid being
so I/O bound.

> Best strategy?
> a) loop: merge batches (of say 50, where the individual db's are small)
> into a temp index, then merge the (larger) temp into the final product...
> end-loop
>
> b) loop: merge batches (of say 50, where the individual db's are small)
> into many temp indexes... end-loop
> Then merge those (larger) temps into the final product.

Or just merge all the databases in a single invocation.

I don't have figures to compare these, and it may vary according to your
data, OS, FS, and/or hardware, so all I can really suggest is to try the
different approaches and see.  Do report if you find anything interesting.

Currently the grouping under -m is fairly crude - postlists are just merged
in pairs (plus a three if there are an odd number), and then the merged
lists are remerged in the same way until we have just one, but that may be
reasonable even for mismatched sizes.

It would probably be significantly faster not to use a Btree for the
intermediate stages, but just serialise it to a flat file - we will end up
rereading it in order.  That would only make a difference when merging more
than 3 databases though.  I should file a ticket for it - it would make a
fairly self-contained project for someone wanting to hack on Xapian without
needing to understand much of the internals.

> Finally, presumably it's best to use the same blocksize (-b) as the
> underlying filesystem?  I see the default is 8K, but the default blocksize
> on (eg) ext3 is 4k... or am I way off here?

It should certainly not be smaller than the hardware blocksize (or else you
need to read the existing disk-block in order to write a Xapian-block).  A
multiple is fine though, and larger blocks are a bit more efficient.

I did some tests a year or so ago which suggested 16KB might be slightly
better than 8KB, but it is sufficiently close that it didn't seem to justify
changing the default.

Cheers,
    Olly
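The grouping described for -m (pairs, plus one group of three when the count is odd, repeated until one list remains) can be illustrated with a toy model. This is only a sketch of the grouping as worded in the reply, not Xapian's actual code: postlists are stood in for by sorted lists of integers, and `group_pairs` and `multipass_merge` are hypothetical names.

```python
import heapq
from typing import List, Sequence, Tuple


def group_pairs(items: Sequence[list]) -> List[List[list]]:
    """Split items into pairs, with one trailing group of three if odd."""
    n = len(items)
    if n <= 3:
        return [list(items)]
    # For an odd count, keep the last three together; pair up the rest.
    cut = n - 3 if n % 2 else n
    groups = [list(items[i:i + 2]) for i in range(0, cut, 2)]
    if n % 2:
        groups.append(list(items[cut:]))
    return groups


def multipass_merge(postlists: Sequence[Sequence[int]]) -> Tuple[List[int], int]:
    """Merge sorted lists in passes; return (merged list, number of passes)."""
    current = [list(p) for p in postlists]
    passes = 0
    while len(current) > 1:
        # Each group is a 2-way (or 3-way) merge of already sorted lists,
        # so every pass reads its inputs sequentially.
        current = [list(heapq.merge(*group)) for group in group_pairs(current)]
        passes += 1
    return (current[0] if current else []), passes
```

In this model, five input lists take two passes (a pair and a three, then a final pair), whereas a single N-way merge would read from all five inputs at once, which is the scattered-read pattern the reply describes.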