Robert Kaye wrote:
> Hi!
>
> After more work I've managed to get Xapian to work better all around
> than our previous text search engine. I've been able to tweak, or work
> around the idiosyncrasies of our data/setup and am getting results I'm
> quite happy with. Big thumbs up to the Xapian dev team!
>
> I often times get rewarded with good chocolate from various corners of
> the world. Do you folks like good chocolate? I can share!
I probably eat too much chocolate already, but thanks for the thought!
> Onward: However, indexing speed is a bit of a problem for me;
> smaller indexes build faster than the previous system, large indexes
> take about 2-3 times as long.
>
> I noticed disk access is very spikey -- every 3-5 seconds utilization
> goes to 100%. Then there are long periods of 100% disk utilization. My
> CPU is never very busy -- at most I find a 50% - 60% load. And the
> indexing process only uses about 5% of available RAM. Is there any way
> I can instruct Xapian to use more resources to speed up indexing?
Yes - the XAPIAN_FLUSH_THRESHOLD environment variable sets how many
document changes Xapian buffers in memory during an indexing session
before applying them to disk. The default is to buffer changes to 10000
documents, which is probably a little low for modern systems
(unless the documents are very large). Too low a setting will result in
slow indexing, due to having to do lots of extra IO. Too high a setting
will be even slower, due to the indexing process getting into swap. The
ideal is probably to find a value which results in around half of your
memory being used by the indexing process (leaving the other half of the
memory available for the system to cache disk pages).
If you're currently only seeing around 5% of RAM used, I'd try setting
XAPIAN_FLUSH_THRESHOLD=100000 - hopefully that will result in about 50%
being used.
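
The arithmetic behind that suggestion can be sketched roughly as below.
This assumes memory use grows about linearly with the threshold, which
is only a first guess and worth checking against real measurements. The
helper names (suggest_threshold, indexer_env, run_indexer) are
hypothetical, and the child process here is a stand-in that just prints
the variable; replace it with whatever command runs your own indexer.

```python
import os
import subprocess
import sys


def suggest_threshold(current_threshold, current_ram_pct, target_ram_pct=50):
    """Scale the flush threshold linearly towards a target RAM percentage.

    Assumption: memory use is roughly proportional to the threshold.
    """
    if current_ram_pct <= 0:
        raise ValueError("current_ram_pct must be positive")
    return current_threshold * target_ram_pct // current_ram_pct


def indexer_env(threshold):
    """Return a copy of the current environment with the threshold set."""
    env = dict(os.environ)
    env["XAPIAN_FLUSH_THRESHOLD"] = str(threshold)
    return env


def run_indexer(command, threshold):
    """Run an indexing command (hypothetical) with the threshold applied."""
    return subprocess.run(command, env=indexer_env(threshold), check=True)


if __name__ == "__main__":
    # 10000 documents currently uses about 5% of RAM; aim for about 50%.
    threshold = suggest_threshold(10000, 5)
    # Stand-in child process that only echoes the variable it received.
    echo = "import os; print(os.environ['XAPIAN_FLUSH_THRESHOLD'])"
    run_indexer([sys.executable, "-c", echo], threshold)
```

If the measured memory use ends up well away from half of RAM, adjust
the threshold and try again rather than trusting the linear estimate.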
Ideally, this would tune itself automatically, but we've not had time to
get around to that yet. There are also lots of other things we could
work on to improve indexing speed, which we've not got around to either.
Another approach, if your index is large, is to build several small
indexes, and then merge them together with "xapian-compact". (Probably
with the "-m" option to do multipass merging, if you end up with _lots_
of small indexes.) This method is a bit clunky, but can build large
indexes much faster than doing it in one go. At some point, we'll
probably merge xapian-compact into the main API, but for now it's only
available as a standalone executable.
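
A minimal sketch of driving that merge step from a script follows. It
only assembles the xapian-compact command line (source databases first,
destination last, with "-m" when multipass merging is wanted). The
function name compact_command and the directory names part-00 ...
merged are made up for illustration, and when to turn on multipass is
left to the caller.

```python
import shlex


def compact_command(sources, destination, multipass=False):
    """Build the argument list for merging indexes with xapian-compact.

    sources: paths of the small indexes to merge (hypothetical names below).
    destination: path of the merged index to create.
    multipass: pass "-m" to request multipass merging.
    """
    if not sources:
        raise ValueError("at least one source index is required")
    command = ["xapian-compact"]
    if multipass:
        command.append("-m")
    command.extend(sources)
    command.append(destination)
    return command


if __name__ == "__main__":
    parts = ["part-%02d" % i for i in range(4)]
    # Print the command rather than running it, so the sketch works
    # on a machine without Xapian installed.
    print(shlex.join(compact_command(parts, "merged", multipass=True)))
```

To actually run the merge, pass the returned list to subprocess.run
with check=True once the small indexes have been built.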
> My index could also be built on a RAM disk -- I suspect that would help,
> but I'm curious as to what the best practices are...
It might well do; if you experiment with this, I'd be interested to know
how the speed compares.
--
Richard