2009/10/23 Henry <henka at cityweb.co.za>:
> Either Xapian is not as IO intensive as I always thought, or I'm missing
> something.
>
> I've been running some tests to assess how many search nodes I'll need for a
> nnnGB index to ensure ~1s search query performance.
>
> The idea was to reduce the number of nodes needed using SSDs (since the
> performance gains (eg) on an IO intensive DB are staggering) versus the
> number needed using standard SATA disks (ie, larger index on SSDs using less
> nodes, versus smaller indexes on more nodes using slower SATA hard drives).
>
> Anyway, the results are disappointing. The SSD provides no appreciable
> performance gain at all (aside: the SSD was using ext2 since it was also
> used to test a DB app, which didn't need the journalling overhead of ext3 -
> this might explain the .10 - .80 second average *slower* performance of the
> SSD).
>
> My gut is that Xapian is more sequential-read intensive (not random IO)
> which would explain this disappointing result. Am I right?

Xapian tries hard to do sequential reads, yes; the database is block
structured (by default 8KB blocks), so chunks of data are read at once,
and the main data is stored in sorted order. However, the type of
load it puts on the IO system depends very much on the size of the
database compared to the size of main memory in the system (which
the OS uses for caching). Also, more complex queries will
involve more random access (skipping between data for different
terms).
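
To get a feel for how the two access patterns above behave on a given
disk, here is a rough Python sketch that reads a file in 8KB blocks, once
in order and once in shuffled order. It does not touch Xapian at all:
read_blocks, compare_access_patterns and make_test_file are hypothetical
helper names, and the throwaway file only stands in for a real table file.

```python
import os
import random
import tempfile
import time

BLOCK_SIZE = 8 * 1024  # 8KB, the default block size mentioned above


def read_blocks(path, block_numbers, block_size=BLOCK_SIZE):
    """Read the given blocks in the given order.

    Returns (bytes_read, elapsed_seconds).
    """
    total = 0
    start = time.perf_counter()
    # buffering=0 avoids Python-level read-ahead; the OS cache still applies.
    with open(path, "rb", buffering=0) as f:
        for n in block_numbers:
            f.seek(n * block_size)
            total += len(f.read(block_size))
    return total, time.perf_counter() - start


def compare_access_patterns(path, block_size=BLOCK_SIZE, seed=0):
    """Time one in-order pass and one shuffled pass over every whole block."""
    n_blocks = os.path.getsize(path) // block_size
    sequential = list(range(n_blocks))
    shuffled = sequential[:]
    random.Random(seed).shuffle(shuffled)
    seq_bytes, seq_time = read_blocks(path, sequential, block_size)
    rnd_bytes, rnd_time = read_blocks(path, shuffled, block_size)
    return {
        "blocks": n_blocks,
        "sequential_bytes": seq_bytes,
        "random_bytes": rnd_bytes,
        "sequential_s": seq_time,
        "random_s": rnd_time,
    }


def make_test_file(n_blocks, block_size=BLOCK_SIZE):
    """Create a throwaway file of n_blocks blocks (a stand-in, not a real DB)."""
    fd, path = tempfile.mkstemp()
    with os.fdopen(fd, "wb") as f:
        for _ in range(n_blocks):
            f.write(os.urandom(block_size))
    return path
```

On a file that fits in the OS cache, both passes will probably take about
the same time; any difference should only show up on a file much larger
than RAM, or after the cache has been cleared.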

For a "small" database (probably up to around twice the size of RAM),
once the OS cache of the database is "hot", I'd expect the speed of
the underlying disk to be of low importance. However, for a much
bigger database, I'd expect the speed of storage to be far more
important.
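
As a quick way to see which side of that rough line a database falls on,
something like the following Python sketch could be used. The function
names are hypothetical, the factor of 2 is only the rule of thumb from the
paragraph above (not a hard limit), and physical_ram_bytes assumes the
sysconf names available on Linux.

```python
import os


def directory_size(path):
    """Total size in bytes of all regular files under path."""
    total = 0
    for root, _dirs, files in os.walk(path):
        for name in files:
            full = os.path.join(root, name)
            if os.path.isfile(full):
                total += os.path.getsize(full)
    return total


def physical_ram_bytes():
    """Total physical memory (assumes Linux-style sysconf names)."""
    return os.sysconf("SC_PAGE_SIZE") * os.sysconf("SC_PHYS_PAGES")


def storage_speed_matters(db_bytes, ram_bytes, small_factor=2.0):
    """Apply the rule of thumb: 'small' means up to about small_factor x RAM.

    Returns (likely_matters, ratio), where ratio is db size / RAM size.
    """
    ratio = db_bytes / ram_bytes
    return ratio > small_factor, ratio
```

For example, storage_speed_matters(directory_size("/path/to/db"),
physical_ram_bytes()) gives the ratio for a database directory, where
"/path/to/db" is a placeholder.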

A few suggestions:

 - try forcing the OS cache to clear (eg, by doing
   "echo 3 > /proc/sys/vm/drop_caches" as root) and comparing the times
   of uncached queries.
 - try performing a phrase search, and comparing its times across
   different backends.
 - try profiling (eg, with oprofile, or just with "time") to see if
   you're CPU or IO bound.
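
The last suggestion can be approximated from inside a test script by
comparing wall-clock time with CPU time, much as "time" does. This is
only a sketch: profile_call and classify are hypothetical names, and the
0.5 threshold is an arbitrary assumption rather than anything Xapian
defines.

```python
import time


def profile_call(fn, *args, **kwargs):
    """Run fn and return (result, wall_seconds, cpu_seconds).

    cpu_seconds is user + system CPU time of this process; time spent
    waiting on the disk is not counted in it.
    """
    wall_start = time.perf_counter()
    cpu_start = time.process_time()
    result = fn(*args, **kwargs)
    cpu = time.process_time() - cpu_start
    wall = time.perf_counter() - wall_start
    return result, wall, cpu


def classify(wall, cpu, threshold=0.5):
    """Rough guess: 'cpu-bound' if CPU time is at least threshold of wall time.

    Otherwise the process was mostly waiting, which for a search usually
    means IO, so it is reported as 'io-bound'.
    """
    if wall <= 0:
        return "unknown"
    return "cpu-bound" if cpu / wall >= threshold else "io-bound"
```

Wrapping the code that runs a query in profile_call, once with a hot cache
and once after clearing it, should show whether the extra time on uncached
queries is spent waiting rather than computing.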

The types of queries you're performing will have quite an effect. If
you're using things like a ValueWeightPostingSource to add document
weights to the query, that will probably mainly use cached data, and
add to the CPU load rather than the IO load.

--
Richard