On Sat, Sep 25, 2004 at 01:08:06AM -0600, gervin23 wrote:
> i've been performing some tests with xapian on a set of docs numbering
> 86801 with average length of 291.895. i've also removed stop words to
> help with speed and size but i'm noticing some serious performance hits
> when it comes to phrase searches, sometimes hitting +14s as compared to
> ~0.01s i'm getting with most boolean searches.
Is this in real use, or when running test queries after building the
database? One effect to be aware of is that queries will be a lot
slower when nothing is cached. In real use, pretty much all the
non-leaf blocks from the B-trees being used for the search will be
cached pretty quickly, as well as many commonly used leaf blocks.
Are the phrase queries that are slow just a phrase, or combined with
other terms? For queries with a phrase and other terms, we could defer
the phrase check in some cases to reduce how often we actually need to
do it (there's a bugzilla entry for this).
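For reference, here's the sort of query I mean, sketched with the C++
API (the term names are made up for illustration):

    #include <xapian.h>
    #include <string>
    #include <vector>

    // "information retrieval" as a phrase.
    std::vector<std::string> phrase_terms;
    phrase_terms.push_back("information");
    phrase_terms.push_back("retrieval");
    Xapian::Query phrase(Xapian::Query::OP_PHRASE,
                         phrase_terms.begin(), phrase_terms.end());

    // ...ANDed with an ordinary term.  It's this sort of query where
    // the expensive positional check could be deferred until the other
    // term has already filtered the candidate documents.
    Xapian::Query query(Xapian::Query::OP_AND,
                        phrase, Xapian::Query("xapian"));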
There's also a plan to tweak the indexing to eliminate some common cases
which are currently phrases but don't really need to be (e.g.
"e-mail").
> the following quote from the documentation has piqued my interest:
> "implementing an ultra-compact read-only backend which would take a
> quartz b-tree and convert it to something like a cdb hashed file" ...
> "If you need serious performance, implementing such a backend is worth
> considering.". firstly, would this help with phrase searches and
> secondly, what are the details in implementing such a beast? i don't
> need write access once indexed so this may be something i'd like to
> try.
It would help everything a bit just by reducing the amount of I/O
required. A large search system is going to become I/O bound
eventually, and at that point you win by reducing the amount of I/O and
improving the useful information in every block cached.
But first, have you run the database through quartzcompact? This
produces Btrees which are fully compact and at revision 1 - they're
smaller and revision 1 means they're inherently a bit more efficient to
access.
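If not, it's a one-off operation which writes a compacted copy and
leaves the source database untouched - from memory the usage is just:

    quartzcompact SOURCE_DATABASE_DIR DESTINATION_DIR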
Also relevant are some plans I've been working on for revised encoding
schemes to replace those currently used by quartz. Some tests I did
showed the new positionlist encoding to be ~15% more compact than what
we currently use, and we can save on the other tables too (more on some,
less on others), which will all help.
Anyway, as to the question of producing a cdb backend - it isn't a
huge undertaking, as a lot of quartz can be reused: keep the same
encodings quartz uses, iterate through all the key/tag pairs in each
Btree (the code in quartzcompact.cc shows how to do this), and put these
into a cdb (or whatever format).
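In outline, something like this (a rough, untested sketch - Btree and
Bcursor are the quartz internals as used in quartzcompact.cc, written
from memory, and cdb_make_* is djb's cdb creation API):

    #include <string>
    #include "btree.h"    // quartz internals - not installed headers
    #include "bcursor.h"
    extern "C" {
    #include "cdb_make.h" // from djb's cdb package
    }

    // Copy every key/tag pair from one quartz table into a cdb.
    // fd is an open, writable file descriptor for the new cdb file.
    void copy_table(const std::string & path, int fd) {
        Btree in(path, true);  // true => open read only
        in.open();
        Bcursor cur(&in);
        cur.find_entry("");    // position before the first entry

        struct cdb_make cdbm;
        cdb_make_start(&cdbm, fd);
        while (cur.next()) {
            cur.read_tag();
            const std::string & k = cur.current_key;
            const std::string & t = cur.current_tag;
            // djb's prototypes aren't const-clean, hence the casts.
            cdb_make_add(&cdbm, const_cast<char *>(k.data()), k.size(),
                         const_cast<char *>(t.data()), t.size());
        }
        cdb_make_finish(&cdbm);
    }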
Then a bit of glue code is needed to reimplement the bits of the Btree
external API which are used for retrieval.
Looking at the cdb API, it's possible cdb isn't actually suitable as we
need to be able to support a cursor which can move to the next and
previous keys. The alternative would be to not chunk the posting lists,
which is probably going to hurt performance. Or store all chunks for a
term with the same key in the cdb (they allow multiple entries with the
same key), but that may make skipping over chunks expensive as we'd need
to read each chunk to check if it was the one we wanted...
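To illustrate that last option, reading back all the chunks stored
under one key would look something like this with djb's reading API
(sketch; error handling omitted, and key/keylen are whatever term key
we're after):

    struct cdb c;
    cdb_init(&c, fd);   // fd: open file descriptor for the cdb
    cdb_findstart(&c);
    while (cdb_findnext(&c, key, keylen) > 0) {
        unsigned int len = cdb_datalen(&c);
        char * buf = new char[len];
        cdb_read(&c, buf, len, cdb_datapos(&c));
        // Each hit is one posting list chunk.  To skip to the chunk
        // covering a given docid we'd have to decode each one in turn,
        // which is what makes skipping potentially expensive here.
        delete [] buf;
    }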
Hmm, I also suspect the licence (which appears to be "You may distribute
unmodified copies of the cdb package.") isn't GPL compatible. So a cdb
backend would be fine for in-house use, but we couldn't distribute the
cdb library itself with Xapian, and nobody could distribute cdb-enabled
Xapian packages.
Also cdb is limited to 4GB per table, which is going to be too
restrictive for high end users.
So "something like cdb" might need identifying first!
I'm not sure quite what the saving would be. You could calculate the
exact size of the cdb corresponding to a particular database - just
iterate over the key/tag pairs, and total up the sizes of each, plus 24
bytes per pair, plus 2048 bytes per table. But I've not tried this.
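A sketch of that calculation, reusing the Bcursor loop from the sketch
above:

    unsigned long long total = 2048;  // fixed header per cdb table
    Bcursor cur(&in);
    cur.find_entry("");
    while (cur.next()) {
        cur.read_tag();
        // 24 bytes of cdb overhead per key/tag pair
        total += cur.current_key.size() + cur.current_tag.size() + 24;
    }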
I suspect quartzcompact plus the new encodings will make more difference
than converting to cdb. But you could use the new encoding in a cdb
too, and get the gains from both!
> one other question if i may. how would one get just the record numbers
> for the entire estimated MSet? my html layout at the moment is two
> frames, a table of contents on the left and search/document on the
> right. in order to show (in the toc) which folder/subfolders/documents
> have hits, i'm having to issue 'enquire.get_mset(0, 1000)' to
> shelve the
> first 1000 record numbers then issue another 'enquire.get_mset(0,
> 10)'
> to display the first ten, twenty, etc... of course, the bottleneck is
> the first get_mset so removing that step would help wonderfully.
But get_mset only actually gets the document ids and weights - document
data, etc is lazily fetched, so if you don't ask for it, you don't incur
any extra overhead.
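So iterating over a large MSet to pull out just the ids is cheap - in
C++ (the Python bindings behave the same way):

    Xapian::MSet mset = enquire.get_mset(0, 1000);
    for (Xapian::MSetIterator i = mset.begin(); i != mset.end(); ++i) {
        Xapian::docid did = *i;      // already held in the MSet
        double wt = i.get_weight();  // likewise
        // Only i.get_document().get_data() would trigger a lazy fetch.
    }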
Getting 10 matches is faster not because we're reading less document
data but because we're able to deploy our shortcutting optimisations
more often and more effectively.
> hardware-wise, i'm running a Pentium III (Katmai) @ 551MHz with 512MB
> RAM and a 7200rpm disk (hdparm reports 39.45 MB/sec).
How big are the database tables (i.e. what's the output of ls -l on the
database directory)?
Cheers,
Olly