On 3-11-2005 13:32, tech@dbx.co.uk wrote:
> Is there anything that can be done to speed up phrase searching? It is
> currently a show stopper for our CV search system with queries for common
> terms taking several minutes to execute. Simply ANDing the terms together
> will return in 1-3 seconds.
If you know beforehand what your phrases will look like and how you'll
search for them, you may be able to. E.g. if you have system paths and
look through them in "tree-order", you can just build up the subpaths
and index them as normal terms (/usr/local/bin/omega can be /usr,
/usr/local, /usr/local/bin).
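A minimal sketch of that subpath idea in Python. The function name
subpath_terms is made up for illustration and is not part of Xapian; it
only shows which strings you could hand to the indexer as ordinary terms.
Here the full path is emitted as a term too.

```python
def subpath_terms(path):
    """Return every leading subpath of an absolute path, shortest first."""
    # Drop empty components caused by the leading or doubled slashes.
    parts = [p for p in path.split("/") if p]
    return ["/" + "/".join(parts[:i]) for i in range(1, len(parts) + 1)]


terms = subpath_terms("/usr/local/bin/omega")
# terms == ["/usr", "/usr/local", "/usr/local/bin", "/usr/local/bin/omega"]
```

A search for everything below /usr/local then becomes a plain term
lookup for "/usr/local" instead of a phrase query.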
But if it's just plain text and you want normal sentences to be
retrievable... you're probably just stuck with finding each document
containing the terms and checking whether those terms are in the correct
order. There are search engines which only use word-pairs and can
therefore not correctly identify hits (they also see a document
containing "foo bar" and "bar test" separately as a match for
"foo bar test").
It may be faster to combine such word-pairs with normal phrase
searching: build a query that checks for both the correct word-pairs
and the phrase.
The drawback is of course that you'll increase the size of your postlist
quite a bit (you don't need the word-pairs in the position table,
however). But the advantage should be that you can narrow down the list
of candidate documents a lot better than with the normal "and search"
which is the basis for the phrase search.
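A rough Python sketch of that combination, using a toy in-memory index
rather than Xapian. The class TinyIndex and its methods are invented
for this example, the stored word lists stand in for the position
table, and phrases are assumed to have at least two words.

```python
from collections import defaultdict


class TinyIndex:
    def __init__(self):
        self.term_postings = defaultdict(set)  # word -> doc ids
        self.pair_postings = defaultdict(set)  # (w1, w2) -> doc ids
        self.docs = {}                         # doc id -> word list (positions)

    def add(self, doc_id, text):
        words = text.lower().split()
        self.docs[doc_id] = words
        for w in words:
            self.term_postings[w].add(doc_id)
        for pair in zip(words, words[1:]):
            self.pair_postings[pair].add(doc_id)

    def and_candidates(self, phrase):
        """Plain AND of the single terms."""
        sets = [self.term_postings.get(w, set()) for w in phrase.lower().split()]
        return set.intersection(*sets) if sets else set()

    def pair_candidates(self, phrase):
        """AND of the word-pair terms: a smaller set to verify."""
        words = phrase.lower().split()
        sets = [self.pair_postings.get(p, set()) for p in zip(words, words[1:])]
        return set.intersection(*sets) if sets else set()

    def phrase_search(self, phrase):
        """Filter on word-pairs first, then verify positions on what is left."""
        words = phrase.lower().split()
        n = len(words)
        hits = set()
        for doc_id in self.pair_candidates(phrase):
            d = self.docs[doc_id]
            if any(d[i:i + n] == words for i in range(len(d) - n + 1)):
                hits.add(doc_id)
        return hits


index = TinyIndex()
index.add(1, "foo bar test")
index.add(2, "foo bar and later bar test")
index.add(3, "test bar foo")
index.add(4, "foo test")
```

With these four documents the plain AND leaves documents 1, 2 and 3 to
position-check, the word-pair filter leaves only 1 and 2, and the
position check then keeps document 1.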
> I keep thinking that I must be missing something in either the way I index
> or the way I (or rather the QueryParser) constructs the queries.
In the general case, I don't think there really is a better way. But if
space is no problem and the speed of the position table is the most
important part, you may be able to increase the size of the indexes to
decrease the number of documents to look through.
Olly already mentioned using Flint. Using xapian-compact to further
decrease the size of the database may help a lot for searches as well.
You may want to keep two versions of your database: the non-compacted
one for updating and the fully compacted one for searches.
For Flint the compaction is a bit less dramatic than for Quartz. With
Flint our 14G non-compacted database decreases to 12G compacted (Flint
uses zlib-compression as well). The drawback of compaction is of course
the time it takes: one hour to compact on our machine.
Best regards,
Arjen