On Tue, Jun 01, 2004 at 12:06:47PM +0200, Robert Pollak wrote:
> I am using the Xapian-0.8.0 snapshot from 15-Apr-2004 02:14, and I am
> using the same Xapian::WritableDatabase instance for indexing and
> searching.
>
> Currently each search causes a database flush, which is slow.
> How can I avoid this flush?
I think the first question is: what are you searching for?
A search does two things which can cause a flush. The
first is opening posting lists for the terms in the search. If any of
the search terms was in a document added, removed, or modified since
the last flush, quartz will flush.
The other is calculating the average document length.
It might be possible to avoid the search entirely - for example, if you
just want to see if there's a document with a certain UID term, you can
look at the postlist for that term, rather than running a full blown
search. Then you'll only cause a flush if you try to update a document
added since the last flush. This is how omindex and scriptindex work.
If you really need to do a search, a boolean search would avoid the need
to calculate the average document length, so will avoid flushing except
when you search for a term used in a recent change.
If you need a probabilistic search, it shouldn't be hard to adjust the
average length to account for buffered changes without forcing a flush.
But you'd still force a flush when you search for a term used in a
recent change.
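
To make the average length adjustment concrete, here is a minimal sketch of the arithmetic. The LengthStats struct and its field names are hypothetical bookkeeping invented for illustration, not Xapian's internal structures; the assumption is that the writer tracks how many documents and how much total length the buffered changes have added or removed.

```cpp
#include <cstdint>

// Hypothetical bookkeeping (not Xapian's internals): totals as of the last
// flush, plus deltas accumulated by changes still sitting in the buffer.
struct LengthStats {
    std::uint64_t flushed_doc_count;      // documents on disk
    std::uint64_t flushed_total_length;   // sum of their document lengths
    std::int64_t  buffered_doc_delta;     // documents added minus removed
    std::int64_t  buffered_length_delta;  // length added minus removed
};

// Average document length including buffered changes, computed without
// touching the on-disk tables (so no flush is required).
double adjusted_average_length(const LengthStats& s) {
    const std::int64_t docs =
        static_cast<std::int64_t>(s.flushed_doc_count) + s.buffered_doc_delta;
    const std::int64_t length =
        static_cast<std::int64_t>(s.flushed_total_length) + s.buffered_length_delta;
    // An empty (or inconsistent) database has no meaningful average.
    if (docs <= 0 || length <= 0) return 0.0;
    return static_cast<double>(length) / static_cast<double>(docs);
}
```

This only covers the average length. Opening the posting list for a recently changed term would still force a flush.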
> It seems that I have to modify Xapian to either
> - search only the already flushed data (eventually missing some hits)
This is easy to do - just open the database read-only (i.e. as a
Xapian::Database). Whenever you explicitly flush or get a
Xapian::DatabaseModifiedError, call reopen() on the read-only database.
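
A rough sketch of the retry side of that pattern follows. The helper name search_with_reopen and the max_attempts parameter are made up for illustration. The database and exception types are template parameters so that the sketch compiles without the Xapian headers; the only thing it assumes about Db is a reopen() method.

```cpp
// Run `search` against a read-only database handle. If the writer has
// modified the database underneath us, reopen the handle and try again.
// ModifiedError stands in for Xapian::DatabaseModifiedError and Db for
// Xapian::Database (both are assumptions supplied by the caller).
template <typename ModifiedError, typename Db, typename Search>
auto search_with_reopen(Db& db, Search&& search, int max_attempts = 3)
    -> decltype(search(db)) {
    for (int attempt = 1; ; ++attempt) {
        try {
            return search(db);
        } catch (const ModifiedError&) {
            // Give up after max_attempts and let the caller see the error.
            if (attempt >= max_attempts) throw;
            db.reopen();  // pick up the latest flushed revision
        }
    }
}
```

With the real library this would presumably be called as search_with_reopen<Xapian::DatabaseModifiedError>(db, ...) with db a Xapian::Database. After an explicit flush on the writer you would just call reopen() on the reader directly rather than waiting for the exception.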
> or
> - search the un-flushed data, too.
If you need searching of unflushed data without forcing a flush when you
hit a term used in a recent change, you need to generate modified
posting lists on the fly. This is certainly possible, but it's rather
fiddly.
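
To give an idea of what "modified posting lists on the fly" means, here is a simplified sketch which merges the on-disk posting list for one term with the buffered changes for that term. It is not how quartz stores things: BufferedTermChanges and merged_postlist are hypothetical names, only document ids are handled (a real posting list also carries the wdf for each entry), and the added and removed sets are assumed to be disjoint.

```cpp
#include <cstdint>
#include <set>
#include <vector>

using DocId = std::uint32_t;

// Hypothetical buffer of unflushed changes for ONE term.
// Assumption: `added` and `removed` never contain the same document id.
struct BufferedTermChanges {
    std::set<DocId> added;    // documents which gained the term
    std::set<DocId> removed;  // documents which lost the term or were deleted
};

// Produce the posting list a search should see: the sorted on-disk list,
// minus removed documents, plus added ones, still in ascending docid order.
std::vector<DocId> merged_postlist(const std::vector<DocId>& on_disk,
                                   const BufferedTermChanges& changes) {
    std::vector<DocId> result;
    auto add_it = changes.added.begin();
    for (DocId id : on_disk) {
        // Emit buffered additions which sort before this on-disk entry.
        while (add_it != changes.added.end() && *add_it < id) {
            result.push_back(*add_it);
            ++add_it;
        }
        // A modified document may be both on disk and in `added`: emit once.
        if (add_it != changes.added.end() && *add_it == id) ++add_it;
        if (changes.removed.count(id) == 0) result.push_back(id);
    }
    // Any remaining additions sort after everything on disk.
    result.insert(result.end(), add_it, changes.added.end());
    return result;
}
```

The fiddly part in practice is doing this lazily as an iterator inside the matcher, and keeping the term statistics consistent with the merged list, rather than building a vector up front as done here.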
Cheers,
Olly