On Mon, Oct 31, 2005 at 10:49:56AM +0100, Arjen van der Meijden wrote:
> For the searches through the set of documents Xapian/Omega work very
> well. For the alerts on new documents, I'm wondering how to do it.
> The naive approach is of course to just store a list of searchqueries
> that users have asked to be alerted on.
> But it will likely run into hundreds of such queries, maybe even a few
> thousand. Each added set of documents would then be "searched" by each
> stored query, and even though that can be done quite fast (prepend
> B=Q$newId1 B=Q$newId2 etc to the query) it may (will?) be too much
> overhead nonetheless.
You could add documents to a new database which is then searched for
alerting purposes. Then merge that into the main database, and repeat.
Probabilistic weights will generally be different in the alerting
database because the term frequencies won't be the same as they would
be in the full database. But they won't be wrong, just different.
You seemed to be saying that many searches would be boolean only
anyway, so the weights wouldn't even apply then.
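A toy sketch of that cycle, with plain Python dicts standing in for the
Xapian databases (the names and data are made up for illustration; real
code would use the Xapian API):

```python
# Toy sketch of the "alert against a small delta database, then merge"
# cycle.  Probabilistic weights computed over the delta would differ
# from those over the merged database (different term frequencies), but
# boolean matches come out identical either way.

def index(db, docid, terms):
    """Add a document's terms to a dict-based inverted index."""
    for term in terms:
        db.setdefault(term, set()).add(docid)

def matches(db, query_terms):
    """Boolean AND match: docids containing every query term."""
    sets = [db.get(t, set()) for t in query_terms]
    return set.intersection(*sets) if sets else set()

main_db = {}
index(main_db, 1, ["xapian", "search"])

stored_queries = {"alice": ["omega", "alert"], "bob": ["xapian"]}

# 1. Index the new batch of documents into a fresh delta database.
delta_db = {}
index(delta_db, 2, ["xapian", "alert", "omega"])

# 2. Run every stored query against the (small) delta only.
alerts = {user: matches(delta_db, q) for user, q in stored_queries.items()}

# 3. Merge the delta into the main database, then discard it and repeat.
for term, docids in delta_db.items():
    main_db.setdefault(term, set()).update(docids)
```

The point is that each alerting pass only touches the delta, which stays
small no matter how large the main database grows.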
Another approach would be to make use of the hypothetical feature
someone suggested which would allow restricting a match to a range
of document ids. So you could note the result of get_lastdocid(),
add the new documents, then run a query restricted to the new
documents. While the feature is currently hypothetical, it wouldn't be
hard to implement, and should be better than a long list of B=Q$newId<n>
filters. If the update isn't too huge and you run alerting right after
updating, all the postlist Btree blocks should still be cached too.
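The docid-range idea can be sketched like this (again a toy in-memory
stand-in, with an explicit minimum-docid filter playing the part of the
hypothetical restriction feature):

```python
# Toy sketch of docid-range alerting: remember the highest docid before
# an update, then restrict each stored query's matches to docids above
# that snapshot.  A list of (docid, terms) pairs stands in for the
# database; min_docid stands in for the hypothetical range restriction.

docs = [(1, {"xapian", "search"}), (2, {"omega", "cgi"})]

def search(query_terms, min_docid=0):
    """Boolean AND match over docs, limited to docid > min_docid."""
    return [docid for docid, terms in docs
            if docid > min_docid and query_terms <= terms]

last_docid = max(d for d, _ in docs)   # like get_lastdocid()

# Apply the update...
docs.append((3, {"xapian", "alert"}))
docs.append((4, {"omega", "alert"}))

# ...then alert only on documents added since the snapshot.
new_xapian = search({"xapian"}, min_docid=last_docid)
all_xapian = search({"xapian"})
```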
> Reversing the process might be quite nice, but how to do that? The
> queries should be stored as documents and the document should be "the
> query". But than you lose the boolean logic and phrase operators from
> the original query.
If phrases and boolean logic are fairly rare, you could handle those
alerts specially.
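One way to special-case them in the reversed scheme: use plain term
matching to find candidate queries cheaply, then re-verify only those
candidates that carry a phrase (or boolean structure) against the full
document. A hedged sketch, with made-up query records:

```python
# Toy sketch of the reversed ("queries as documents") approach with a
# special case for phrases: term-level matching finds candidate queries,
# and any candidate carrying a phrase is re-checked against the text.

stored = {
    "q1": {"terms": {"xapian", "alert"}, "phrase": None},
    "q2": {"terms": {"open", "source"}, "phrase": "open source"},
}

def alerts_for(text):
    doc_terms = set(text.lower().split())
    hits = []
    for qid, q in stored.items():
        if not q["terms"] <= doc_terms:
            continue                      # cheap term-level filter
        if q["phrase"] and q["phrase"] not in text.lower():
            continue                      # exact phrase re-check
        hits.append(qid)
    return sorted(hits)

a = alerts_for("xapian alert source is open")
b = alerts_for("open source alert for xapian")
```

If phrase and boolean queries really are rare, the expensive re-check
only ever runs on a handful of candidates per new document.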
We used to have a "batchenquire" feature (inherited from muscat 3.6)
which merged a list of alerting queries into one enormous query, then
ran it to produce a large M-set, then split that M-set up to produce
an M-set for each query. But it didn't support boolean filters or
operators, and the large query and M-set splitting weren't especially
cheap. The batchenquire code never fully worked and got dropped a long
time ago, but it is another approach.
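The merge-and-split idea looks roughly like this (a toy boolean version;
batchenquire dealt in weighted M-sets, and the names here are invented):

```python
# Toy sketch of the old "batchenquire" idea: OR every alerting query
# into one combined query, run that once over the documents, then split
# the combined result back into one result set per query.

queries = {"qa": {"xapian"}, "qb": {"omega", "alert"}}
docs = {10: {"xapian", "search"}, 11: {"omega", "alert"}, 12: {"omega"}}

combined = set().union(*queries.values())      # one enormous OR query

# Single pass: which documents match the combined query at all?
candidates = {d for d, terms in docs.items() if terms & combined}

# Split: assign each candidate to the queries it actually satisfies
# (boolean AND per query here; real weighting would need more care).
msets = {qid: sorted(d for d in candidates if q <= docs[d])
         for qid, q in queries.items()}
```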
Cheers,
Olly