thr3ads.net - Xapian discuss - [Xapian-discuss] Filtering queries with many boolean terms [Oct 2009]

If this information is useful, please help other people find it:
Share via:

Jason Tackaberry

2009-Oct-04 01:15 UTC

[Xapian-discuss] Filtering queries with many boolean terms

Hi all,

I'm thinking of using xapian to index my growing mail collection, which
I suppose is a reasonably common use-case.  (The fact that gmane has
used xapian to index some tens of millions of emails is what piqued my
interest.)

First I want to start off with how very impressed I've been with Xapian.
The API is nice, feature-complete bindings are available for my
preferred language, it's very fast, and it generally Just Works exactly
how I expect.  It's very good quality software.

One of the features I'd like to have is gmail-like tagging of threads.
This means I need to be able to add and remove tags to arbitrarily many
documents after they have been indexed.

Unfortunately, xapian is quite slow at updating many documents.
Updating 10k documents by adding an XTAGfoo term takes 40 seconds
(including db flush) on my system.  One use-case for updating this many
documents is, for example, applying a tag to existing emails in a
mailing list.

So, I am looking into an alternative approach.  I'm already maintaining
a table of all threads (gmail would call them conversations) in an sql
database.  All emails in the xapian database are given an XTID<n> term
corresponding to the thread id in the sql database.  If I want to search
for all documents with the word "foo" and with tag "bar", I
figured I'd
first look up all thread ids with tag "bar" in SQL, and then use those
ids as filters with the Xapian search.  So the Xapian query might look
like:

        (foo) (tid:0 tid:1 tid:2 tid:3)

Where tid: is prefix for boolean term XTID.  This translates to:

        Xapian::Query((Zfoo:(pos=1) AND 0 * (XTID0 OR XTID1 OR XTID2 OR
        XTID3)))


This works fine until I start to add many XTID terms, and then I start
to notice that query time increases (about quadratically it seems) with
the number of terms.  This surprised me, because a search for "foo" by
itself finishes in 0.0005 seconds with all matches fetched.  I expected
the XTID terms to apply as a filter, and therefore be no worse than the
time it takes for the query to complete (with all matches fetched)
without the XTID terms.

Indeed, if the result set for query "foo" is sufficiently low (maybe a
few thousand documents), it's actually faster for me to iterate over
them all and manually check the XTID values myself.

Is there any way to improve Xapian with this type of search?

I've also observed that if I put "OR" between the tid: terms, it
actually generates a different query that executes in less than half the
time than without "OR":

        (foo) (tid:0 OR tid:1 OR tid:2 OR tid:3)
        
becomes:
        
        Xapian::Query((Zfoo:(pos=1) AND (0 * XTID0 OR 0 * XTID1 OR 0 *
        XTID2 OR 0 * XTID3)))

This was another surprising result, because I understood that "OR" was
implicit when multiple boolean terms were specified, so I didn't expect
a different query to be produced.

Lastly, if I remove the parens, a different query still is generated:

        (foo) tid:0 tid:1 tid:2 tid:3
        
becomes:
        
        Xapian::Query((Zfoo:(pos=1) FILTER (XTID0 OR XTID1 OR XTID2 OR
        TID3)))

Unfortunately this has about the same performance as the first form.

Even with the fastest form (tid:0 OR tid:1 OR tid:2 [...]) it's still a
bit disappointing.  Hopefully someone can shed some insight as to what's
happening, and how I might improve the performance.

Thanks!
Jason.

James Aylett

2009-Oct-05 18:27 UTC

head link

[Xapian-discuss] Filtering queries with many boolean terms

On Sat, Oct 03, 2009 at 09:15:48PM -0400, Jason Tackaberry wrote:
>         (foo) (tid:0 tid:1 tid:2 tid:3)
What are the brackets for? The '*' in the output is, I think,
OP_SCALE_WEIGHT, which doesn't seem a good query structure for what
you're trying to do.

You actually want a FILTER query at top level; which you've figured
out how to generate. I don't know why this isn't behaving as fast as
you want (some tabulated figures on this, with information about the
corpus size and so on might help others respond here).

J

-- 
  James Aylett

  talktorex.co.uk - xapian.org - uncertaintydivision.org

Olly Betts

2009-Oct-06 10:00 UTC

head link

[Xapian-discuss] Filtering queries with many boolean terms

On Sat, Oct 03, 2009 at 09:15:48PM -0400, Jason Tackaberry
wrote:> Unfortunately, xapian is quite slow at updating many documents.
> Updating 10k documents by adding an XTAGfoo term takes 40 seconds
> (including db flush) on my system.  One use-case for updating this many
> documents is, for example, applying a tag to existing emails in a
> mailing list.
Currently we rewrite all the terms if any are changed, which is why it
is a bit slow.  This could be improved:

http://trac.xapian.org/ticket/250

So one approach would just be to live with the 40 seconds for now and
you'll get a nice speed up when that optimisation is implemented.
> So, I am looking into an alternative approach.  I'm already maintaining
> a table of all threads (gmail would call them conversations) in an sql
> database.  All emails in the xapian database are given an XTID<n>
term
> corresponding to the thread id in the sql database.  If I want to search
> for all documents with the word "foo" and with tag
"bar", I figured I'd
> first look up all thread ids with tag "bar" in SQL, and then use
those
> ids as filters with the Xapian search.  So the Xapian query might look
> like:
> 
>         (foo) (tid:0 tid:1 tid:2 tid:3)
Note that you really shouldn't build up query strings mechanically to
pass to the QueryParser.  It doesn't have a rigidly defined syntax, but
is really a heuristic parser aiming to handle potentially poorly formed
human written input (and those heuristics can change between releases)
so it may not parse your built query as you intended, or may stop
doing so.

Instead parse "foo" and then append the filters using the API:

    // Assuming the tids are listed in vector<string> tids:
    Xapian::Query filter(Xapian::Query::OP_OR, tids.begin(), tids.end());
    Xapian::Query q = qp->parse_query("foo");
    q = Xapian::Query(Xapian::Query::OP_FILTER, q, filter);
> Where tid: is prefix for boolean term XTID.  This translates to:
> 
>         Xapian::Query((Zfoo:(pos=1) AND 0 * (XTID0 OR XTID1 OR XTID2 OR
>         XTID3)))
> 
> This works fine until I start to add many XTID terms, and then I start
> to notice that query time increases (about quadratically it seems) with
> the number of terms.  This surprised me, because a search for
"foo" by
> itself finishes in 0.0005 seconds with all matches fetched.  I expected
> the XTID terms to apply as a filter, and therefore be no worse than the
> time it takes for the query to complete (with all matches fetched)
> without the XTID terms.
Well the filter part will take time to evaluate, but will potentially
reduce the work required for the probabilistic part.  So it's not
totally clear whether it will be fast with or without the filter,
but I/O is usually the limiting factor (having to read a single block
from disk takes many many CPU cycles) so I'd probably guess the filtered
version would be slower if there are a lot of "XTID" terms in it.
> Indeed, if the result set for query "foo" is sufficiently low
(maybe a
> few thousand documents), it's actually faster for me to iterate over
> them all and manually check the XTID values myself.
But this looks odd to me.  When you say "query time increases", what
are you actually timing here?  If it includes the query parsing time
then I suspect the quadratic behaviour is probably there.

Pair-wise construction of a long query is still quadratic:

http://trac.xapian.org/ticket/273

We've worked around this in places in the QueryParser by putting
subqueries in a container and building a query from that, but it's
quite possible this can still happen in some cases.  Generally
humans don't enter queries with more than a dozen terms, so quadratic
behaviour usually doesn't kick in and so can easy escape notice.

I'll try to produce a simple testcase later today.
> I've also observed that if I put "OR" between the tid: terms,
it
> actually generates a different query that executes in less than half the
> time than without "OR":
> 
>         (foo) (tid:0 OR tid:1 OR tid:2 OR tid:3)
>         
> becomes:
>         
>         Xapian::Query((Zfoo:(pos=1) AND (0 * XTID0 OR 0 * XTID1 OR 0 *
>         XTID2 OR 0 * XTID3)))
This is just a different representation with the same meaning as the
query above and should result in exactly the same internal
representation in the matcher and so have exactly the same performance.
If it's much faster, that suggests that this is a parsing issue.
> Lastly, if I remove the parens, a different query still is generated:
> 
>         (foo) tid:0 tid:1 tid:2 tid:3
>         
> becomes:
>         
>         Xapian::Query((Zfoo:(pos=1) FILTER (XTID0 OR XTID1 OR XTID2 OR
>         TID3)))
This one should also result in the same internal representation in the
matcher.

Cheers,
    Olly

Xapian discuss - Oct 2009 - Filtering queries with many boolean terms

[Xapian-discuss] Filtering queries with many boolean terms

[Xapian-discuss] Filtering queries with many boolean terms

[Xapian-discuss] Filtering queries with many boolean terms