Jason Tackaberry
2009-Oct-04 01:15 UTC
[Xapian-discuss] Filtering queries with many boolean terms
Hi all, I'm thinking of using xapian to index my growing mail collection, which I suppose is a reasonably common use-case. (The fact that gmane has used xapian to index some tens of millions of emails is what piqued my interest.) First I want to start off with how very impressed I've been with Xapian. The API is nice, feature-complete bindings are available for my preferred language, it's very fast, and it generally Just Works exactly how I expect. It's very good quality software. One of the features I'd like to have is gmail-like tagging of threads. This means I need to be able to add and remove tags to arbitrarily many documents after they have been indexed. Unfortunately, xapian is quite slow at updating many documents. Updating 10k documents by adding an XTAGfoo term takes 40 seconds (including db flush) on my system. One use-case for updating this many documents is, for example, applying a tag to existing emails in a mailing list. So, I am looking into an alternative approach. I'm already maintaining a table of all threads (gmail would call them conversations) in an sql database. All emails in the xapian database are given an XTID<n> term corresponding to the thread id in the sql database. If I want to search for all documents with the word "foo" and with tag "bar", I figured I'd first look up all thread ids with tag "bar" in SQL, and then use those ids as filters with the Xapian search. So the Xapian query might look like: (foo) (tid:0 tid:1 tid:2 tid:3) Where tid: is prefix for boolean term XTID. This translates to: Xapian::Query((Zfoo:(pos=1) AND 0 * (XTID0 OR XTID1 OR XTID2 OR XTID3))) This works fine until I start to add many XTID terms, and then I start to notice that query time increases (about quadratically it seems) with the number of terms. This surprised me, because a search for "foo" by itself finishes in 0.0005 seconds with all matches fetched. I expected the XTID terms to apply as a filter, and therefore be no worse than the time it takes for the query to complete (with all matches fetched) without the XTID terms. Indeed, if the result set for query "foo" is sufficiently low (maybe a few thousand documents), it's actually faster for me to iterate over them all and manually check the XTID values myself. Is there any way to improve Xapian with this type of search? I've also observed that if I put "OR" between the tid: terms, it actually generates a different query that executes in less than half the time than without "OR": (foo) (tid:0 OR tid:1 OR tid:2 OR tid:3) becomes: Xapian::Query((Zfoo:(pos=1) AND (0 * XTID0 OR 0 * XTID1 OR 0 * XTID2 OR 0 * XTID3))) This was another surprising result, because I understood that "OR" was implicit when multiple boolean terms were specified, so I didn't expect a different query to be produced. Lastly, if I remove the parens, a different query still is generated: (foo) tid:0 tid:1 tid:2 tid:3 becomes: Xapian::Query((Zfoo:(pos=1) FILTER (XTID0 OR XTID1 OR XTID2 OR TID3))) Unfortunately this has about the same performance as the first form. Even with the fastest form (tid:0 OR tid:1 OR tid:2 [...]) it's still a bit disappointing. Hopefully someone can shed some insight as to what's happening, and how I might improve the performance. Thanks! Jason.
James Aylett
2009-Oct-05 18:27 UTC
[Xapian-discuss] Filtering queries with many boolean terms
On Sat, Oct 03, 2009 at 09:15:48PM -0400, Jason Tackaberry wrote:> (foo) (tid:0 tid:1 tid:2 tid:3)What are the brackets for? The '*' in the output is, I think, OP_SCALE_WEIGHT, which doesn't seem a good query structure for what you're trying to do. You actually want a FILTER query at top level; which you've figured out how to generate. I don't know why this isn't behaving as fast as you want (some tabulated figures on this, with information about the corpus size and so on might help others respond here). J -- James Aylett talktorex.co.uk - xapian.org - uncertaintydivision.org
Olly Betts
2009-Oct-06 10:00 UTC
[Xapian-discuss] Filtering queries with many boolean terms
On Sat, Oct 03, 2009 at 09:15:48PM -0400, Jason Tackaberry wrote:> Unfortunately, xapian is quite slow at updating many documents. > Updating 10k documents by adding an XTAGfoo term takes 40 seconds > (including db flush) on my system. One use-case for updating this many > documents is, for example, applying a tag to existing emails in a > mailing list.Currently we rewrite all the terms if any are changed, which is why it is a bit slow. This could be improved: http://trac.xapian.org/ticket/250 So one approach would just be to live with the 40 seconds for now and you'll get a nice speed up when that optimisation is implemented.> So, I am looking into an alternative approach. I'm already maintaining > a table of all threads (gmail would call them conversations) in an sql > database. All emails in the xapian database are given an XTID<n> term > corresponding to the thread id in the sql database. If I want to search > for all documents with the word "foo" and with tag "bar", I figured I'd > first look up all thread ids with tag "bar" in SQL, and then use those > ids as filters with the Xapian search. So the Xapian query might look > like: > > (foo) (tid:0 tid:1 tid:2 tid:3)Note that you really shouldn't build up query strings mechanically to pass to the QueryParser. It doesn't have a rigidly defined syntax, but is really a heuristic parser aiming to handle potentially poorly formed human written input (and those heuristics can change between releases) so it may not parse your built query as you intended, or may stop doing so. Instead parse "foo" and then append the filters using the API: // Assuming the tids are listed in vector<string> tids: Xapian::Query filter(Xapian::Query::OP_OR, tids.begin(), tids.end()); Xapian::Query q = qp->parse_query("foo"); q = Xapian::Query(Xapian::Query::OP_FILTER, q, filter);> Where tid: is prefix for boolean term XTID. This translates to: > > Xapian::Query((Zfoo:(pos=1) AND 0 * (XTID0 OR XTID1 OR XTID2 OR > XTID3))) > > This works fine until I start to add many XTID terms, and then I start > to notice that query time increases (about quadratically it seems) with > the number of terms. This surprised me, because a search for "foo" by > itself finishes in 0.0005 seconds with all matches fetched. I expected > the XTID terms to apply as a filter, and therefore be no worse than the > time it takes for the query to complete (with all matches fetched) > without the XTID terms.Well the filter part will take time to evaluate, but will potentially reduce the work required for the probabilistic part. So it's not totally clear whether it will be fast with or without the filter, but I/O is usually the limiting factor (having to read a single block from disk takes many many CPU cycles) so I'd probably guess the filtered version would be slower if there are a lot of "XTID" terms in it.> Indeed, if the result set for query "foo" is sufficiently low (maybe a > few thousand documents), it's actually faster for me to iterate over > them all and manually check the XTID values myself.But this looks odd to me. When you say "query time increases", what are you actually timing here? If it includes the query parsing time then I suspect the quadratic behaviour is probably there. Pair-wise construction of a long query is still quadratic: http://trac.xapian.org/ticket/273 We've worked around this in places in the QueryParser by putting subqueries in a container and building a query from that, but it's quite possible this can still happen in some cases. Generally humans don't enter queries with more than a dozen terms, so quadratic behaviour usually doesn't kick in and so can easy escape notice. I'll try to produce a simple testcase later today.> I've also observed that if I put "OR" between the tid: terms, it > actually generates a different query that executes in less than half the > time than without "OR": > > (foo) (tid:0 OR tid:1 OR tid:2 OR tid:3) > > becomes: > > Xapian::Query((Zfoo:(pos=1) AND (0 * XTID0 OR 0 * XTID1 OR 0 * > XTID2 OR 0 * XTID3)))This is just a different representation with the same meaning as the query above and should result in exactly the same internal representation in the matcher and so have exactly the same performance. If it's much faster, that suggests that this is a parsing issue.> Lastly, if I remove the parens, a different query still is generated: > > (foo) tid:0 tid:1 tid:2 tid:3 > > becomes: > > Xapian::Query((Zfoo:(pos=1) FILTER (XTID0 OR XTID1 OR XTID2 OR > TID3)))This one should also result in the same internal representation in the matcher. Cheers, Olly