richard at lemurconsulting.com
2006-Dec-06 00:04 UTC
[Xapian-devel] Bug and patch for +terms with wildcards
In current Xapian SVN HEAD, there is a bug in the query parser concerned with the handling of wildcard terms with a "+" prefix. Specifically, a query such as "+foo* bar" will be parsed by the query parser into Xapian::Query("bar") if there are no terms in the database which start "foo". Instead, since the "+" term cannot be matched, I believe this query should return no documents. I've put a patch together to fix this issue, but it requires the introduction of a new query operator, to mark a query as matching no documents (as opposed to a query created with the default query constructor, which represents an undefined query). I've called this operator "OP_MATCH_NOTHING", and it takes no subqueries. I believe this should be public, since it may be useful for people trying to write their own query parsers, rather than relying on the builtin query parser. It's possible that a similar approach would be a neat solution for representing "alldocument" queries. Currently, a special query term can be created which matches all documents by creating a leaf query for which the term is the null string. This is a somewhat "magic" and unobvious approach - instead, we could introduce an "OP_MATCH_ALL" nullary operator, which would be converted to a postlist which matches all documents. It's not clear why an empty term should magically match all documents, rather than none, or indeed why it should have any special meaning. The patch includes test cases for the bug, fixes the problem, and I think is the neatest approach, but since it adds a new, publically visible, operator to Xapian::Query, I thought I'd better put the patch up for review on the mailing list rather than commit it directly. It's attached to this email, so, comments welcome! -- Richard -------------- next part -------------- Index: queryparser/queryparser.lemony ==================================================================--- queryparser/queryparser.lemony (revision 7552) +++ queryparser/queryparser.lemony (working copy) @@ -148,6 +148,9 @@ add_to_query(q, Query::OP_OR, Query(*t, 1, pos)); ++t; } + if (q.empty()) { + return new Query(Query::OP_MATCH_NOTHING); + } return new Query(q); } Index: matcher/localmatch.cc ==================================================================--- matcher/localmatch.cc (revision 7552) +++ matcher/localmatch.cc (working copy) @@ -468,6 +468,12 @@ postlist_from_query(query->subqs[1], matcher, is_bool), matcher, db->get_doccount()); + case Xapian::Query::OP_MATCH_NOTHING: { + Assert(query->subqs.size() == 0); + LeafPostList *pl = new EmptyPostList(); + pl->set_termweight(new Xapian::BoolWeight()); + RETURN(pl); + } } Assert(false); RETURN(NULL); Index: tests/queryparsertest.cc ==================================================================--- tests/queryparsertest.cc (revision 7552) +++ tests/queryparsertest.cc (working copy) @@ -655,7 +655,7 @@ qobj = queryparser.parse_query("muscle*", Xapian::QueryParser::FLAG_WILDCARD); TEST_EQUAL(qobj.get_description(), "Xapian::Query((muscle:(pos=1) OR musclebound:(pos=1)))"); qobj = queryparser.parse_query("meat*", Xapian::QueryParser::FLAG_WILDCARD); - TEST_EQUAL(qobj.get_description(), "Xapian::Query()"); + TEST_EQUAL(qobj.get_description(), "Xapian::Query(<nodocuments>)"); qobj = queryparser.parse_query("musc*", Xapian::QueryParser::FLAG_WILDCARD); TEST_EQUAL(qobj.get_description(), "Xapian::Query((muscat:(pos=1) OR muscle:(pos=1) OR musclebound:(pos=1) OR muscular:(pos=1)))"); qobj = queryparser.parse_query("mutt*", Xapian::QueryParser::FLAG_WILDCARD); @@ -664,6 +664,29 @@ // were in the database or not): qobj = queryparser.parse_query("mUTTON++"); TEST_EQUAL(qobj.get_description(), "Xapian::Query(mutton:(pos=1))"); + // Regression test: check that wildcards work with +terms. + unsigned flags = Xapian::QueryParser::FLAG_WILDCARD | + Xapian::QueryParser::FLAG_LOVEHATE; + qobj = queryparser.parse_query("+mai* main", flags); + TEST_EQUAL(qobj.get_description(), "Xapian::Query((main:(pos=1) AND_MAYBE main:(pos=2)))"); + // Regression test (if we had a +term which was a wildcard and wasn't + // present, the query could still match documents). + qobj = queryparser.parse_query("foo* main", flags); + TEST_EQUAL(qobj.get_description(), "Xapian::Query(main:(pos=2))"); + qobj = queryparser.parse_query("+foo* main", flags); + TEST_EQUAL(qobj.get_description(), "Xapian::Query(<nodocuments>)"); + qobj = queryparser.parse_query("foo* +main", flags); + TEST_EQUAL(qobj.get_description(), "Xapian::Query(main:(pos=2))"); + qobj = queryparser.parse_query("+foo* +main", flags); + TEST_EQUAL(qobj.get_description(), "Xapian::Query(<nodocuments>)"); + qobj = queryparser.parse_query("foo* mai", flags); + TEST_EQUAL(qobj.get_description(), "Xapian::Query(mai:(pos=2))"); + qobj = queryparser.parse_query("+foo* mai", flags); + TEST_EQUAL(qobj.get_description(), "Xapian::Query(<nodocuments>)"); + qobj = queryparser.parse_query("foo* +mai", flags); + TEST_EQUAL(qobj.get_description(), "Xapian::Query(mai:(pos=2))"); + qobj = queryparser.parse_query("+foo* +mai", flags); + TEST_EQUAL(qobj.get_description(), "Xapian::Query(<nodocuments>)"); return true; } Index: tests/api_anydb.cc ==================================================================--- tests/api_anydb.cc (revision 7552) +++ tests/api_anydb.cc (working copy) @@ -208,6 +208,39 @@ return true; } +// tests for the right document count for a wildcard query +static bool test_wildquery1() +{ + Xapian::QueryParser queryparser; + unsigned flags = Xapian::QueryParser::FLAG_WILDCARD | + Xapian::QueryParser::FLAG_LOVEHATE; + queryparser.set_stemmer(Xapian::Stem("english")); + queryparser.set_stemming_strategy(Xapian::QueryParser::STEM_ALL); + Xapian::Database db = get_database("apitest_simpledata"); + queryparser.set_database(db); + Xapian::Enquire enquire(db); + + Xapian::Query qobj = queryparser.parse_query("th*", flags); + enquire.set_query(qobj); + Xapian::MSet mymset = enquire.get_mset(0, 10); + // Check that 6 documents were returned. + TEST_MSET_SIZE(mymset, 6); + + qobj = queryparser.parse_query("notindb* this", flags); + enquire.set_query(qobj); + mymset = enquire.get_mset(0, 10); + // Check that 6 documents were returned. + TEST_MSET_SIZE(mymset, 6); + + qobj = queryparser.parse_query("+notindb* this", flags); + enquire.set_query(qobj); + mymset = enquire.get_mset(0, 10); + // Check that 0 documents were returned. + TEST_MSET_SIZE(mymset, 0); + + return true; +} + // tests a query across multiple databases static bool test_multidb1() { @@ -1644,6 +1677,7 @@ {"simplequery1", test_simplequery1}, {"simplequery2", test_simplequery2}, {"simplequery3", test_simplequery3}, + {"wildquery1", test_wildquery1}, {"multidb1", test_multidb1}, {"multidb2", test_multidb2}, {"multidb3", test_multidb3}, Index: include/xapian/query.h ==================================================================--- include/xapian/query.h (revision 7553) +++ include/xapian/query.h (working copy) @@ -95,7 +95,15 @@ /** Select an elite set from the subqueries, and perform * a query with these combined as an OR query. */ - OP_ELITE_SET = 10 + OP_ELITE_SET = 10, + + /** Match no documents. + * + * This operator takes no subqueries, and matches no documents. + * It can be a useful placeholder when parsing query strings, if + * a portion of the query can be shown to match no documents. + */ + OP_MATCH_NOTHING } op; /** Copy constructor. */ @@ -150,6 +158,9 @@ /** Apply the specified operator to a single Xapian::Query object. */ Query(Query::op op_, Xapian::Query q); + /** Make a query with a nullary operator (ie, no subqueries). */ + Query(Query::op op_); + /** Get the length of the query, used by some ranking formulae. * This value is calculated automatically - if you want to override * it you can pass a different value to Enquire::set_query(). @@ -285,6 +296,12 @@ */ void validate_query() const; + /** Simplify any matchnothing subqueries, either eliminating them, + * or setting this query to matchnothing, depending on the query + * operator. + */ + void simplify_matchnothing(); + /** Get a string describing the given query type. */ static std::string get_op_name(Xapian::Query::Internal::op_t op); Index: ChangeLog ==================================================================--- ChangeLog (revision 7553) +++ ChangeLog (working copy) @@ -1,3 +1,24 @@ +Tue Dec 05 22:25:28 GMT 2006 Richard Boulton <richard at lemurconsulting.com> + + * queryparser/queryparser.lemony: Fix parsing of queries of the + form "+foo* bar", where no terms in the database match the + wildcard "foo*", but bar does exist in the database. Previously, + such queries would be equivalent to "bar". Now, they will match + no documents. + * include/xapian/query.h,api/omqueryinternal.cc,api/omquery.cc: Add + an OM_MATCH_NOTHING query operator, and add a constructor for + nullary query operators (ie, queries with no subqueries). A + query with this operator represents a query which matches no + documents (such as a wildcard query which expands to no terms). + Add support for combining OM_MATCH_NOTHING with other query + operators, in the simplify step. Also, add support for + serialising queries with operator OM_MATCH_NOTHING. + * matcher/localmatch.cc: Convert an OM_MATCH_NOTHING query into an + EmptyPostList() at match time. + * tests/api_anydb.cc,tests/queryparsertest.cc: Test parsing of + wildcard queries with +terms, and performing a match with a query + with operator OM_MATCH_NOTHING resulting from a wildcard query. + Tue Dec 05 21:12:12 GMT 2006 Richard Boulton <richard at lemurconsulting.com> * include/xapian/query.h,api/omqueryinternal.cc: Fix query Index: api/omqueryinternal.cc ==================================================================--- api/omqueryinternal.cc (revision 7553) +++ api/omqueryinternal.cc (working copy) @@ -54,6 +54,7 @@ case Xapian::Query::OP_NEAR: case Xapian::Query::OP_PHRASE: case Xapian::Query::OP_ELITE_SET: + case Xapian::Query::OP_MATCH_NOTHING: return 0; case Xapian::Query::OP_FILTER: case Xapian::Query::OP_AND_MAYBE: @@ -70,6 +71,7 @@ { switch (op) { case Xapian::Query::Internal::OP_LEAF: + case Xapian::Query::OP_MATCH_NOTHING: return 0; case Xapian::Query::OP_FILTER: case Xapian::Query::OP_AND_MAYBE: @@ -166,6 +168,9 @@ case Xapian::Query::OP_ELITE_SET: result += "*" + om_tostring(parameter); break; + case Xapian::Query::OP_MATCH_NOTHING: + result += "!"; + break; } } return result; @@ -189,6 +194,7 @@ case Xapian::Query::OP_NEAR: name = "NEAR"; break; case Xapian::Query::OP_PHRASE: name = "PHRASE"; break; case Xapian::Query::OP_ELITE_SET: name = "ELITE_SET"; break; + case Xapian::Query::OP_MATCH_NOTHING: name = "MATCH_NOTHING"; break; } return name; } @@ -211,6 +217,9 @@ if (tname.empty()) return "<alldocuments>" + opstr; return tname + opstr; } + if (op == Xapian::Query::OP_MATCH_NOTHING) { + return "<nodocuments>"; + } opstr = " " + get_op_name(op) + " "; if (op == Xapian::Query::OP_NEAR || @@ -414,6 +423,9 @@ return qint_from_vector(Xapian::Query::OP_ELITE_SET, subqs, elite_set_size); } + case '!': { + return new Xapian::Query::Internal(Xapian::Query::OP_MATCH_NOTHING, 0); + } default: DEBUGLINE(UNKNOWN, "Can't parse remainder `" << p - 1 << "'"); throw Xapian::InvalidArgumentError("Invalid query string"); @@ -554,6 +566,68 @@ } } +void +Xapian::Query::Internal::simplify_matchnothing() +{ + subquery_list::iterator sq; + switch (op) { + case OP_PHRASE: + case OP_NEAR: + case OP_AND: + case OP_FILTER: + // Doing an "AND" type operation - if we've got any MATCH_NOTHING + // nodes, we match nothing. + for (sq = subqs.begin(); sq != subqs.end(); sq++) { + if ((*sq)->op == OP_MATCH_NOTHING) { + for (sq = subqs.begin(); sq != subqs.end(); sq++) + delete *sq; + subqs.clear(); + op = OP_MATCH_NOTHING; + return; + } + } + break; + case OP_ELITE_SET: + case OP_OR: + case OP_XOR: + // Doing an "OR" type operation - if we've got any MATCH_NOTHING + // subnodes, drop them; except that we mustn't become an empty + // node due to this, so we never drop a MATCH_NOTHING subnode + // if it's the only subnode. + sq = subqs.begin(); + while (sq != subqs.end() && subqs.size() > 1) { + if ((*sq)->op == OP_MATCH_NOTHING) { + delete *sq; + sq = subqs.erase(sq); + } else { + ++sq; + } + } + break; + case OP_AND_MAYBE: + // If left hand side is MATCH_NOTHING, we match nothing. + // If right hand side is MATCH_NOTHING, replace node with LHS. + // So, if either node is MATCH_NOTHING, replace node with LHS. + // Easiest way to do this is to remove the right hand node, + // and let simplify_query() perform the replacement of + // the unary operator with + Assert(subqs.size() == 2); + if (subqs[0]->op == OP_MATCH_NOTHING || + subqs[1]->op == OP_MATCH_NOTHING) { + sq = subqs.begin(); + sq++; + delete *sq; + subqs.erase(sq); + } + + break; + case OP_LEAF: + case OP_MATCH_NOTHING: + // Do nothing. + break; + } +} + Xapian::Query::Internal * Xapian::Query::Internal::simplify_query() { @@ -588,9 +662,14 @@ break; } - // If we have no subqueries, then we're simply an undefined query. - if (subqs.empty()) return 0; + // Simplify any "MATCH_NOTHING" nodes. + simplify_matchnothing(); + // If we have no subqueries, then we're simply an undefined query, + // unless we've got a nullary operator, such as OP_MATCH_NOTHING. + if (subqs.empty() && op != OP_MATCH_NOTHING) + return 0; + // Any nodes which are valid with only one subquery can be replaced by // that solitary subquery. if (subqs.size() == 1) { Index: api/omquery.cc ==================================================================--- api/omquery.cc (revision 7552) +++ api/omquery.cc (working copy) @@ -132,6 +132,17 @@ } } +Query::Query(Query::op op_) : internal(0) +{ + try { + start_construction(op_, 0); + end_construction(); + } catch (...) { + abort_construction(); + throw; + } +} + // Copy constructor Query::Query(const Query & copyme) : internal(copyme.internal)
On Wed, Dec 06, 2006 at 12:04:24AM +0000, richard at lemurconsulting.com wrote:> In current Xapian SVN HEAD, there is a bug in the query parser concerned > with the handling of wildcard terms with a "+" prefix. Specifically, > a query such as "+foo* bar" will be parsed by the query parser into > Xapian::Query("bar") if there are no terms in the database which start > "foo". Instead, since the "+" term cannot be matched, I believe this query > should return no documents.Seems reasonable.> I've put a patch together to fix this issue, but it requires the > introduction of a new query operator, to mark a query as matching no > documents (as opposed to a query created with the default query > constructor, which represents an undefined query).You can actually get away without this by using "X AND_NOT X" where X is ideally a term which probably doesn't index anything. Just an interesting observation really, I'm not suggesting we should use this hack (though Omega used to long long ago).> I've called this operator "OP_MATCH_NOTHING", and it takes no > subqueries.I don't think this should be an operator. It doesn't operate *on* anything, so I think this is an unnatural place to put this. I know we have OP_LEAF internally, but that's an implementation detail and we deliberately don't expose it externally, instead we have a constructor to create a leaf Query object.> I believe this should be public, since it may be useful for people trying > to write their own query parsers, rather than relying on the builtin query > parser.Agreed.> It's possible that a similar approach would be a neat solution for > representing "alldocument" queries. Currently, a special query term can be > created which matches all documents by creating a leaf query for which the > term is the null string. This is a somewhat "magic" and unobvious approach > - instead, we could introduce an "OP_MATCH_ALL" nullary operator, which > would be converted to a postlist which matches all documents. It's not > clear why an empty term should magically match all documents, rather than > none, or indeed why it should have any special meaning.I agree that Xapian::Query("") isn't totally obvious, but I don't think a "nullary" (ick, what an awful word) operator is any better. We already have a "query which matches nothing" - Xapian::Query(). You can run a match on it (and get an empty MSet), but you can't currently compose it with anything else. I think a better approach would be to allow Xapian::Query() to be composed with other queries. I know we used to allow this and it all got out of hand, but that was because we tried to make it compose in magic ways (e.g. Query() AND X would be X). I'm just proposing that this works in the obvious was for an empty query, and you've already implemented this anyway! So in omquery.cc, instead of throwing an exception "Can't compose a query from undefined queries" in various places, we would just call: internal->add_subquery(Query::Internal(OP_MATCH_NOTHING)); And then add two standard constant objects to the API: Xapian::Query::MatchAll Xapian::Query::MatchNothing To create these in include/xapian/query.h we'd have: namespace Xapian { // ... class Query { // ... static Query MatchAll(""); static Query MatchNothing(); //... Conceptually these sit much better as constants than operators.> + case '!': { > + return new Xapian::Query::Internal(Xapian::Query::OP_MATCH_NOTHING, 0); > + }I think the remote protocol version probably should be incremented because of this change. I know this is compatible unless you use the new feature, but it's better to moan up front than fail on particular queries. Most people will want to update both ends in step anyway I suspect. Cheers, Olly
Possibly Parallel Threads
- [PATCH] Add set_max_wildcard_expansion method to the queryparser.
- How to filter search result with query with has white space.
- How to filter search result with query with has white space.
- How to make QueryParser select entire word like "H.O.T"
- Xapian Benchmark results