Richard Boulton
2007-Oct-19 11:21 UTC
[Xapian-devel] Re: [Xapian-commits] 9476: trunk/xapian-core/ trunk/xapian-core/include/xapian/ trunk/xapian-core/queryparser/ trunk/xapian-core/tests/
olly wrote:> SVN root: svn://svn.xapian.org/xapian > Changes by: olly > Revision: 9476 > Date: 2007-10-19 03:47:11 +0100 (Fri, 19 Oct 2007) > > Log message (14 lines): > include/xapian/queryparser.h,queryparser/queryparser.cc, > queryparser/queryparser.lemony,queryparser/queryparser_internal.h, > tests/queryparsertest.cc: Since calling QueryParser::add_prefix() > or QueryParser::add_boolean_prefix() a second time with the same > field name was ignored before (rather than overriding as we had > thought) it seems reasonable to change this behaviour. This > also avoids the need to deprecate these methods which will force all > users to update their code.I think leaving the old methods in place and undeprecated with these semantics is fine, but I'd argue the the new, 3 argument, form of add_prefix, and the prefix_type enum are good additions because they lead to future improvements (although, they're possibly not needed yet). I thought that the consensus on the query parser was that it should become more configurable, and one of the ways in which it should be more configurable is being able to generate terms according to the prefixes used in a more flexible way. In particular, it should be possible to decouple the status of a prefix as a "filter" prefix from the way in which terms are generated from the text. Currently, we generate terms in two ways: 1. By taking the text after the field specifier, and processing it as free text (stemming, lowercase conversion, end at punctuation or space, may generate phrases). 2. By taking the text after the field specifier, and processing it as a single unit (currently, lowercase conversion only, end at space, no phrases). We also use these terms in two ways: A. Involving the term in the query (so, joining it with surrounding terms according to boolean operators, and using the default operator if there aren't any such operators). B. Adding the term to a list of filter terms, which are then combined with the rest of the query to restrict the set of matching documents, but not used to affect rankings. Surrounding boolean operators are ignored or invalid (can't remember which). Currently, add_prefix(field, prefix) allows the combination of "1" and "A" above to be used, and add_boolean_prefix(field, prefix) allows the combination of "2" and "B" above to be used. There is really no reason that options 1B and 2A shouldn't be available: for example, option 2A on field "code" would allow something like "code:xkcd.2354/sad OR richard" to generate the query "XCODE:xkcd.2354/sad OR richard"; ie, be used as a normal weighted term, and included in boolean operators. My idea when introducing the prefix_type enum, and when naming its values as PREFIX_INLINE and PREFIX_FILTER, was that the prefix_type would determine how the term was "mixed" into the query (so, PREFIX_INLINE correlated to option "A" above, and PREFIX_FILTER correlated to option "B" above). My patch also hard-coded the combination of option A with "1" and option B with "2", but my thinking was that we could later add further parameters to add_prefix() specifying how the term generation should be done. Probably, this would simply involve adding a single extra parameter holding a "TermProcessor" class, which would generate a term, (or possibly a query), from a piece of text. There would be default TermProcessor classes which followed option "1" or option "2" above, and the user would be free to define further such classes. Now, it's reasonable to argue, since you're happy with changing the semantics of add_prefix and add_boolean_prefix, that we don't need the new form of add_prefix() just yet, since it doesn't add any new functionality; there's only any point in adding the new form of add_prefix() once we have an implementation which takes 4 arguments, allowing "TermProcessors" to be added. I'm not too bothered about this right now, but it does seem that add_boolean_prefix() will then become a limited form of add_prefix() and would eventually be deprecated to clean up the API, so adding a 3 argument form of add_prefix() now to allow users to start using it instead of add_boolean_prefix() doesn't seem unreasonable either. The main concern I have is to check that we're still heading in the same direction: making the query parser more configurable, and decoupling the status of a prefix as an inline or a filter prefix from the way in which term processing is done. However, I would say that changing from using an enum internally to using a boolean value for the prefix type seems like a backwards step - use of an enum makes the code more readable, in my opinion. I'd be happier if the enum was still used internally (though not exposing it in the API until we have a 3 or 4 argument form of add_prefix() is perfectly sensible). -- Richard
Olly Betts
2007-Oct-19 14:53 UTC
[Xapian-devel] Re: [Xapian-commits] 9476: trunk/xapian-core/ trunk/xapian-core/include/xapian/ trunk/xapian-core/queryparser/ trunk/xapian-core/tests/
On Fri, Oct 19, 2007 at 11:20:49AM +0100, Richard Boulton wrote:> olly wrote: > >Since calling QueryParser::add_prefix() > >or QueryParser::add_boolean_prefix() a second time with the same > >field name was ignored before (rather than overriding as we had > >thought) it seems reasonable to change this behaviour. This > >also avoids the need to deprecate these methods which will force all > >users to update their code. > > I think leaving the old methods in place and undeprecated with these > semantics is fine, but I'd argue the the new, 3 argument, form of > add_prefix, and the prefix_type enum are good additions because they > lead to future improvements (although, they're possibly not needed yet).The API as it is after this change is the original one you committed (with one small semantic change). You only changed it because I pointed out that the behaviour it changed might actually be being used. But I had thought that setting the same field to a different prefix changed it, whereas actually it's ignored, so the whole premise for changing it was flawed. I tried to bring this up on IRC, but got no response, so I went ahead as it didn't seem controversial (it was your original API after all), and SVN HEAD is currently much too far from being in a releasable state, which I want to address. Ideally we should be able to make a release at short notice if we need to.> 2. By taking the text after the field specifier, and processing it as a > single unit (currently, lowercase conversion only, end at space, no > phrases).FWIW, it also ends at a close bracket to avoid swallowing them when used at the end of a bracketted subexpression.> [...] I'm not too bothered about this > right now, but it does seem that add_boolean_prefix() will then become a > limited form of add_prefix() and would eventually be deprecated to clean > up the API, so adding a 3 argument form of add_prefix() now to allow > users to start using it instead of add_boolean_prefix() doesn't seem > unreasonable either.I think it's better not to try to predict the API we'll want here before we actually try to implement it. We may get it wrong and force users to update their add_prefix() calls multiple times. Deprecating a method in favour of another has a cost for everyone using it, so it's not something we should do too lightly.> The main concern I have is to check that we're still heading in the same > direction: making the query parser more configurable, and decoupling the > status of a prefix as an inline or a filter prefix from the way in which > term processing is done.I'm not sure I'd use the word decouple, as there are essential differences between the two (for example, it doesn't make sense to have the empty field prefix being a filter, nor to have a phrase of filters). But certainly this should be much more configurable.> However, I would say that changing from using an enum internally to > using a boolean value for the prefix type seems like a backwards step - > use of an enum makes the code more readable, in my opinion.FWIW, I find it slightly more readable as currently used - the choice is essentially: foo->filter versus: foo->type == TYPE_FILTER It also avoids the need to Assert(foo->type == TYPE_INLINE); after dealing with the boolean filter case. But the main reason I changed it was that I find "TYPE_INLINE" a rather confusing name, and couldn't think of a better one. To me, TYPE_INLINE suggests a boolean filter inline in the query string (constrasted with a boolean filter outside the query (e.g. in an HTML SELECT). Cheers, Olly