CVS HEAD now has a rewritten QueryParser implementation. I've used
Lemon instead of Bison, which means that I've been able to eliminate
the static variables, so this version is reentrant - you can now
safely parse queries in several threads at once (each should have
its own QueryParser object though).
Lemon differs from Bison in that the lexer calls the parser (in Bison
the parser calls the lexer). The upshot of this is that the code is
a lot clearer than the old version was.
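
To illustrate the lexer-calls-parser arrangement, here's a toy sketch. It is
not the real QueryParser code or anything Lemon generates: PushParser, Tok and
parse() are names made up for this example. The point is only that the parser
keeps its state in an object (no statics) and gets handed one token at a time.

```cpp
#include <cctype>
#include <string>
#include <vector>

// Hypothetical token kinds for this sketch (not Lemon's generated codes).
enum class Tok { Term, Love, Hate, End };

// Toy push parser: all state lives in members, so each instance is
// independent and nothing is shared between threads.
class PushParser {
  public:
    // Called by the lexer once per token.
    void push(Tok kind, const std::string& text = std::string()) {
        switch (kind) {
            case Tok::Love: pending_ = '+'; break;
            case Tok::Hate: pending_ = '-'; break;
            case Tok::Term:
                out_.push_back(
                    (pending_ ? std::string(1, pending_) : std::string()) + text);
                pending_ = 0;
                break;
            case Tok::End: pending_ = 0; break;
        }
    }
    const std::vector<std::string>& result() const { return out_; }

  private:
    char pending_ = 0;               // '+' or '-' waiting for its term
    std::vector<std::string> out_;   // terms seen so far, with any prefix
};

// The lexer drives: it scans the input and calls the parser for each token.
std::vector<std::string> parse(const std::string& query) {
    PushParser parser;
    std::string::size_type i = 0;
    while (i < query.size()) {
        unsigned char ch = static_cast<unsigned char>(query[i]);
        if (ch == '+') {
            parser.push(Tok::Love);
            ++i;
        } else if (ch == '-') {
            parser.push(Tok::Hate);
            ++i;
        } else if (std::isalnum(ch)) {
            std::string::size_type start = i;
            while (i < query.size() &&
                   std::isalnum(static_cast<unsigned char>(query[i]))) ++i;
            parser.push(Tok::Term, query.substr(start, i - start));
        } else {
            ++i;  // skip whitespace and anything else
        }
    }
    parser.push(Tok::End);
    return parser.result();
}
```

Because the lexer owns the loop, it can decide how to tokenise based on what
it has just seen, without the parser having to call back into it.
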
The stripped size of the compiled code is also slightly smaller. I've
not profiled it yet, but knowing the different approaches used I
wouldn't be at all surprised if it's also faster and uses less memory.
The new version supports a few things the old one didn't. For example,
you can now use + and - on phrases and brackets. Also quoted phrases
can now contain any punctuation - the old version was unreasonably
fussy about that.
I reviewed which punctuation characters act as phrase generators. The
new version adds ':' to the list. There are many examples where it's
appropriate:
  Xapian::QueryParser (in C++ and Perl)
  news.example.com:119 (port numbers)
  mailto:olly@survex.com (URL schemes)
  fe00::0 (IPv6 addresses)
  12:55:43 (times)
  C:\Windows (DOS/Windows drive letters)
And I can't actually see a case where it's undesirable. I also noticed
Google has ':' as a phrase generator.
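
For anyone unsure what a phrase generator does in practice, here's a rough
sketch of the idea: terms separated only by phrase-generator characters get
grouped into one phrase. The character set and the function names below are
assumptions for the example, not the actual list or code in the QueryParser.

```cpp
#include <cctype>
#include <string>
#include <vector>

// Assumed set of phrase generators for this sketch only. Note that ':' is
// included and '*' is not.
bool is_phrase_generator(char ch) {
    return std::string(".-/:\\@").find(ch) != std::string::npos;
}

bool is_word_char(char ch) {
    return std::isalnum(static_cast<unsigned char>(ch)) != 0;
}

// Split a query into groups of terms. Terms separated only by phrase
// generators land in the same group (i.e. would be searched as a phrase);
// any other separator starts a new group.
std::vector<std::vector<std::string>> group_terms(const std::string& query) {
    std::vector<std::vector<std::string>> groups;
    bool join_next = false;  // only phrase generators seen since last term?
    std::string::size_type i = 0;
    while (i < query.size()) {
        if (is_word_char(query[i])) {
            std::string::size_type start = i;
            while (i < query.size() && is_word_char(query[i])) ++i;
            std::string term = query.substr(start, i - start);
            if (join_next && !groups.empty()) {
                groups.back().push_back(term);
            } else {
                groups.push_back(std::vector<std::string>(1, term));
            }
            join_next = true;
        } else {
            if (!is_phrase_generator(query[i])) join_next = false;
            ++i;
        }
    }
    return groups;
}
```

With this sketch "12:55:43" comes out as a single three-term group, while
"b*witched" gives two separate one-term groups since '*' isn't in the set.
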
I've currently stopped '*' from being a phrase generator. It was
originally added solely because someone at Ananova wanted "B*witched" to
be searchable, but if you search for 'b witched' as two separate words in
Google, you'll see that argument is hardly compelling. I guess it would
also allow you to search for censored "r*de w*rds". I found it hard to
justify keeping it as a phrase generator. But I don't feel very
strongly.
You can now disable various features - for example, you can turn off
the boolean operators (AND, OR, NOT, and XOR), or quoted phrases.
There's no external API for this yet, but I think there probably should
be. But I'm not sure what level of control people will want. Is it
useful to be able to disable XOR and AND, but still have OR and NOT?
It's easy enough to implement such a fine level of control if it
actually seems useful.
This mechanism also means we can add right truncation (* wildcard at the
end of a term) and easily have it disabled by default.
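
To make the question concrete, one possible shape for an external API is a
bitmask of features. This is only a sketch of the idea: the flag names and
helper functions are hypothetical, and nothing like this is exposed yet.

```cpp
#include <string>

// Hypothetical feature flags, one bit per feature.
enum Feature : unsigned {
    FEATURE_BOOLEAN  = 1u << 0,  // AND, OR, NOT, XOR
    FEATURE_PHRASE   = 1u << 1,  // "quoted phrases"
    FEATURE_LOVEHATE = 1u << 2,  // + and - prefixes
    FEATURE_WILDCARD = 1u << 3,  // right truncation: term*
    // Wildcards are left out of the default set, so they stay off unless
    // the caller asks for them.
    FEATURE_DEFAULT  = FEATURE_BOOLEAN | FEATURE_PHRASE | FEATURE_LOVEHATE
};

// Is this word a boolean operator under the given feature set?
bool is_operator(const std::string& word, unsigned features) {
    if (!(features & FEATURE_BOOLEAN)) return false;
    return word == "AND" || word == "OR" || word == "NOT" || word == "XOR";
}

// Is this word a right-truncated term under the given feature set?
bool is_wildcard(const std::string& word, unsigned features) {
    return (features & FEATURE_WILDCARD) && word.size() > 1 &&
           word.back() == '*';
}
```

A caller would pass something like FEATURE_DEFAULT | FEATURE_WILDCARD to opt
in to truncation, or FEATURE_DEFAULT & ~FEATURE_BOOLEAN to have AND, OR, NOT
and XOR treated as ordinary words. Finer control (OR and NOT but not XOR and
AND) would just mean one bit per operator instead of one for the group.
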
I'm wondering if it's actually worth having the queryparser as a
separate library. It complicates linking, and the queryparser library
is about 3.5% of the size of the main library, so it's not like it's
saving much. If we think it's really worth having multiple shared
libraries, then it would make much more sense to split for indexing vs
searching.
Cheers,
Olly