On Wed, Nov 30, 2005 at 03:03:52PM -0800, Eric Parusel wrote:
> I noticed that @#$%^ gets split on when in a query, but & doesn't...
>
> eg: search term: aaa!bbb@ccc#ddd$eee%fff%ggg^hhh&iii
> query: Xapian::Query((aaa:(pos=1) OR bbb:(pos=2) OR ccc:(pos=3) OR
> ddd:(pos=4) OR eee:(pos=5) OR fff:(pos=6) OR ggg:(pos=7) OR
> hhh&iii:(pos=8)))
Yes, the rationale is that company names are often abbreviated to
terms containing "&", such as "AT&T", "M&S", "C&W", etc.
We also keep a trailing "+" (e.g. C++), "#" (e.g. C#), and "-" (e.g.
SO42-, as in a sulphate ion). The last is of debatable benefit: it
tends to glue a hyphen onto the preceding word when the author misses
out a space, or in text taken from formatted hardcopy where words are
hyphenated across line breaks. I've been wondering about dropping that
rule.
There's special handling for capital letters with dots between (so
I.B.M. is treated as IBM).
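
To make the term-level rules above concrete, here is a rough Python
sketch of how I read them. It is not taken from indextext.cc; the name
extract_terms and the exact regular expressions are assumptions made
for illustration, and details such as case folding are left out.

```python
import re

ALNUM = "[A-Za-z0-9]"

# Capital letters with dots between them, e.g. I.B.M. (assumed pattern).
ACRONYM = rf"(?<!{ALNUM})[A-Z](?:\.[A-Z])+\.?(?!{ALNUM})"

# Alphanumeric runs, optionally joined by "&", e.g. AT&T.
WORD = rf"{ALNUM}+(?:&{ALNUM}+)*"

# A trailing "+" run, "#" or "-" is kept only when no further
# alphanumeric follows, so "well-known" still splits into two terms.
SUFFIX = rf"(?:(?:\++|[#-])(?!{ALNUM}|\+))?"

TERM_RE = re.compile(rf"(?P<acronym>{ACRONYM})|(?P<word>{WORD}{SUFFIX})")


def extract_terms(text):
    """Split text into terms following the rules described above."""
    terms = []
    for match in TERM_RE.finditer(text):
        if match.group("acronym"):
            # I.B.M. is treated as IBM: drop the dots.
            terms.append(match.group("acronym").replace(".", ""))
        else:
            terms.append(match.group("word"))
    return terms
```

Running extract_terms on the search term from the original question
gives aaa, bbb, ccc, ddd, eee, fff, ggg and hhh&iii, which lines up
with the eight terms in the query shown there. It also reproduces the
hyphen problem: "hyphen- ated" yields "hyphen-" and "ated".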
And I think we split on everything else which isn't alphanumeric, then
generate phrase searches when query terms are separated by one or
more of .-/':\_@ which covers contractions ("doesn't" etc), common
URLs, most email addresses, IP addresses (both v4 and v6), hostnames,
filenames, identifiers (like LD_LIBRARY_PATH), and class names in most
OO languages.
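
The phrase rule can be sketched the same way. Again this is only my
reading of the description, not the actual query parser code: the
function name group_phrases is hypothetical, and the sketch ignores the
"&" and trailing "+", "#", "-" handling to keep it short. Each returned
group with more than one term stands for a phrase search.

```python
import re

# Characters that join neighbouring terms into a phrase.
PHRASE_CHARS = ".-/':\\_@"

TOKEN_RE = re.compile(r"[A-Za-z0-9]+")


def group_phrases(text):
    """Group alphanumeric terms; terms separated only by phrase
    characters end up in the same group."""
    groups = []
    prev_end = None
    for match in TOKEN_RE.finditer(text):
        # Text between the previous term and this one.
        gap = "" if prev_end is None else text[prev_end:match.start()]
        if groups and gap and all(ch in PHRASE_CHARS for ch in gap):
            groups[-1].append(match.group())
        else:
            groups.append([match.group()])
        prev_end = match.end()
    return groups
```

With this, "LD_LIBRARY_PATH" becomes a single three-term group, while
"foo bar" stays as two separate one-term groups because a space is not
a phrase character.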
> Is there somewhere where this is documented? I'd like to try to match
> up my importing "splitting" to the queryparser.
I don't think there's fully detailed documentation for this currently,
mostly because I'm planning to revisit this area. The current strategy
results in a phrase search in cases where it's not needed and where a
phrase search is rather resource intensive.
So for now you'll have to look at the source code. The file whose
behaviour you need to match is indextext.cc in the Omega sources.
If you're coding in C++, it's probably best to just use those routines
directly.
Cheers,
Olly