thr3ads.net - Xapian discuss - [Xapian-discuss] queryparser characters question [Nov 2005]

Eric Parusel

2005-Nov-30 23:04 UTC

[Xapian-discuss] queryparser characters question

> http://www.xapian.org/docs/queryparser.html
>
> Phrase searches
>
> A phrase surrounded with double quotes ("") matches documents containing
> that exact phrase. Hyphenated words are also treated as phrases, as are
> cases such as filenames and email addresses (e.g. /etc/passwd or
> president@whitehouse.gov).

I'm a little unsure what gets split up and what doesn't.

I noticed that @#$%^ gets split on when in a query, but & doesn't...

eg: search term: aaa!bbb@ccc#ddd$eee%fff%ggg^hhh&iii
query: Xapian::Query((aaa:(pos=1) OR bbb:(pos=2) OR ccc:(pos=3) OR
ddd:(pos=4) OR eee:(pos=5) OR fff:(pos=6) OR ggg:(pos=7) OR
hhh&iii:(pos=8)))

Is there somewhere where this is documented?  I'd like to try to match
up my importing "splitting" to the queryparser.

Thanks!,
Eric

Olly Betts

2005-Dec-01 01:09 UTC

[Xapian-discuss] queryparser characters question

On Wed, Nov 30, 2005 at 03:03:52PM -0800, Eric Parusel wrote:
> I noticed that @#$%^ gets split on when in a query, but & doesn't...
> 
> eg: search term: aaa!bbb@ccc#ddd$eee%fff%ggg^hhh&iii
> query: Xapian::Query((aaa:(pos=1) OR bbb:(pos=2) OR ccc:(pos=3) OR
> ddd:(pos=4) OR eee:(pos=5) OR fff:(pos=6) OR ggg:(pos=7) OR
> hhh&iii:(pos=8)))

Yes, the rationale is that company names are often abbreviated to
terms containing "&", such as "AT&T", "M&S", "C&W", etc.

We also keep trailing "+" (e.g. C++), "#" (e.g. C#), and "-" (e.g. SO42-,
as in a sulphate ion).  The last is of debatable benefit, and has a
tendency to glue a hyphen onto the preceding word when the author
misses out a space, or from hardcopy formatted text where words are
hyphenated over linebreaks.  I've been wondering about dropping that
rule.
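The "&" and trailing-character rules can be approximated with a regular expression. This is only a sketch of the behaviour described in this message, not the Omega code. It assumes ASCII alphanumerics, and it assumes "trailing" means the "+", "#" or "-" is not immediately followed by another alphanumeric (which is consistent with "ccc#ddd" being split in the query quoted above). The names TERM_RE and split_terms are hypothetical.

```python
import re

# Assumption: a term is a run of ASCII alphanumerics, optionally joined by
# "&" (as in AT&T), optionally followed by trailing "+", "#" or "-".
# The lookahead stops a suffix character from being kept when more
# alphanumerics follow it, so "ccc#ddd" yields "ccc" and "ddd".
TERM_RE = re.compile(r"[A-Za-z0-9]+(?:&[A-Za-z0-9]+)*[+#-]*(?![A-Za-z0-9])")


def split_terms(text):
    """Split text into terms per the rules sketched above (hypothetical helper)."""
    return [m.group(0) for m in TERM_RE.finditer(text)]
```

With these assumptions, split_terms("AT&T C++ C# SO42-") gives the four terms AT&T, C++, C# and SO42-, while "well-known" is split into "well" and "known".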

There's special handling for capital letters with dots between (so
I.B.M. is treated as IBM).
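A minimal sketch of the acronym rule follows. It assumes an acronym is two or more single ASCII capitals separated by dots, with an optional final dot; the exact conditions in the Omega sources may differ. ACRONYM_RE and collapse_acronyms are hypothetical names.

```python
import re

# Assumption: single ASCII capitals separated by dots, at least two
# letters, with an optional trailing dot (I.B.M. or I.B.M).
ACRONYM_RE = re.compile(r"\b[A-Z](?:\.[A-Z])+\.?")


def collapse_acronyms(text):
    """Remove the dots from dotted acronyms (hypothetical helper)."""
    return ACRONYM_RE.sub(lambda m: m.group(0).replace(".", ""), text)
```

Under these assumptions, collapse_acronyms("I.B.M.") returns "IBM", and lowercase abbreviations such as "e.g." are left alone.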

And I think we split on everything else which isn't alphanumeric, then
generate phrase searches when query terms are separated by one or
more of .-/':\_@ which covers contractions ("doesn't" etc), common URLs,
most email addresses, ip addresses (both v4 and v6), hostnames,
filenames, identifiers (like LD_LIBRARY_PATH), classnames in most OO
languages.
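The phrase rule, read literally, can be sketched as below: consecutive words are grouped into one phrase when the gap between them consists only of the characters .-/':\_@. This is an illustration of the description in this message, not the parser's code. The real parser evidently applies further conditions, since the query quoted above shows "bbb@ccc" producing two OR'd terms rather than a phrase. The names PHRASE_GENERATORS, WORD_RE and group_phrases are hypothetical, and only ASCII alphanumerics are considered.

```python
import re

# The separator characters listed in the message.
PHRASE_GENERATORS = set(".-/':\\_@")
WORD_RE = re.compile(r"[A-Za-z0-9]+")


def group_phrases(text):
    """Group words into phrases; a group with more than one word is a phrase."""
    groups = []
    prev_end = None
    for m in WORD_RE.finditer(text):
        # Text between the previous word and this one (empty for the first word).
        gap = text[prev_end:m.start()] if prev_end is not None else ""
        if groups and gap and set(gap) <= PHRASE_GENERATORS:
            groups[-1].append(m.group(0))  # extend the current phrase
        else:
            groups.append([m.group(0)])    # start a new term or phrase
        prev_end = m.end()
    return groups
```

For example, group_phrases("/etc/passwd") yields a single two-word group, and "LD_LIBRARY_PATH" yields one three-word group, while words separated by a space or "!" end up in separate groups.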
> Is there somewhere where this is documented?  I'd like to try to match
> up my importing "splitting" to the queryparser.

I don't think there's fully detailed documentation for this currently,
mostly because I'm planning to revisit this area.  The current strategy
results in a phrase search in cases where it's not needed and where a
phrase search is rather resource intensive.

So you'll have to look at the source code currently.  The file you
need to match the behaviour of is indextext.cc in the Omega sources.
If you're coding in C++, it's probably best to just use those routines
directly.

Cheers,
    Olly
