I wrote:
> I think it's a bug. Or at least QueryParser uses a rather delicate rule
> for when to add a ":" between the prefix and the term, which scriptindex
> doesn't implement. The rule is undocumented (except in the code) so
> it's arguable who is correct.

I've been looking at this some more.
We need some way to distinguish the term prefix from the term itself.
The scheme Omega uses is that a single upper case letter is a term prefix,
unless it's an X. An X signals a longer term prefix.
So the question really is when such a prefix ends:
* omindex doesn't create multi-character prefixes, so has no real view on
  the matter. It does create single character prefixes, and just appends
  the value to them.
* scriptindex takes an optional "prefix" for boolean and text indexing.
  When text indexing, this prefix is simply prepended to the term, except
  that raw terms (which already get an R prefix) get a ":" inserted
  between the specified prefix and the R prefix if the specified prefix
  is 2 or more characters long and doesn't end in a ":" already. None of
  the examples for scriptindex include an explicit ":".
  For boolean terms, the prefix is simply prepended to the term.
  So when text indexing, "Olly" with prefix "XABC" is indexed as
  "XABColli" and "XABC:Rolly".
  The rationale for the ":" is that otherwise prefixes "XABC" and "XABCR"
  get confused. Note that scriptindex doesn't enforce the "X" rule for
  multi-character prefixes.
* omega only looks at prefixes directly when handling "B" parameters.
  Terms with the same prefix are OR-ed, then groups with different
  prefixes are AND-ed. E.g.
    (Ttext/plain OR Ttext/html) AND Hwww.xapian.org
  For this, it assumes a single character prefix unless it starts with
  X, in which case it takes the longest all uppercase prefix. If
  there's a ":" after this, it is ignored (not part of the prefix
  or the value).
* Xapian::QueryParser uses this code:
    if (prefix.length() > 1) {
        unsigned char back = prefix[prefix.length() - 1];
        if (back != ':') {
            if (!C_isupper(back) || C_isupdig(term[0])) {
                prefix += ':';
            }
        }
    }
  which doesn't match what the Omega indexers do especially well.
  If the prefix is a single character, or already has a ":", this doesn't
  do anything.
  For a multi-character prefix, this adds a ":" if the last character of
  the prefix isn't upper case (which is peculiar but harmless given the
  rules everything else uses). It also adds a ":" if the term starts with
  an uppercase letter (good) or a digit (bad).
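
To make the comparison concrete, here is a rough sketch of the three rules as described above. It isn't the actual Omega or xapian-core source: the function names are made up for illustration, is_upper() and is_updig() are ASCII-only stand-ins for C_isupper() and C_isupdig(), and I'm assuming a raw term reaches the QueryParser code with its R already attached.

```cpp
#include <string>
#include <utility>

// ASCII-only stand-ins for the C_isupper() and C_isupdig() helpers used in
// the QueryParser snippet above.
static bool is_upper(unsigned char ch) { return ch >= 'A' && ch <= 'Z'; }
static bool is_updig(unsigned char ch) {
    return is_upper(ch) || (ch >= '0' && ch <= '9');
}

// scriptindex text indexing as described above (hypothetical helper name).
// `raw` selects the unstemmed form, which gets an R prefix of its own.
std::string scriptindex_text_term(std::string prefix, const std::string& word,
                                  bool raw) {
    if (!raw) return prefix + word;
    // Only raw terms get the ":" separator, and only after a prefix of two
    // or more characters which doesn't already end in ":".
    if (prefix.length() > 1 && prefix.back() != ':') prefix += ':';
    return prefix + 'R' + word;
}

// The current QueryParser rule, wrapped in a function (hypothetical name).
// Assumption: `term` is the term before the field prefix is added, so a raw
// term arrives here as e.g. "Rolly".
std::string queryparser_term(std::string prefix, const std::string& term) {
    if (prefix.length() > 1) {
        unsigned char back = prefix[prefix.length() - 1];
        if (back != ':') {
            if (!is_upper(back) || is_updig(term[0])) {
                prefix += ':';
            }
        }
    }
    return prefix + term;
}

// omega's view of a "B" parameter: one character of prefix, unless it starts
// with X, in which case the longest all-uppercase run is the prefix.
// Returns (prefix, value).
std::pair<std::string, std::string> omega_split(const std::string& term) {
    if (term.empty()) return std::make_pair(std::string(), std::string());
    std::string::size_type len = 1;
    if (term[0] == 'X') {
        while (len < term.length() && is_upper(term[len])) ++len;
    }
    std::string::size_type value = len;
    // Assumption: the ":" separator is only skipped after an X prefix.
    if (term[0] == 'X' && value < term.length() && term[value] == ':') ++value;
    return std::make_pair(term.substr(0, len), term.substr(value));
}
```

With prefix "XABC", the first two functions agree on "XABColli" and "XABC:Rolly", but for a term like "2004" scriptindex_text_term() gives "XABC2004" while queryparser_term() gives "XABC:2004". And omega_split("XABCRolly") reports the prefix as "XABCR", which is the confusion the ":" is there to avoid.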
One issue is that what Omega chooses to do with prefixes is currently just
one way of using the library. Except for Xapian::QueryParser, the
xapian-core library really just treats terms as strings of bytes. Longer
term, perhaps we should look at supporting prefixes (and fielded searching
in general) more explicitly, even if it's by pushing a system similar to
the above down into the library where we can hide the oddities behind the
scenes better. In particular, it would be nice to split "document length"
per field so you can search just document titles or abstracts from papers
with the appropriate length corrections in the weights.
But in the short term, we want everything to be working in step. I think
the simplest fix (which in particular avoids requiring database rebuilds)
is to change Xapian::QueryParser to check C_isupper(term[0]) and drop the
C_isupper(back) test.
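
Roughly, the changed rule would behave like the sketch below. This is only an illustration of the proposal, not a patch: the function name is invented, is_upper() is an ASCII-only stand-in for C_isupper(), and as before I'm assuming a raw term arrives with its R already attached.

```cpp
#include <string>

// ASCII-only stand-in for Xapian's internal C_isupper() helper.
static bool is_upper(unsigned char ch) { return ch >= 'A' && ch <= 'Z'; }

// Proposed rule (hypothetical function name): for a multi-character prefix
// which doesn't already end in ":", add a ":" only when the term starts with
// an uppercase letter. The test on the last character of the prefix is gone,
// and a leading digit no longer triggers the ":".
std::string proposed_queryparser_term(std::string prefix,
                                      const std::string& term) {
    if (prefix.length() > 1 && prefix.back() != ':') {
        if (is_upper(term[0])) {
            prefix += ':';
        }
    }
    return prefix + term;
}
```

With prefix "XABC" this gives "XABColli", "XABC:Rolly" and "XABC2004", which is what scriptindex generates for text indexing, so existing databases keep working.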
If you've specified an explicit ":" to scriptindex, you'll also need to
specify it to Xapian::QueryParser (which currently you'll mostly get away
with not doing), but that's fair enough really.
I think scriptindex should perhaps also warn if you specify a
multi-character prefix which doesn't start with an "X", since Omega and
Xapian::QueryParser won't necessarily handle it as you'd hope.
Cheers,
Olly