On Fri, Oct 07, 2005 at 11:53:17PM -0700, Eric Parusel
wrote:> I don't believe Xapian supports wildcards on both sides of a term,
correct?
> Is this something that is technically unfeasable, unpalatable for
> performance reasons, or just not on the roadmap because it's a pain? :)
Taking "xap*" as an example, the way that right truncation is
implemented is that we find the prefix "xap" in the
"lexicon" (list of
all terms in the index), and step forward one term at a time until we
get to a term which doesn't have the prefix "xap". Then we OR
them all
together.
That's fairly efficient - the lexicon here is actually the postlist
table keys, and the postlist table is stored as a sorted Btree. So
after the initial find, stepping forward will usually be in the same
block or involve a single extra block read. Also we'll need to read
many or all of those same blocks to actually run the query.
It's possible to extend this to allow internal wildcards too - so a
search for "xap*n" could be handled in much the same way, except we
discard terms which don't end in an n. It would be less good if
the first letter was common and the first wildcard in the second
position, such as "e*phant", since that would have to look at all
terms starting with an e.
But this approach is simply unworkable for left truncation - finding all
terms matching "*ian" requires running through the whole postlist
tree.
On a large database that's simply unfeasible.
Right truncation and wildcarding are mostly useful in cases where a
word has variant spellings (e.g. color/colour) or you're unsure how to
spell something (e.g. a name you've only heard spoken). Right
truncation is also used for stemming (since in many languages the
inflections stemming removes are at the end of words). Left truncation
has fewer uses so is much less commonly asked for.
So the reason we don't have it is that it's harder to implement
efficiently, and not commonly requested.
> I would generally only use it for shorter strings, like email headers.
> I *could* potentially use right-truncating in sequence as below, but
I'm
> not sure if this is too insane.
> terms indexed: Axapian Aapian Apian Aian Aan
> search for: A<user_inputted_partial_term>*
> Which would obviously use alot of terms!
Doing the work at index time is the right approach, but the trick to use
is to index your terms reversed and use right truncation!
So generate the term "Anaipax" and look for terms starting
"Anai". You
can support right truncation in the same database by generating both
terms with a different prefix or some way to distinguish them (e.g. use
a ! to signify reversed, so "Axapian" and "A!naipax").
If you want to allow wildcards at both ends at the same time (e.g.
allowing a search for "*api*"), you'll need to do a fair bit of
work
outside Xapian. Even if it's within prefix "A", you'll need
to scan all
such prefixed terms.
The best approach is probably n-grams. You create a second index where
the "documents" are terms from the main index, and these are indexed
by
n-grams (substrings of length n). So "xapian" might be indexed by
"^x",
"^xa", "xap", "api", "pia",
"ian", "an$", and "n$" (and perhaps all the
bigrams and monograms too if you're doing this for real rather than
typing in an example!)
Then you can find all terms matching "*api*n" by searching for
"api" AND
"n$", and filtering the matching terms to remove any false positives
(xapian would also match *ian*xap* for example) and combine those left
with OR to make the subquery for "*api*n" when searching the main
database.
You need to make sure the second index is updated when a new term is
added to the main index. Tracking deleted terms isn't critical for
this use - such terms won't match anything in the main database anyway.
The same n-gram index can actually be used for spelling correction too -
search for all the n-grams from a possibly misspelled term and the
results are a ranked set of terms they might want.
Looking at the "edit distance" is another technique used for spelling
correction - that's the number of character inserts, deletes, or
transpositions between two terms; but that's a pairwise measure and
probably too expensive to calculate for every term in the lexicon, so
you probably need to generate a candidate set of terms to calculate it
for, and so need something like n-gram matching as a first step).
As for roadmaps, I'm afraid left truncation is low on my list of things
to work on. Spelling correction overlaps somewhat and is higher but I'm
trying to concentrate on flint right now. But if you want to work on it
yourself, I'm happy to give pointers on where to hook in to Xapian. And
if done cleanly, I think it's something that's worth including in
Xapian.
Cheers,
Olly