On Mon, Dec 07, 2009 at 05:18:41PM +0000, Nick Griffiths
wrote:> For example, say we have a set of document titles, like so:
>
> [1] - "welcome to the jungle",
> [2] - "welcome to thesaurus",
> [3] - "welcome all to the jungle",
> [4] - "all jungle"
>
> When passing the following queries to parse_query with QUERY_PARTIAL, we
> are seeing...
>
> w -> 1, 2, 3
You probably don't want to try to offer completions for a single character
query (just because there are generally too many possibile completions to
be able to do so usefully).
> welcome to the -> 1, 2, *3
> jung -> 4, 1, 3
> all jungle -> 4, *3, *1
I don't see how the last example can match 1 - is that a typo?
But otherwise this is as expected - what FLAG_PARTIAL does is to wildcard
expand the last term if it abuts the end of the query string.
> Which for our purposes is returning non-matching results (marked with an
> asterisk). If we switch to using the QUERY_PHRASE flag, we avoid the
> extraneous results, but we lose the partial results. Also, with a large
> database (4mil documents) the first query takes an incredibly long time.
There isn't a QUERY_PHRASE flag - there is FLAG_PHRASE (and you wrote
QUERY_PARTIAL above where you meant FLAG_PARTIAL), but that controls
support for quoted phrases so shouldn't change anything here.
Do you mean you are setting the default op to OP_PHRASE? That's not going
to interact efficiently with FLAG_PARTIAL when the last term has many
possible completions. Also it defaults to a window size such that the
words don't have to be adjacent (so `welcome jungle' will match 1 and
3).
If OP_PHRASE were improved to work more efficiently with subqueries which
weren't just terms this might be a viable approach - the window size could
be made configurable fairly easily.
> One idea is to add all possible incomplete versions of each term to the
> document's term list. For the term 'welcome' we'd also add
'Sw',
> 'Swe', 'Swel' ... 'Swelcome'. That way we can
adjust the user's query
> to insert the prefix on the final word and then search via a standard
> phrase search. Not sure what the performance or index size implications
> of this might be, but we'd be willing to have a shot if this is a good
> approach.
Modifying the user's query string before parsing isn't something we
recommend as you need to duplicate the behaviour of QueryParser to do
it safely.
It is likely to take a lot of space, as you are going to end up generating
a lot of terms and need to index them all with positions.
It's quite a lot like the wildcard acceleration idea here, but that would
have the advantages of note having to mess with query strings, and not
having to index every possible completion:
http://trac.xapian.org/ticket/207
Cheers,
Olly