On Mon, May 15, 2006 at 03:57:13PM -0700, Alexander Lind wrote:
> Is there a way to do adaptive query scoring (as in popular results
> returned by a query should get more weight because they are getting
> clicked more often) in xapian? Is this what the rset class should be
> used for?
You could use the RSet to achieve something like this by recording
which documents users like for which queries and setting an RSet from
that when there's a query for the same terms. It would probably
make sense to use a second Xapian database to store the queries matching
each document click so you'd run a search on that to find what to set
the RSet as on the main database.
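
A minimal sketch of the bookkeeping side of this, assuming an in-memory
dictionary stands in for the second Xapian database. The class and method
names (ClickLog, record, relevant_docs) are hypothetical, and "the same
terms" is taken here to mean the same set of lowercased words. The document
ids it returns are what you would then put in the RSet for the main database.

```python
from collections import defaultdict


class ClickLog:
    """In-memory stand-in (an assumption) for the second database of clicks."""

    def __init__(self):
        # Maps a normalised query (set of terms) to the clicked document ids.
        self._clicks = defaultdict(set)

    @staticmethod
    def _key(query):
        # Assumption: queries are "the same" if they contain the same words,
        # ignoring case and word order.
        return frozenset(query.lower().split())

    def record(self, query, doc_id):
        """Remember that a user clicked doc_id after searching for query."""
        self._clicks[self._key(query)].add(doc_id)

    def relevant_docs(self, query):
        """Return the document ids previously clicked for the same terms."""
        return sorted(self._clicks.get(self._key(query), set()))


log = ClickLog()
log.record("xbox console", 7)
docs_for_rset = log.relevant_docs("console xbox")  # [7]
```

A real second database would let you find similar past queries too, rather
than only exact matches on the term set as this sketch does.
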
> I could write a php app to do adaptive results scoring for separate
> words (just recording the clicks and then have a cron:ned script add
> weight to the document_id:s for the recorded words)
That would be another way - you could add a prefixed term (e.g.
XCLICKfoo) to those documents which the user selected when they
had searched for "foo". Then turn a search term "foo" into
(foo ANDMAYBE XCLICKfoo), i.e. a document must match foo, and if
XCLICKfoo also matches, its weight is added too.
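
As a rough illustration of the prefixed-term idea, here is a sketch in plain
Python that doesn't touch Xapian itself. The helper names (click_term,
boosted_query, record_click) are hypothetical, lowercasing the words is an
assumption, and the tuple only describes the query shape; in Xapian you would
build the real thing with Query::OP_AND_MAYBE.

```python
def click_term(word, prefix="XCLICK"):
    """Return the prefixed click term for one search word."""
    return prefix + word.lower()


def boosted_query(word):
    """Describe (word AND_MAYBE XCLICKword) as a nested tuple.

    This is only a description of the query structure, not a Xapian object.
    """
    return ("AND_MAYBE", word.lower(), click_term(word))


def record_click(doc_terms, doc_id, query):
    """Note the click terms to add to doc_id for each word of the query.

    doc_terms is a dict mapping document id to a set of extra terms; a
    cron job could later apply these to the documents in the database.
    """
    extra = doc_terms.setdefault(doc_id, set())
    for word in query.split():
        extra.add(click_term(word))


pending = {}
record_click(pending, 42, "foo")
shape = boosted_query("foo")  # ("AND_MAYBE", "foo", "XCLICKfoo")
```
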
> but I don't see a
> clear way to do this for phrase searches (as in document_id x should get
> more weight if the search is for (and only for) 'xbox console', not
> 'xbox' or 'xbox games' or whatnot.
I'm not totally sure that matters - for the example you give, there's
going to be a very strong correlation. There certainly are words which
have many meanings where there's less correlation (e.g. 'stock market'
vs 'vegetable stock') and even word order can make a big difference
(e.g. 'oil bath' vs 'bath oil'). But for the 'stock' example, a query
for just 'stock' could usefully promote results from both, and a query
for 'stock market' would have 'market' in too, so although the cookery
pages would get a boost, the financial pages would get a larger one.
In fact, I suspect you would improve retrieval overall simply by
favouring pages which somebody has clicked on for some query (especially
for a search over random web sites - the web is full of useless junk
which nobody will ever want in their results). That approach is
particularly susceptible to "clickbot" abuse though.
But anyway, if you want to work with phrases, the hard part is to decide
what's a phrase. Then just generate a term for the phrase e.g.
"XCLICKxbox console". If you're going to treat the whole query as a
phrase, I'd suggest you try generating terms from adjacent word pairs
(so 'natural history museum' gives "XCLICKnatural history" and
"XCLICKhistory museum").
I'd love to hear how you get on.
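
For what it's worth, the adjacent-word-pair suggestion above can be sketched
in a few lines of plain Python. The function name click_pair_terms is
hypothetical, and lowercasing plus falling back to a single-word term for
one-word queries are assumptions on my part.

```python
def click_pair_terms(query, prefix="XCLICK"):
    """Generate prefixed click terms from adjacent word pairs of a query."""
    words = query.lower().split()
    if len(words) < 2:
        # Assumption: a one-word query just gets its single-word click term.
        return [prefix + w for w in words]
    # Pair each word with the one that follows it.
    return [f"{prefix}{a} {b}" for a, b in zip(words, words[1:])]


pairs = click_pair_terms("natural history museum")
# ["XCLICKnatural history", "XCLICKhistory museum"]
```
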
Cheers,
Olly