On Wed, Nov 10, 2004 at 11:29:24AM +0000, James Aylett wrote:
> For the record, Lucene's scoring algorithm is, roughly:
>
> score_d = sum_t( tf_q * idf_t / norm_q * tf_d * idf_t / norm_d_t)
Implementing a Xapian::Weight subclass for this would be pretty easy.
I'm not sure if there's much point, though it might make a good worked
example for documenting how to implement your own weighting scheme in
Xapian.
> Make that:
>
> score_d = sum_t(tf_q * idf_t / norm_q * tf_d * idf_t / norm_d_t *
> boost_t) * coord_q_d
>
> where
>
> boost_t : the user-specified boost for term t
> coord_q_d : number of terms in both query and document / number of
> terms in query
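To make the quoted formula concrete, here's a rough sketch in plain Python. This isn't Xapian or Lucene API; the function name, the parameter names, and the dict-based inputs are all made up for illustration, and it assumes the tf, idf, and norm values have already been computed elsewhere.

```python
def lucene_style_score(query_tf, doc_tf, idf, norm_q, norm_d, boost=None):
    """Score one document using the quoted formula (illustrative only).

    query_tf: dict term -> tf_q       doc_tf: dict term -> tf_d
    idf:      dict term -> idf_t      norm_q: query normalisation factor
    norm_d:   dict term -> norm_d_t   boost:  dict term -> boost_t (default 1.0)
    """
    if not query_tf:
        return 0.0
    boost = boost or {}
    total = 0.0
    matched = 0
    for term, tf_q in query_tf.items():
        tf_d = doc_tf.get(term, 0)
        if tf_d == 0:
            continue  # term not in document: contributes nothing
        matched += 1
        query_part = tf_q * idf[term] / norm_q
        doc_part = tf_d * idf[term] / norm_d[term]
        total += query_part * doc_part * boost.get(term, 1.0)
    # coord_q_d: only known once every query term has been checked
    coord = matched / len(query_tf)
    return total * coord
```

Note that the sum is built up term by term, but coord can only be applied at the very end, once the number of matching terms is known.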
I suspect you'd need to tweak the matcher to allow coord_q_d to be used
like this in Xapian. The matcher handles the components of the weight
individually, and it needs to know them before it knows how many query
terms match a particular document. It can sometimes reject a document
based on partial weight information before it has even looked at whether
all of the terms match (because it's possible that even if they all
match, they can't give the document enough score to beat the best 10
(or however many) already seen).
Actually, you can return a very large value for the maximum weights,
which will disable this optimisation. Matches will run a bit more
slowly, but it would provide an easy way to evaluate Lucene's weighting
scheme against BM25 and any other weighting scheme you can implement for
Xapian.
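The early-rejection idea, and why a huge maximum weight switches it off, can be shown with a toy model. This is only a sketch of the principle, not the matcher's actual code; can_skip and its arguments are hypothetical names.

```python
import math


def can_skip(partial_weight, remaining_max_weights, min_needed):
    """Return True if the document cannot beat min_needed even if every
    remaining term matches with its maximum possible weight."""
    best_possible = partial_weight + sum(remaining_max_weights)
    return best_possible <= min_needed


# With honest upper bounds the document can be rejected early.
print(can_skip(1.0, [0.5, 0.5], 3.0))            # True
# With "very large" maximum weights it can never be rejected.
print(can_skip(1.0, [math.inf, math.inf], 3.0))  # False
```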
Alternatively, for an AND query you can just ignore this as it's then a
constant factor.
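A quick way to see the AND case, again in illustrative Python with a made-up helper name and assuming a non-empty query: every document an AND query matches contains all of the query terms, so coord_q_d is always 1 and can't change the ranking.

```python
def coord(query_terms, doc_terms):
    """coord_q_d: terms in both query and document / terms in query."""
    query_terms = set(query_terms)
    return len(query_terms & set(doc_terms)) / len(query_terms)


# A document matched by "a AND b" must contain both terms.
print(coord(["a", "b"], ["a", "b", "c"]))  # 1.0
# Under OR, a partial match is possible and coord drops below 1.
print(coord(["a", "b"], ["a"]))            # 0.5
```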
Cheers,
Olly