No mention of where it came from, but:
----------------------------------------------------------------------
For the record, Lucene's scoring algorithm is, roughly:
score_d = sum_t( tf_q * idf_t / norm_q * tf_d * idf_t / norm_d_t)
where:
score_d : score for document d
sum_t : sum for all terms t
tf_q : the square root of the frequency of t in the query
tf_d : the square root of the frequency of t in d
idf_t : log(numDocs/docFreq_t+1) + 1.0
numDocs : number of documents in index
docFreq_t : number of documents containing t
norm_q : sqrt(sum_t((tf_q*idf_t)^2))
norm_d_t : square root of number of tokens in d in the same field
as t
(I hope that's right!)
[Doug later added...]
Make that:
score_d = sum_t(tf_q * idf_t / norm_q * tf_d * idf_t / norm_d_t *
boost_t) * coord_q_d
where
boost_t : the user-specified boost for term t
coord_q_d : number of terms in both query and document / number of
terms in query
The coordination factor gives an AND-like boost to documents that
contain, e.g., all three terms in a three word query over those that
contain just two of the words.
----------------------------------------------------------------------
<http://lucene.sourceforge.net/cgi-bin/faq/faqmanager.cgi?file=chapter.search&toc=faq#q31>
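
The formula quoted above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not Lucene's code: it handles a single field, the names (lucene_score, doc_freq, boosts) are made up for the example, and it reads the idf as log(numDocs / (docFreq_t + 1)) + 1.0, which is an assumption about the intended grouping.

```python
import math
from collections import Counter


def idf(num_docs, doc_freq_t):
    # Assumed grouping: log(numDocs / (docFreq_t + 1)) + 1.0
    return math.log(num_docs / (doc_freq_t + 1)) + 1.0


def lucene_score(query_terms, doc_tokens, num_docs, doc_freq, boosts=None):
    """Score one single-field document against a query.

    query_terms -- list of query terms (repeats allowed)
    doc_tokens  -- list of tokens in the document's field
    num_docs    -- number of documents in the index
    doc_freq    -- dict: term -> number of documents containing it
    boosts      -- optional dict: term -> user-specified boost (default 1.0)
    """
    boosts = boosts or {}
    q_counts = Counter(query_terms)
    d_counts = Counter(doc_tokens)
    if not q_counts:
        return 0.0

    idfs = {t: idf(num_docs, doc_freq.get(t, 0)) for t in q_counts}

    # norm_q = sqrt(sum_t((tf_q * idf_t)^2))
    norm_q = math.sqrt(
        sum((math.sqrt(q_counts[t]) * idfs[t]) ** 2 for t in q_counts)
    )
    # norm_d_t = sqrt(number of tokens in the field); one field assumed here
    norm_d = math.sqrt(len(doc_tokens))

    total = 0.0
    matched = 0
    for t, q_freq in q_counts.items():
        d_freq = d_counts.get(t, 0)
        if d_freq == 0:
            continue  # term absent from the document contributes nothing
        matched += 1
        tf_q = math.sqrt(q_freq)
        tf_d = math.sqrt(d_freq)
        total += (tf_q * idfs[t] / norm_q) * (tf_d * idfs[t] / norm_d) \
            * boosts.get(t, 1.0)

    # coord_q_d = terms in both query and document / terms in query
    coord = matched / len(q_counts)
    return total * coord


# A three-word query: the document with all three terms should beat
# the one with only two of them.
df = {"a": 2, "b": 2, "c": 2}
full = lucene_score(["a", "b", "c"], ["a", "b", "c"], 10, df)
partial = lucene_score(["a", "b", "c"], ["a", "b", "x"], 10, df)
print(full > partial)
```

With equal idf values and equal document lengths, the two-term document loses twice: once because it sums two term weights instead of three, and again through the coordination factor of 2/3.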
--
/--------------------------------------------------------------------------\
James Aylett xapian.org
james@tartarus.org uncertaintydivision.org
On Wed, Nov 10, 2004 at 11:29:24AM +0000, James Aylett wrote:
> For the record, Lucene's scoring algorithm is, roughly:
>
> score_d = sum_t( tf_q * idf_t / norm_q * tf_d * idf_t / norm_d_t)

Implementing a Xapian::Weight subclass for this would be pretty easy.
I'm not sure if there's much point, though it might make a good worked
example for documenting how to implement your own weighting scheme in
Xapian.

> Make that:
>
> score_d = sum_t(tf_q * idf_t / norm_q * tf_d * idf_t / norm_d_t *
> boost_t) * coord_q_d
>
> where
>
> boost_t : the user-specified boost for term t
> coord_q_d : number of terms in both query and document / number of
> terms in query

I suspect you'd need to tweak the matcher to allow coord_q_d to be used
like this in Xapian. The matcher handles the components of the weight
individually, and it needs to know them before it knows how many query
terms match a particular document. It can sometimes reject a document
based on partial weight information before it has even looked at whether
all of the terms match (because it's possible that even if they all
match, they can't give the document enough score to beat the best 10 (or
however many) already seen).

Actually, you can return a very large value for the maximum weights,
which will disable this optimisation. Matches will run a bit more
slowly, but it would provide an easy way to evaluate Lucene's weighting
scheme against BM25 and any other weighting scheme you can implement for
Xapian.

Alternatively, for an AND query you can just ignore this as it's then a
constant factor.

Cheers,
Olly
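
The early-rejection idea described above can be illustrated with a toy check. This is not Xapian's matcher code, only a sketch of the principle, and the names (can_reject, remaining_max_weights, threshold) are hypothetical: a document is dropped when its weight so far, plus the maximum each unexamined term could still add, cannot reach the lowest weight currently in the top results.

```python
import math


def can_reject(partial_weight, remaining_max_weights, threshold):
    """Return True if the document cannot reach `threshold`
    even if every remaining term matches with its maximum weight."""
    return partial_weight + sum(remaining_max_weights) < threshold


# Finite upper bounds: 1.0 + 0.5 + 0.5 = 2.0 cannot reach 3.0, so reject.
print(can_reject(1.0, [0.5, 0.5], 3.0))

# A very large (here infinite) maximum weight means nothing is ever
# rejected early, which is the effect of disabling the optimisation.
print(can_reject(1.0, [math.inf, 0.5], 3.0))
```

Because the bound is built from per-term maxima, a factor such as coord_q_d that depends on how many terms end up matching is not known at the point where this check runs.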