On Thu, Oct 28, 2004 at 03:46:38PM +0100, James Aylett wrote:
> Kevin Burton has posted about poor ranking in Lucene preferring
> shorter documents over longer ones[1].
It's not totally clear he's right, actually. After all, which would you
prefer - a document which tells you what you want to know? Or 3 copies
of the same document appended to each other?
The example is just a bit too artificial.
> Anyone know what Lucene is doing here? Their FAQ doesn't mention what
> weighting scheme they use, and I don't have time to investigate
> further right now ...
I'd guess there's a mechanism in Lucene's weighting scheme to counteract
the natural tendency of weighting schemes to prefer long documents (as
there is in BM25 which Xapian uses). Without such a mechanism long
documents will tend to rank highly because they tend to have high
within-document-frequency. Long documents match disproportionately many
queries anyway (because they typically contain more distinct words). It
sounds like perhaps their mechanism is a little too aggressive.
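To illustrate the sort of length normalisation I mean, here's a rough
Python sketch of the textbook BM25 term weight. It is not Xapian's actual
code (which has more parameters), and it says nothing about what Lucene
really does. The function name, the collection statistics and the two
example documents are all made up for illustration.

```python
import math


def bm25_term_weight(wdf, doc_len, avg_len, n_docs, n_match, k1=1.2, b=0.75):
    """Textbook BM25 weight of one query term in one document.

    wdf     - within-document frequency of the term
    doc_len - length of this document (in terms)
    avg_len - average document length in the collection
    n_docs  - number of documents in the collection
    n_match - number of documents containing the term
    b       - strength of length normalisation (0 = none, 1 = full)
    """
    idf = math.log((n_docs - n_match + 0.5) / (n_match + 0.5) + 1.0)
    norm_len = doc_len / avg_len
    tf_part = wdf * (k1 + 1) / (wdf + k1 * ((1 - b) + b * norm_len))
    return idf * tf_part


# Hypothetical collection statistics, chosen only for the example.
STATS = dict(avg_len=100, n_docs=1000, n_match=10)

# A document, and the same document appended to itself three times:
# both the term count and the length triple.
results = {}
for b in (0.0, 0.75, 1.0):
    single = bm25_term_weight(wdf=2, doc_len=100, b=b, **STATS)
    tripled = bm25_term_weight(wdf=6, doc_len=300, b=b, **STATS)
    results[b] = (single, tripled)
    print(f"b={b}: single={single:.3f} tripled={tripled:.3f}")
```

With b=0 there is no normalisation and the tripled document wins
comfortably; with b=1 the two weights coincide; values in between leave
the tripled document slightly ahead. A scheme which ranks the tripled
copy lower than the original is normalising harder than even b=1 does
here.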
Incidentally, it's odd that a document containing only one occurrence of
the only search term doesn't match at 100% in Lucene. How much better
could it be? (Well, actually "foo" is arguably a totally useless result
since I already know it - but I doubt that idea is built into their
weighting scheme).
Cheers,
Olly