On Thu, Nov 04, 2004 at 07:30:31PM +0000, Andrew MacFarlane wrote:
> >We ought to do some testing to see whether our choice of BM25 parameters
> >is good or not. I don't recall how we chose them originally - it may be
> >they came from Stephen Robertson, but it rather looks like we just used
> >the parameters which would give TradWeight, but with 0.5 for D which
> >gives the halfway point between BM11 and BM15.
Reading through a few of the papers Stephen Robertson has coauthored
which talk about BM25, I've found suggested default parameter values of:

b (which Xapian calls D) "around 0.75" (we default to 0.5)
k1 (which Xapian calls B) 1.2 (we default to 1)
k3 (which Xapian calls A) "7 or 1000 (effectively infinite)" (we default to 1)

I haven't yet seen any suggested default for k2 (in Xapian C = 2 * k2).
Some papers omit the extra term which uses this constant altogether,
which is equivalent to using k2 = 0 (Xapian's current default).
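
To make it clearer where each constant enters, here is a rough sketch of
the BM25 weight in the form the papers usually give it, using the standard
parameter names. This is not Xapian's implementation: Bm25Params,
bm25_term_weight and bm25_length_correction are names made up for
illustration, and the defaults in the struct are the values suggested in the
papers (taking 7 for k3), not Xapian's current defaults.

```cpp
#include <cassert>
#include <cmath>

// Hypothetical parameter bundle using the standard names from the papers.
struct Bm25Params {
    double k1 = 1.2;   // term-frequency saturation (Xapian's B)
    double k2 = 0.0;   // length-correction term (Xapian's C is 2 * k2)
    double k3 = 7.0;   // query-term-frequency saturation (Xapian's A)
    double b  = 0.75;  // length normalisation (Xapian's D)
};

// Weight contributed by one query term to one document.
// tf: within-document frequency, qtf: within-query frequency,
// num_docs: collection size, docs_with_term: documents indexed by the term.
double bm25_term_weight(const Bm25Params& p,
                        double tf, double qtf,
                        double doc_len, double avg_doc_len,
                        double num_docs, double docs_with_term)
{
    assert(avg_doc_len > 0.0);
    // Robertson/Sparck Jones weight with no relevance information.
    double idf = std::log((num_docs - docs_with_term + 0.5) /
                          (docs_with_term + 0.5));
    // b = 1 gives full length normalisation (BM11), b = 0 gives none (BM15).
    double K = p.k1 * ((1.0 - p.b) + p.b * doc_len / avg_doc_len);
    double tf_part = (p.k1 + 1.0) * tf / (K + tf);
    double qtf_part = (p.k3 + 1.0) * qtf / (p.k3 + qtf);
    return idf * tf_part * qtf_part;
}

// Per-document correction added once per query; it vanishes when k2 = 0.
double bm25_length_correction(const Bm25Params& p, double query_len,
                              double doc_len, double avg_doc_len)
{
    assert(avg_doc_len + doc_len > 0.0);
    return p.k2 * query_len * (avg_doc_len - doc_len) /
           (avg_doc_len + doc_len);
}
```

With b = 0 the document length drops out of the term weight entirely, and
with k2 = 0 the correction term is always zero, which is why papers that
leave that term out are effectively using k2 = 0.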
Incidentally, I think it's confusing that Xapian has a unique naming scheme
for the parameters, while most other references are consistent in their
choice of names. I think we should use the standard names instead.
We could just change the parameter names, which would keep compatibility
with existing code - this would make the constructor look like this:
    BM25Weight(double k3, double k1, double k2, double b, double min_normlen)
If we wanted to put k1, k2, k3 in a more natural order, we'd need to break
compatibility (which will probably affect very few people as I suspect
most use the default parameters) or find a way to support the old parameter
ordering (such as adding a new constructor with a different signature,
or having a new name for the updated class - Xapian::NewBM25Weight or
something).
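
One wrinkle with the "new constructor" route: both orderings take five
doubles, so a plain overload can't tell them apart. Below is a minimal
sketch of one possible way round that, with the natural order on the
constructor and a named factory for the old order. Bm25WeightSketch and
from_legacy_order are hypothetical names for illustration only and are not
part of Xapian's API.

```cpp
#include <cassert>

// Hypothetical stand-in for the weighting class; not Xapian's actual API.
class Bm25WeightSketch {
  public:
    // Constructor with the standard names in natural order.
    Bm25WeightSketch(double k1, double k2, double k3, double b,
                     double min_normlen)
        : k1_(k1), k2_(k2), k3_(k3), b_(b), min_normlen_(min_normlen)
    {
        // b interpolates between BM15 (0) and BM11 (1).
        assert(b >= 0.0 && b <= 1.0);
    }

    // Named factory for callers still passing the old argument order
    // (k3, k1, k2, b, min_normlen). It only reorders the arguments.
    static Bm25WeightSketch from_legacy_order(double k3, double k1,
                                              double k2, double b,
                                              double min_normlen)
    {
        return Bm25WeightSketch(k1, k2, k3, b, min_normlen);
    }

    double k1() const { return k1_; }
    double k2() const { return k2_; }
    double k3() const { return k3_; }
    double b() const { return b_; }
    double min_normlen() const { return min_normlen_; }

  private:
    double k1_, k2_, k3_, b_, min_normlen_;
};
```

Existing callers would have to switch to the factory, so this still isn't
source compatible; it just makes the old ordering explicit at the call site
rather than silently reinterpreting the arguments.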
Any thoughts?
> >Andy MacFarlane did some evaluation work in the BrightStation days, but
>
> yes, but I didn't do any comparisons with other weighting schemes. I did do a
> lot of parameter tuning, but any conclusions I drew from it would not be that
> useful as these parameters vary from collection to collection.
It'd be good to know that the default parameters Xapian uses for BM25
are a good general purpose choice though. Most people are likely to
just go with the defaults, and they'll certainly affect people's initial
impressions.
Incidentally, do you know if anyone has looked at whether you can
automatically tune the constants by looking at statistics the computer
can calculate about the collection?
Cheers,
Olly