Olly Betts <olly <at> survex.com> writes:
>
> On Sat, Feb 24, 2007 at 02:35:10PM +0800, Robert Young wrote:
> > Pardon me if this has been asked before, is there a way to do a sort by:
> >
> > (relevance + constant * value)
> >
> > efficiently? Is this planned or feasible?
>
> It's not currently possible, but provided you can give an upper bound
> for value, it's possible to implement and should be pretty efficient
> if the bound is reasonable.
>
> I looked at the idea some time ago, and the experimental "match bias"
> was the result - this is hardwired to add a weight which decays
> exponentially with date (the idea being that for something like a news
> search, more recent articles will tend to be more relevant).
>
> However, Rusty Conover's patch to add an "ExternalSourcePostList" would
> allow a general implementation of this idea:
>
> http://thread.gmane.org/gmane.comp.search.xapian.general/4061
>
> It just needs to gain the ability to return values for get_weight() and
> get_maxweight(), and then you can implement a subclass which indexes
> every document and returns "constant * value" as the weight. This can
> then be ANDed with the query to obtain the desired result.
>
> > Incidentally, I believe by a very crude simplification, relevance +
> > pagerank is what Google is using?
>
> They don't openly document what they use, but it's presumably some
> function of statistical relevance and pagerank. But other factors may
> be involved, and it might not be a linear combination.
>
> Cheers,
> Olly
>
I am using the Python bindings and trying to do something similar to the above.
In particular I want to be able to normalize the relevance and
value scores so I can do a weighted sum for the final document weighting.
Correct me if I am wrong: subclassing PostingSource will allow me to return
weights based on document values, but there is no way to determine a
multiplier that is scaled relative to the relevance weighting?
I am looking at adding an operator that is similar to OP_AND_MAYBE, but does a
normalized weighted sum rather than a simple addition of the weights from the
left and right posting lists.
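
To make the combination concrete, here is a rough sketch in plain Python of
the kind of normalized weighted sum meant here. It does not call any Xapian
API. The function name, the alpha parameter, and the sample numbers are all
made up for illustration, and it assumes upper bounds for both the relevance
weight and the value are known in advance (which is the part that is hard to
get from inside a posting source).

```python
def normalized_weighted_sum(relevance, value, max_relevance, max_value, alpha=0.5):
    """Mix a relevance weight and a document value into one score in [0, 1].

    Each input is divided by its upper bound first, so alpha directly sets
    the share of the final score that comes from relevance.
    """
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must be between 0 and 1")
    rel = relevance / max_relevance if max_relevance > 0 else 0.0
    val = value / max_value if max_value > 0 else 0.0
    return alpha * rel + (1.0 - alpha) * val


# Hypothetical (docid, relevance weight, value) triples, for illustration only.
candidates = [(1, 4.0, 10.0), (2, 8.0, 2.0), (3, 6.0, 6.0)]

# Assumed upper bounds; here they are simply taken from the sample data.
max_relevance = max(rel for _, rel, _ in candidates)
max_value = max(val for _, _, val in candidates)

# Rank by the combined score, with 70% of the score coming from relevance.
ranked = sorted(
    candidates,
    key=lambda c: normalized_weighted_sum(
        c[1], c[2], max_relevance, max_value, alpha=0.7
    ),
    reverse=True,
)
ranked_docids = [docid for docid, _, _ in ranked]
```

The point of dividing by the bounds is that a plain addition lets whichever
side has the larger numeric range dominate, whereas here alpha means the same
thing regardless of the scale of either score.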
Am I missing some easy subclassing implementation?
Thanks in advance.
Cheers,
Doug