Gaurav Arora
2012-Apr-27 04:58 UTC
[Xapian-devel] Handling Negative value due to logarithm of probabilities.
Hi,

In continuation of the discussion in the Melange comments, about the negative values returned in the matcher due to taking logarithms of probabilities:

> If we make K suitably large, we could clamp each log(K.Pi) to be >= 0,
> and this change will only affect really low probability terms (those
> with Pi < 1/K, so you can adjust K to suit):
>
> W' = sum(i=1,...,n, max(log(K.Pi), 0))

Did you mean that for a low-probability term the value returned by log(K.Pi) would be negative, so we replace the low-probability case, which still gives a negative value, with 0?

Assigning 0 would be equivalent to rejecting the term from the query completely, which hurts retrieval performance in the Language Model, since terms missing from a document are normally smoothed with their collection frequency.

I think we should try smoothing from collection statistics when the document term probability doesn't work (i.e. generates a negative value):

W' = sum(i=1,...,n,
         if (max(log(K.Pi), 0) == 0)
             max(log(K.Pcollec.i), 0)
         else
             log(K.Pi))

In case both don't work, returning 0 would be the only option.

Moreover, selecting a large enough K would be a tricky task, as no K can be large enough in general, since log(x) -> -inf as x -> 0.

Should we approach selecting the value of K statistically? By that I mean: run the unigram weighting scheme on a large collection, observe the lowest probability that occurs, and approximate the value of K from that; or is there some other method?

I asked the same question on Stack Overflow:

http://goo.gl/ykwN4

They suggested:

> "Could you simply take the negative of the logarithm? Since you are
> dealing with probabilities (i.e. values <= 1), the logarithm is always
> negative, so negating it will always make it positive."

But this approach won't be a good idea, because large values would indicate low probabilities and small values would indicate high probabilities. Hence the matcher would tend to skip some good documents from the ranked list due to their lower weight.

Thanks,

-- 
with regards
Gaurav A.
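[Editor's note: a minimal C++ sketch of the fallback scheme proposed above. This is illustrative only, not Xapian API; the function name and the assumption that per-term document and collection probabilities arrive as plain doubles are both hypothetical.]

```cpp
#include <algorithm>
#include <cmath>
#include <vector>

// Hypothetical sketch: clamp log(K * Pi) at 0, and when the document term
// probability is clamped away (Pi < 1/K), fall back to the collection
// probability, itself clamped at 0. If both clamp to 0, the term simply
// contributes 0 to the weight.
double lm_weight(const std::vector<double>& p_doc,
                 const std::vector<double>& p_coll,
                 double K)
{
    double w = 0.0;
    for (std::size_t i = 0; i < p_doc.size(); ++i) {
        double term = std::max(std::log(K * p_doc[i]), 0.0);
        if (term == 0.0) {
            // Smooth with the collection probability instead.
            term = std::max(std::log(K * p_coll[i]), 0.0);
        }
        w += term;
    }
    return w;
}
```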
Olly Betts
2012-Apr-27 06:29 UTC
[Xapian-devel] Handling Negative value due to logarithm of probabilities.
On Fri, Apr 27, 2012 at 10:28:30AM +0530, Gaurav Arora wrote:
> In continuation of the discussion of the Melange comments, about the
> negative values returned in the matcher due to logarithms of
> probabilities.
>
> > If we make K suitably large, we could clamp each log(K.Pi) to be
> > >= 0, and this change will only affect really low probability terms
> > (those with Pi < 1/K, so you can adjust K to suit):
> >
> > W' = sum(i=1,...,n, max(log(K.Pi), 0))
>
> Did you mean that for a low-probability term the value returned by
> log(K.Pi) would be negative, so we replace the low-probability case,
> which still gives a negative value, with 0?

Yes, that was my thought.

> Assigning 0 would be equivalent to rejecting the term from the query
> completely, which hurts retrieval performance in the Language Model,
> since terms missing from a document are normally smoothed with their
> collection frequency.

Isn't having a negative value here rejecting it at least as much? By clamping those negative values of log(K.Pi), it seems to me that we're increasing the amount of importance we attribute to such terms, not rejecting them.

If you consider the original product, what we're doing is:

product(i=1,...,n, max(Pi, 1/K))

I understand that in the ngram LM models, it's common to substitute a non-zero value for the frequency of an ngram which isn't present, which suggests the above isn't too unreasonable a thing to do.

> I think we should try smoothing from collection statistics when the
> document term probability doesn't work (i.e. generates a negative
> value):
>
> W' = sum(i=1,...,n,
>          if (max(log(K.Pi), 0) == 0)
>              max(log(K.Pcollec.i), 0)
>          else
>              log(K.Pi))
>
> In case both don't work, returning 0 would be the only option.

Yes, perhaps that's better.

> Moreover, selecting a large enough K would be a tricky task, as no K
> can be large enough in general, since log(x) -> -inf as x -> 0.

Well, I'm not saying we should try to pick K such that we never clamp, just large enough that the clamping is fairly rare.

How is Pi actually calculated? I'm not sure I've seen that detail anywhere.

> Should we approach selecting the value of K statistically? By that I
> mean: run the unigram weighting scheme on a large collection, observe
> the lowest probability that occurs, and approximate the value of K
> from that; or is there some other method?

My thought was we'd try different values and see how it affected retrieval performance and runtime, and either choose an appropriate value, or make it user-specifiable with a sensible default.

> I asked the same question on Stack Overflow:
>
> http://goo.gl/ykwN4
>
> They suggested:
>
> > "Could you simply take the negative of the logarithm? Since you are
> > dealing with probabilities (i.e. values <= 1), the logarithm is
> > always negative, so negating it will always make it positive."
>
> But this approach won't be a good idea, because large values would
> indicate low probabilities and small values would indicate high
> probabilities. Hence the matcher would tend to skip some good
> documents from the ranked list due to their lower weight.

Yes, negating won't help, unless we also change the matcher to look for low values not high ones, which would be quite tricky I think. It also means we'd end up with a value between 0 and infinity, whereas we want to have an upper bound if at all possible.

Cheers,
    Olly
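[Editor's note: the equivalence Olly points out — clamping each log(K.Pi) at 0 is the same as substituting max(Pi, 1/K) into the original product — can be checked numerically. This is an illustrative sketch, not Xapian code; both function names are hypothetical.]

```cpp
#include <algorithm>
#include <cmath>
#include <vector>

// sum_i max(log(K * Pi), 0) is the log of prod_i (K * max(Pi, 1/K)):
// any Pi below 1/K contributes 0 to the sum, exactly as if it had been
// replaced by 1/K (making its factor K * (1/K) = 1) in the product.
double clamped_log_sum(const std::vector<double>& p, double K)
{
    double w = 0.0;
    for (double pi : p)
        w += std::max(std::log(K * pi), 0.0);
    return w;
}

double substituted_product(const std::vector<double>& p, double K)
{
    double prod = 1.0;
    for (double pi : p)
        prod *= K * std::max(pi, 1.0 / K);
    return prod;
}
```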