aarsh shah
2013-Mar-11 20:25 UTC
[Xapian-devel] Implementation of the PL2 weighting scheme of the DFR Framework
Hello guys.I am working on implementing the PL2 weighting scheme of the DFR framework by Gianni Amati. It uses the Poisson approximation of the Binomial as the probabilistic model (P), the Laplace law of succession to calculate the after effect of sampling or the risk gain (L) and within document frequency normalization H2(2) (as proposed by Amati in his PHD thesis). The formula for w(t,d) in this scheme is given by::- w(t,d) = wqf * L * P where wqf = within query frequency L = Laplace law of after effect sampling =1 / (wdfn + 1) P = wdfn * log (wdfn / lamda) + (lamda - wdfn) log(e) + 0.5 * log (2 * pi * wdfn) wdfn = wdf * (1+c * log(average length of document in database / length of document d )) (H2 Normalization ) lamda = mean of the Poisson distrubution = Collection frequency of the term / Size of the database and the base of all logarithms is 2. c is a constant parameter The code is almost complete but I am stuck at a few places which are as follows:- 1.) Calculating the upper bound of the weight for the get_maxpart( ) function This one calculation has been giving me sleepless nights for a couple of days now.The problem is that L is a decreasing function for wdfn and P as per my calculations is a increasing function . I arrived at this conclusion because the derivative of L is always negative and the derivative of P is always positive (In the derivative of P, log (lamda) will always be negative as in his thesis,Amati states that for the PL2 model, collection frequency of term << Collection Size and so lamda will always be less than one .) .So, in order to find the upper bound,I simply substituted wdf=1 for L and used wdf = wdf_upper_bound for P and multiplied them by using upper doc length bound and lower doc length for wdfn of L and P respectively.However,this does not give that tight a bound.Not a word has been spoken about upper bounds on DFR weights in Amati's thesis or on his papers on DFR .I even tried differentiating the product of L and P and equated that to zero as discussed on IRC but that landed me with a complicated equation with no answer in sight.Please tell me what you'll think. 2.) Documentation This scheme belongs to a family of weighting schemes.Please do let me if I need to write any additional documentation to introduce this new framework and the new weighting schemes. Please do let know if you'll need additional information to help me out.Want to finish this scheme and move on the DPH scheme,which is yet another interesting weighting scheme of the DFR framework. -Regards -Aarsh -------------- next part -------------- An HTML attachment was scrubbed... URL: <http://lists.xapian.org/pipermail/xapian-devel/attachments/20130312/4dca0543/attachment.htm>
Olly Betts
2013-Mar-12 14:08 UTC
[Xapian-devel] Implementation of the PL2 weighting scheme of the DFR Framework
On Tue, Mar 12, 2013 at 01:55:51AM +0530, aarsh shah wrote:> Hello guys.I am working on implementing the PL2 weighting scheme of the DFR > framework by Gianni Amati. > It uses the Poisson approximation of the Binomial as the probabilistic > model (P), the Laplace law of succession to calculate the after effect of > sampling or the risk gain (L) and within document frequency normalization > H2(2) (as proposed by Amati in his PHD thesis). > > The formula for w(t,d) in this scheme is given by::- > > w(t,d) = wqf * L * P > where > wqf = within query frequency > L = Laplace law of after effect sampling =1 / (wdfn + 1) > P = wdfn * log (wdfn / lamda) + (lamda - wdfn) log(e) + 0.5 * log > (2 * pi * wdfn) > wdfn = wdf * (1+c * log(average length of document in database / > length of document d )) (H2 Normalization ) > lamda = mean of the Poisson distrubution = Collection frequency of > the term / Size of the database > and the base of all logarithms is 2. > c is a constant parameterI think the formula you have there for wdfn is wrong - http://ir.dcs.gla.ac.uk/wiki/FormulasOfDFRModels gives LaTeX source for what it calls tfn, which is your wdfn: tfn=tf\cdot\log_2(1+c\cdot\frac{avg\_l}{l}) So the '1+c*' bit is *inside* the log, i.e.: wdfn = wdf * log(1 + c * average_doc_length/length_of_d) That agrees with Amati's slides from ESSIR 2007, except that he says "ln" not "log" in the tfn formula, which suggests log to the base e. The base of the log would only make a factor difference to wdfn, but that seems like it would affect the overall weight in a more complex way. Amati's phd thesis also has ln (equation 6.10 on page 118), and that ln is necessary for the H1 approximating H2 for large l observation to work (end of page 118).> The code is almost complete but I am stuck at a few places which are as > follows:- > > 1.) Calculating the upper bound of the weight for the get_maxpart( ) > function > This one calculation has been giving me sleepless nights for > a couple of days now.The problem is that L is > a decreasing function for wdfn and P as per my calculations > is a increasing function . I arrived at this conclusion > because the derivative of L is always negativeOK so far.> and the > derivative of P is always positive (In the derivative of P, log (lamda) > will always be negative as in his thesis,Amati states that > for the PL2 model, collection frequency of term << Collection > Size and so lamda will always be less than one .)I'm not sure I totally follow you here. Differentiating P with respect to wdf I get: Q*{log(wdfn) + wdfn/(2*ln(wdfn)) - log lambda - log e + 1/(4*wdfn)} Where Q = log(1+c*avg_l/l) which is > 0 since avg_l, l, and (I assume) c are all > 0, so Q doesn't affect when this does/doesn't become 0. If lambda < 1, then as you say (- log lamba) is positive, but there's the -log(e) and also can't wdfn be < 1, in which case log(wdfn) and ln(wdfn) will be negative.> .So, in > order to find the upper bound,I simply substituted wdf=1 for L > and used wdf = wdf_upper_bound for P and multiplied them by > using upper doc length bound and lower doc length > for wdfn of L and P respectively.However,this does not give > that tight a bound.Having an upper bound is a good start.> Not a word has been spoken about > upper bounds on DFR weights in Amati's thesis or on his > papers on DFR .The upper bounds are only needed because of some optimisations we do, so I'm not surprised if they aren't discussed elsewhere.> I even tried differentiating the product of > L and P and equated that to zero as discussed on IRC but that > landed me with a complicated equation with no answer > in sight.Please tell me what you'll think.I got that down to this, but indeed it doesn't look easy to solve: log(w) + w*(w+1)*log(e)/log(w) = log(2*pi)+2*log(e*lamba)+lamba*log(e) I'm not all that confident that's even right either.> 2.) Documentation > > This scheme belongs to a family of weighting schemes.Please > do let me if I need to write any additional documentation > to introduce this new framework and the new weighting > schemes.I don't think we need a lengthy document about them, but it would be useful to refer people to where the schemes are defined, and also briefly mention when they might want to use a particular scheme - for example, PL2 is apparently good if you want early precision. Cheers, Olly