aarsh shah
2013-Feb-07 20:07 UTC
[Xapian-devel] Ideas for allowing specification of weighing scheme for Eset
Hi all :)

I am working on a hack that will let the user specify a weighting scheme (along with its parameters, if they do not want the default values) for building the Eset, rather than using the hard-coded TradWeight scheme with the default k=1. Olly had suggested that we can probably get better terms (a more relevant Eset) for query expansion if we use something like BM25, or allow the user to supply their own scheme, for ranking the terms.

I read the code for the proxy, internal and iterator classes of Eset and Mset to get a feel for how those sets work. I then traced through Enquire::get_eset() (understood it well, other than how a TermList tree is built) and Enquire::get_mset() (didn't understand this one completely; I got lost in MultiMatch::get_mset()). I also read the code for Xapian::Weight (both the proxy and internal classes) and for the BM25Weight and TradWeight classes.

The hack now seems fairly straightforward. As far as ranking terms to build an Eset is concerned, the only difference between BM25 and TradWeight is that the denominator (k*L + f) is replaced by (k1*(b*L + (1-b)) + f), with f the within-document frequency and L the normalised document length. It seems to me that because we are ranking terms based on documents (rather than the other way round), we do not need components like q/(k3+q), since we do not want terms already present in the query in the Eset and so the within-query frequency does not matter, or 2*k2*nq/(1+L), since the length of the query is not needed in any way to build the Eset. (Please do correct me if I am wrong about any of the assumptions I've made so far.)
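To make the denominator comparison above concrete, here is a minimal sketch of the two per-document components. All names in it (trad_component, bm25_component, DocStat, Scheme, term_multiplier) are hypothetical and are not Xapian API. Summing the components over the relevant documents into one per-term factor is an assumption made for illustration, and constant factors such as (k1 + 1) are left out because they do not change the relative order of terms.

```cpp
#include <cassert>
#include <vector>

// Hypothetical helpers for illustration only; none of these names are Xapian API.
// f: within-document frequency of the candidate term
// L: normalised document length (document length / average document length)

// TradWeight-style component: f / (k*L + f)
double trad_component(double f, double L, double k) {
    assert(f > 0 && L >= 0 && k >= 0);
    return f / (k * L + f);
}

// BM25-style component: f / (k1*(b*L + (1-b)) + f)
double bm25_component(double f, double L, double k1, double b) {
    assert(f > 0 && L >= 0 && k1 >= 0 && b >= 0 && b <= 1);
    return f / (k1 * (b * L + (1.0 - b)) + f);
}

// Per-document statistics for one candidate expansion term (assumed layout).
struct DocStat {
    double wdf;       // within-document frequency of the term
    double norm_len;  // document length / average document length
};

enum class Scheme { Trad, BM25 };

// Assumed accumulation: add up the chosen component over the relevant
// documents. For Scheme::Trad, k is TradWeight's k and b is ignored;
// for Scheme::BM25, k plays the role of k1.
double term_multiplier(const std::vector<DocStat>& docs, Scheme scheme,
                       double k, double b = 1.0) {
    double sum = 0.0;
    for (const DocStat& d : docs) {
        sum += (scheme == Scheme::Trad)
                   ? trad_component(d.wdf, d.norm_len, k)
                   : bm25_component(d.wdf, d.norm_len, k, b);
    }
    return sum;
}
```

With b = 1 the BM25 component reduces to the TradWeight one with k = k1, so the two schemes only diverge once b < 1 and document lengths differ from the average.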
So, in order to use BM25 for weighting terms for the Eset, we only need to modify the "multiplier" data member of the ExpandStats class; the final weight can then be returned by ExpandWeight::get_weight() as (multiplier*tw), where tw is the same for both weighting schemes. Thus, depending on the weighting scheme and the parameters the user passes to Enquire::get_eset(), the multiplier can be calculated differently. This is fairly simple to implement. However, I have yet to figure out how to let the user specify a weighting scheme they have coded themselves for building the Eset. Please help with that.

This is a summary of everything I've read and planned. Please let me know if I am wrong somewhere or if I can improve any of this. Thank you for the awesome documentation of the code base, it really helped a lot. :)

Once I'm done with this hack and with writing its documentation and tests, my next aim is to start working on incorporating DFR schemes in Xapian, as we do not have them yet and they appear to be very interesting for building both Eset and Mset since they don't require parameters.

-Regards
-Aarsh