Abhishek Singh Kushwah
2014-Nov-23 08:29 UTC
[Xapian-devel] GSoc Project Idea Weighting Schemes (Ranking)
Hi, I am Abhishek.

Currently Xapian::Weight uses the BM25 scheme by default. Many models, such as the Divergence from Randomness (DfR) family of models, the Unigram Language Model and the Bi-gram Language Model, were implemented two years ago in GSoC 2012 but have not yet been merged to master.

The project could cover new weighting schemes, or improvements to the implementation of the previous models, with a view to changing the default scheme from BM25 to SMART, with reference to this paper: www.aclweb.org/anthology/P10-1141

After skimming through the schemes implemented in Xapian::Weight, there seems to be considerable scope for editing the algorithms to increase efficiency and speed, and for implementing new ones.

I would need the mentors' point of view regarding new schemes for the project with respect to SMART and others.

Thank you
Olly Betts
2014-Nov-23 21:50 UTC
[Xapian-devel] GSoc Project Idea Weighting Schemes (Ranking)
On Sun, Nov 23, 2014 at 01:59:53PM +0530, Abhishek Singh Kushwah wrote:
> The new weighing schemes or improvement in implementing the previous models
> to change the default scheme of BM25 from SMART with reference to this
> paper www.aclweb.org/anthology/P10-1141

I don't see any motivation there for changing the default - a quote from that paper actually explicitly notes that BM25 is much more successful in general, while performing similarly in this particular case:

    Of interest is the fact that although the BM25 tf algorithm has proved much more successful in IR, the same doesn't apply in this setting and its accuracy is similar to the simpler augmented tf approach.

> After skimming through the schemes implemented in Xapian::weight. There
> seems a considerable hope in editing the algorithms to increase efficiency
> and speed and implementing new ones in use.

Where do you think speed and efficiency can be improved?

> I would need mentors point of view regarding new schemes for the project
> wrt SMART and others.

Schemes need to be possible to sanely implement within Xapian's weighting framework. Needing to track more statistics is probably OK though (e.g. LM required adding support for getting the number of unique terms in each document).

Schemes which have been evaluated and shown to be promising (even if in a restricted domain) are more interesting. We aren't looking for students to develop their own weighting scheme from scratch as part of a GSoC project (someone proposed this in a previous GSoC).

Were you the "abhishek" asking recently on IRC about installing on mingw? If so and you didn't already resolve that, showing us the error would probably enable us to help.

Cheers,
    Olly
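
For readers unfamiliar with the two term-frequency components contrasted in the quoted passage, the following is a minimal sketch of the textbook forms of SMART "augmented tf" and the BM25 tf component. It is not Xapian code and is not taken from the thread: the function names are hypothetical, the parameter values k1=1.2 and b=0.75 are illustrative only, and returning 0 for an absent term is an assumption made here for convenience.

```python
def augmented_tf(tf, max_tf):
    """SMART 'augmented' tf: 0.5 + 0.5 * tf / max_tf.

    tf is the term's frequency in the document, max_tf the frequency
    of the most frequent term in that document. Returning 0.0 for an
    absent term is an assumption of this sketch.
    """
    if tf <= 0 or max_tf <= 0:
        return 0.0
    return 0.5 + 0.5 * tf / max_tf


def bm25_tf(tf, doc_len, avg_doc_len, k1=1.2, b=0.75):
    """Textbook BM25 tf component with document length normalisation.

    k1 and b are illustrative values, not defaults from any library.
    """
    if tf <= 0:
        return 0.0
    # Length normalisation: longer-than-average documents are penalised.
    norm = k1 * ((1.0 - b) + b * doc_len / avg_doc_len)
    return (k1 + 1.0) * tf / (norm + tf)


if __name__ == "__main__":
    # Compare how the two components grow with tf in an average-length document.
    for tf in (1, 2, 5, 20):
        print(tf, augmented_tf(tf, max_tf=20), bm25_tf(tf, 100, 100))
```

Both components are bounded as tf grows: augmented tf lies in [0.5, 1] for terms that occur, while the BM25 component saturates below k1 + 1 and additionally accounts for document length.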
Abhishek Singh Kushwah
2014-Nov-24 05:09 UTC
[Xapian-devel] GSoc Project Idea Weighting Schemes (Ranking)
Thanks Olly,

The Windows installation problem has been resolved.

Certainly BM25 offers stability and is comparatively fast too, which is why it is preferred over the others. From what I understand of your point, no new schemes need to be implemented for now, at least in this GSoC. So probably the default scheme needs to be improved, and the previously implemented schemes in restricted domains need to be brought forward.

Probably you are thinking of improvements to the Unigram and Bi-gram Language Models implemented in GSoC 2012. If that's the case, your guidance on more appropriate goals would be appreciated.

The feature you mentioned, adding support for getting the number of unique terms, is a great idea and could possibly be implemented in this GSoC for the purpose of getting more statistics.

-Abhishek