Hello,

I would like to decide which dataset should be used for the Letor stabilisation project.

I think the 2009 INEX Wikipedia Collection <http://www.mpi-inf.mpg.de/departments/databases-and-information-systems/software/inex/> should work fine. It is a collection of 2,666,190 XML articles with 115 topics <http://inex.mmci.uni-saarland.de/protected/adhoc/2009-topics.zip> and 50,275 qrel <http://inex.mmci.uni-saarland.de/protected/adhoc/2009-inex_eval.zip> labels. Its uncompressed size is 50.75 GB (5.52 GB compressed).

A similar alternative is the 2013 INEX Wikipedia LOD Collection <http://inex-lod.mpi-inf.mpg.de/2013/>. It is a collection of 12,216,109 XML articles with 144 topics <http://inex.mmci.uni-saarland.de/protected/dc/2013-ld-adhoc-topics.xml> and 14,400 qrel <http://inex.mmci.uni-saarland.de/protected/dc/2013-ld-adhoc-qrels.zip> labels. Its compressed size is 11.12 GB. The INEX 2009 Collection is a subset of it.

If there are any more recent or better datasets that could be used, please let me know.

Thanks,
Ayush
On Sat, May 14, 2016 at 04:51:57PM +0530, Ayush Tomar wrote:
> I wanted to decide the dataset that should be used for Letor stabilisation
> project.

Is this for evaluating the various letor approaches? For unit tests you'll need to generate your own test data (partly so you can control it better to do validation properly, but also because the licenses almost never work).

Parth should be able to advise on suitable datasets for evaluating letor.

J

--
James Aylett, occasional trouble-maker
xapian.org
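A minimal sketch of what "generate your own test data" could look like, assuming a single hypothetical feature and graded relevance labels 0..2. The function name `make_synthetic_judgements`, the row layout, and the noise range are illustrative assumptions and not part of Xapian or its letor module.

```python
import random


def make_synthetic_judgements(num_queries=3, docs_per_query=5, seed=42):
    """Return (query_id, doc_id, feature, label) rows with a known ideal order.

    Everything here is hypothetical test data: the ids, the single feature,
    and the 0..2 label scale are chosen only so the expected ranking is known.
    """
    rng = random.Random(seed)  # fixed seed keeps the test data reproducible
    rows = []
    for q in range(1, num_queries + 1):
        for d in range(1, docs_per_query + 1):
            label = rng.randint(0, 2)
            # Noise stays below the gap between labels (1.0), so sorting by
            # the feature always recovers the label order.
            feature = label + rng.uniform(0.0, 0.4)
            rows.append((f"q{q}", f"q{q}-d{d}", feature, label))
    return rows
```

Because the correct ordering is known by construction, a unit test can check that a trained ranker sorts each query's documents by non-increasing label, with no dependency on a licensed collection.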
I used a subset of INEX 2009 with around 2M documents (some details here: https://trac.xapian.org/wiki/GSoC2011/LTR/Notes#IREvaluationofLetorrankingscheme) and it worked fine. If you have access to it, it should work for most of our purposes. Because the INEX documents have rich XML metadata, letor can benefit from the separate fields (title, body, etc.).

For unit testing, as James mentions, go with automated tests in a controlled environment. Use the INEX dataset for explicit evaluation, and to check that everything works without breaking at large scale.

Cheers,
Parth

On Sat, May 14, 2016 at 9:57 PM, James Aylett <james-xapian at tartarus.org> wrote:
> On Sat, May 14, 2016 at 04:51:57PM +0530, Ayush Tomar wrote:
>
> > I wanted to decide the dataset that should be used for Letor stabilisation
> > project.
>
> Is this for evaluating the various letor approaches? For unit tests
> you'll need to generate your own test data (partly so you can control
> it better to do validation properly, but also because the licenses
> almost never work).
>
> Parth should be able to advise on suitable datasets for evaluating
> letor.
>
> J
>
> --
> James Aylett, occasional trouble-maker
> xapian.org
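For the explicit evaluation on INEX, a ranking can be scored against the qrel labels with a standard metric. The sketch below computes NDCG@k in Python. The choice of NDCG, the function names, and the in-memory `qrels` dictionary (doc id to graded relevance) are assumptions for illustration; they are not the format of the INEX files or the project's actual evaluation code.

```python
import math


def dcg(gains):
    """Discounted cumulative gain for a list of graded relevance values."""
    return sum((2 ** g - 1) / math.log2(i + 2) for i, g in enumerate(gains))


def ndcg_at_k(ranked_doc_ids, qrels, k=10):
    """NDCG@k for one topic.

    ranked_doc_ids: doc ids in the order the ranker returned them.
    qrels: hypothetical dict mapping doc id -> relevance label;
           unjudged documents are treated as non-relevant (0).
    """
    gains = [qrels.get(doc_id, 0) for doc_id in ranked_doc_ids[:k]]
    ideal = sorted(qrels.values(), reverse=True)[:k]
    ideal_dcg = dcg(ideal)
    # A topic with no relevant documents has no meaningful NDCG; return 0.0.
    return dcg(gains) / ideal_dcg if ideal_dcg > 0 else 0.0
```

Averaging this score over all topics gives a single number for comparing letor rankers on the same collection.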