thr3ads.net - Xapian devel - GSoC 2016 Letor dataset discussion [May 2016]

Ayush Tomar

2016-May-14 11:21 UTC

GSoC 2016 Letor dataset discussion

Hello,

I would like to decide which dataset should be used for the Letor
stabilisation project.

I think the 2009 INEX Wikipedia Collection
<http://www.mpi-inf.mpg.de/departments/databases-and-information-systems/software/inex/>
should work fine. It is a collection of 2,666,190 XML articles, with 115 topics
<http://inex.mmci.uni-saarland.de/protected/adhoc/2009-topics.zip> and 50,275
qrel labels <http://inex.mmci.uni-saarland.de/protected/adhoc/2009-inex_eval.zip>.
Its uncompressed size is 50.75 GB (5.52 GB compressed).

A similar alternative is the 2013 INEX Wikipedia LOD Collection
<http://inex-lod.mpi-inf.mpg.de/2013/>. It is a collection of 12,216,109 XML
articles, with 144 topics
<http://inex.mmci.uni-saarland.de/protected/dc/2013-ld-adhoc-topics.xml> and
14,400 qrel labels
<http://inex.mmci.uni-saarland.de/protected/dc/2013-ld-adhoc-qrels.zip>.
It has a compressed size of 11.12 GB. The INEX 2009 Collection is a
subset of it.

If there are any more recent or better datasets that can be used, please let
me know.

Thanks,
Ayush

James Aylett

2016-May-14 16:27 UTC

On Sat, May 14, 2016 at 04:51:57PM +0530, Ayush Tomar wrote:
> I wanted to decide the dataset that should be used for Letor stabilisation
> project.
Is this for evaluating the various letor approaches? For unit tests
you'll need to generate your own test data (partly so you can control
it better to do validation properly, but also because the licenses
almost never work).
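
To illustrate the point about generating your own test data, here is a
minimal Python sketch. It is not part of xapian-letor: the function name,
the filler vocabulary and the three-level labelling rule are all
assumptions made for illustration. Each document's relevance label is fixed
by construction (the number of times the query term occurs), so a test can
check a ranker against a known answer.

```python
import random


def make_synthetic_corpus(num_docs=20, query_term="xapian", seed=42):
    """Return (doc_id, text, label) triples with labels known by construction.

    Hypothetical helper: label 0, 1 or 2 equals the number of times
    query_term occurs in the document text.
    """
    rng = random.Random(seed)  # fixed seed keeps the test data reproducible
    filler = ["index", "search", "rank", "term", "query", "weight"]
    corpus = []
    for doc_id in range(num_docs):
        words = [rng.choice(filler) for _ in range(10)]
        label = doc_id % 3
        # Pick distinct positions so the term count always equals the label.
        for pos in rng.sample(range(len(words)), label):
            words[pos] = query_term
        corpus.append((doc_id, " ".join(words), label))
    return corpus
```

Because the seed is fixed, the same corpus is produced on every run, and
no third-party dataset licence is involved.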

Parth should be able to advise on suitable datasets for evaluating
letor.

J

-- 
  James Aylett, occasional trouble-maker
  xapian.org

Parth Gupta

2016-May-14 18:09 UTC

I used a subset of INEX 2009 with around 2M documents (some details here:
https://trac.xapian.org/wiki/GSoC2011/LTR/Notes#IREvaluationofLetorrankingscheme)
and it worked fine. If you have access to it, it should work for most of our
purposes.

As the INEX documents have rich XML metadata, Letor can benefit from
per-field features (title, body, etc.).

For unit testing, as James mentions, go with automated tests in a
controlled environment. Use the INEX dataset for explicit evaluation and to
check that everything works without breaking at large scale.
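
As a rough sketch of what explicit evaluation against qrels could look
like, the Python below computes NDCG@k for one topic. It makes several
assumptions: the qrels are taken to be in the four-column TREC style
(topic, iteration, document id, relevance), which may not match the actual
INEX files, the gain is linear in the relevance grade, and all function
names are made up for illustration.

```python
import math
from collections import defaultdict


def parse_qrels(lines):
    """Parse assumed TREC-style lines: 'topic iteration doc_id relevance'."""
    qrels = defaultdict(dict)
    for line in lines:
        parts = line.split()
        if len(parts) != 4:
            continue  # skip blank or malformed lines
        topic, _iteration, doc_id, rel = parts
        qrels[topic][doc_id] = int(rel)
    return qrels


def dcg(gains):
    """Discounted cumulative gain with linear gain and a log2 rank discount."""
    return sum(g / math.log2(rank + 2) for rank, g in enumerate(gains))


def ndcg_at_k(ranked_doc_ids, judged, k=10):
    """NDCG@k for one topic; documents without a judgement count as 0."""
    gains = [judged.get(doc_id, 0) for doc_id in ranked_doc_ids[:k]]
    ideal = sorted(judged.values(), reverse=True)[:k]
    ideal_dcg = dcg(ideal)
    return dcg(gains) / ideal_dcg if ideal_dcg > 0 else 0.0
```

A ranking produced by a Letor ranker for a topic would be passed in as
`ranked_doc_ids`, and the per-topic scores averaged over all topics.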

Cheers
Parth

