thr3ads.net - Xapian devel - [Xapian-devel] Indexing INEX collection for your GSoC Project [May 2014]

Parth Gupta

2014-May-19 09:58 UTC

[Xapian-devel] Indexing INEX collection for your GSoC Project

Hi Aarsh,

I see we keep missing each other on IRC, so I am replying to you here.

It would be a good idea for all the GSoC students who need external
datasets for testing and development to use the same collection.

I recommend the INEX collection, which will also be used by the LTR students.
I am not sure you have got the correct collection, because I read you
mentioning IMDB. The collection I was referring to is the Wikipedia
collection (NOT IMDB), which is available at:
http://www.mpi-inf.mpg.de/departments/d5/software/inex/

Some details are available on the LTR project idea page:
http://trac.xapian.org/wiki/GSoCProjectIdeas#Project:LearningtoRank

To index these XML documents, you can simply treat them as HTML by passing
"--mime-type xml:text/html" to omindex. This is not the correct way to do
it, but it does the job and gets you started.

There are also some efficiency notes on my Journal page from GSoC 2011
(see coding week 3): http://trac.xapian.org/wiki/GSoC2011/LTR/Journal

For the queries, you can use the Topics distributed with INEX for the "Ad-hoc
Retrieval Task" (as mentioned on the LTR project idea page).

You can write your own iterator to parse and iterate over the query file. See
the prepare_training_file() method in xapian-letor (
https://github.com/parthg/xapian/blob/master/xapian-letor/letor_internal.cc#L356)
which does that.
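
As a rough illustration of such an iterator (this is not the code in
prepare_training_file(), just a minimal Python sketch): it assumes the topics
file is well-formed XML in which each query is a <topic id="..."> element
with a <title> child holding the query text. That layout and the name
iter_topics are assumptions made for the example, so check them against the
topics file you actually download.

```python
import io
import xml.etree.ElementTree as ET


def iter_topics(source):
    """Yield (topic_id, query_text) pairs from a topics file.

    `source` is a filename or a binary file object.
    ASSUMPTION: each query is a <topic id="..."> element with a <title>
    child; adjust the tag and attribute names to match the real file.
    """
    # iterparse streams the file, so large topic sets need not fit in memory.
    for _event, elem in ET.iterparse(source, events=("end",)):
        if elem.tag == "topic":
            topic_id = elem.get("id", "")
            title = (elem.findtext("title") or "").strip()
            if title:
                yield topic_id, title
            elem.clear()  # release the subtree we have finished with


if __name__ == "__main__":
    # Tiny in-memory sample, invented purely for the demonstration.
    sample = io.BytesIO(
        b'<topics><topic id="1"><title>information retrieval</title></topic>'
        b'</topics>'
    )
    for topic_id, query in iter_topics(sample):
        print(topic_id, query)
```

Each yielded pair can then be handed to whatever runs the query, which keeps
the parsing separate from the retrieval code.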

If you want to consider a large query set, then you might be interested in
the Million Query Set (http://trec.nist.gov/data/million.query09.html), which
contains 40k web queries. If you need an even larger set, then go for the AOL
Query Logs (http://jeffhuang.com/search_query_logs.html), which contain 36M
queries.

Cheers,
Parth.

Olly Betts

2014-May-19 11:26 UTC

[Xapian-devel] Indexing INEX collection for your GSoC Project

On Mon, May 19, 2014 at 11:58:36AM +0200, Parth Gupta wrote:
> For indexing these XML documents, simply you should treat them as HTML by
> doing "--mime-type xml:text/html". Although this is not the correct way but
> it does the job and gets you started.

While that's fine for the letor projects, the point of Aarsh's work on
this is to produce a performance test suite, so indexing the data via
omindex probably isn't a good approach.
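
The thread does not say what to use instead. One possible alternative,
sketched below purely as an assumption, is to index through the Xapian API
directly so that the timed region covers only the indexing work. The sketch
uses the Xapian Python bindings; extract_text and index_files are
hypothetical helper names, and it assumes each file is well-formed XML that
becomes one document.

```python
import time
import xml.etree.ElementTree as ET


def extract_text(xml_bytes):
    """Return the text content of an XML document with the tags dropped.

    ASSUMPTION: the input is well-formed XML; real collection files may
    need more forgiving handling.
    """
    root = ET.fromstring(xml_bytes)
    # Join all text nodes, then collapse runs of whitespace.
    return " ".join(" ".join(root.itertext()).split())


def index_files(db_path, paths):
    """Index each file as one Xapian document; return elapsed seconds."""
    import xapian  # needs the Xapian Python bindings to be installed

    db = xapian.WritableDatabase(db_path, xapian.DB_CREATE_OR_OPEN)
    termgen = xapian.TermGenerator()
    termgen.set_stemmer(xapian.Stem("en"))

    start = time.perf_counter()
    for path in paths:
        with open(path, "rb") as f:
            text = extract_text(f.read())
        doc = xapian.Document()
        doc.set_data(path)          # keep the filename for display
        termgen.set_document(doc)
        termgen.index_text(text)
        db.add_document(doc)
    db.commit()                     # include the flush in the timing
    return time.perf_counter() - start
```

Note that text extraction sits inside the timed loop here; for a stricter
measurement of Xapian alone, the text could be extracted up front.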

Cheers,
    Olly
