Parth Gupta
2014-May-19 09:58 UTC
[Xapian-devel] Indexing INEX collection for your GSoC Project
Hi Aarsh,

We keep missing each other on IRC, so I am replying here.

It would be a good idea for all the GSoC students who need external datasets for testing and development to use the same collection. I recommend the INEX collection, which the LTR students will also use. I am not sure whether you have the correct collection, because I read that you mentioned IMDB. The collection I was referring to is the Wikipedia collection (NOT IMDB), available at: http://www.mpi-inf.mpg.de/departments/d5/software/inex/

Some details are on the LTR project idea page: http://trac.xapian.org/wiki/GSoCProjectIdeas#Project:LearningtoRank

To index these XML documents, you can simply treat them as HTML by passing "--mime-type xml:text/html". This is not the correct way, but it does the job and gets you started. There are also some efficiency notes on my Journal page from GSoC 2011 (see coding week 3): http://trac.xapian.org/wiki/GSoC2011/LTR/Journal

For the queries, you can use the Topics distributed with INEX for the "Ad-hoc Retrieval Task" (as mentioned on the LTR project idea page). You can write your own iterator to parse and iterate over the query file. See the prepare_training_file() method in xapian-letor (https://github.com/parthg/xapian/blob/master/xapian-letor/letor_internal.cc#L356), which does that.

If you want a larger query set, you might be interested in the Million Query Set (http://trec.nist.gov/data/million.query09.html), which contains 40k web queries. If you need an even larger set, go for the AOL Query Logs (http://jeffhuang.com/search_query_logs.html), which contain 36M queries.

Cheers,
Parth.
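As a rough illustration of the "write your own iterator" suggestion, here is a minimal Python sketch that walks a topics file and yields one query per topic. The element and attribute names (`topic`, `id`, `title`) are assumptions about the topics file layout, not something confirmed in this thread, so check them against the file you actually download. The function name `iter_topics` is hypothetical and unrelated to prepare_training_file().

```python
import xml.etree.ElementTree as ET


def iter_topics(source):
    """Yield (topic_id, title) pairs from a topics file.

    `source` is a filename or a binary file object. The tag names used
    here ("topic", "title") and the "id" attribute are ASSUMED; adjust
    them to match the real topics file.
    """
    # iterparse streams the file, so a large query set is never held
    # in memory all at once.
    for _event, elem in ET.iterparse(source, events=("end",)):
        if elem.tag != "topic":
            continue
        topic_id = elem.get("id", "")
        title = (elem.findtext("title") or "").strip()
        if title:
            # Topics without a usable title are skipped.
            yield topic_id, title
        # Release the children of this topic once it has been consumed.
        elem.clear()
```

Each yielded title can then be handed to whatever query parsing step the project uses.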
Olly Betts
2014-May-19 11:26 UTC
[Xapian-devel] Indexing INEX collection for your GSoC Project
On Mon, May 19, 2014 at 11:58:36AM +0200, Parth Gupta wrote:
> For indexing these XML documents, simply you should treat them as HTML by
> doing "--mime-type xml:text/html". Although this is not the correct way but
> it does the job and gets you started.

While that's fine for the letor projects, the point of Aarsh's work on this is to produce a performance test suite, so indexing the data via omindex probably isn't a good approach.

Cheers,
Olly
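One possible shape for indexing without omindex, sketched in Python under several assumptions: each document is a well-formed XML file whose text content is all that needs indexing, English stemming is wanted, and the Xapian Python bindings are installed. The helper names `extract_text` and `index_documents` are hypothetical, and this is not necessarily the approach the performance test suite should take. Documents that rely on entities not declared in the file would need extra handling before `ET.fromstring` accepts them.

```python
import time
import xml.etree.ElementTree as ET


def extract_text(xml_bytes):
    """Return the text content of one XML document, whitespace-normalised."""
    root = ET.fromstring(xml_bytes)
    # itertext() walks every text node; split/join collapses runs of
    # whitespace left behind by the markup.
    return " ".join(" ".join(root.itertext()).split())


def index_documents(db_path, documents):
    """Index an iterable of XML byte strings; return (count, seconds).

    ASSUMPTION: the Xapian Python bindings are available. The import is
    done here so that extract_text() can be used without them.
    """
    import xapian

    db = xapian.WritableDatabase(db_path, xapian.DB_CREATE_OR_OPEN)
    termgen = xapian.TermGenerator()
    termgen.set_stemmer(xapian.Stem("en"))

    start = time.perf_counter()
    count = 0
    for xml_bytes in documents:
        text = extract_text(xml_bytes)
        doc = xapian.Document()
        doc.set_data(text[:200])  # short snippet kept as document data
        termgen.set_document(doc)
        termgen.index_text(text)
        db.add_document(doc)
        count += 1
    db.commit()
    return count, time.perf_counter() - start
```

Timing the loop directly like this keeps file-type detection and the other work omindex does out of the measurement, which seems closer to what a performance test wants to observe.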