Aarsh Shah
2014-Mar-04 15:46 UTC
[Xapian-devel] Test Dataset for performance and accuracy analysis
Hi Parth,

I implemented the DFR weighting schemes in Xapian as part of GSoC last year under Olly's mentorship. This year I want to work on analysing and optimising the performance of the DFR schemes and comparing them with BM25. I also want to profile the query expansion schemes and test both the relevance (precision and recall) and the speed (time taken) of the algorithms.

For this, I need a well-defined dataset: a considerable amount of textual data, query logs containing queries that can be run against it, and a set of relevant (expected) documents that the actual results can be compared with to measure the relevance of the schemes. Could you help me with this?

Thank you so much for your time.

-Regards
-Aarsh
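(For concreteness: comparing weighting schemes in Xapian comes down to swapping the Enquire object's weighting scheme and timing the match. The Python sketch below is a minimal illustration of that idea, not code from this thread; it assumes the Python bindings, a Xapian build recent enough to include the DFR schemes such as PL2Weight, and a pre-built database at the hypothetical path "inex.db".)

    import time
    import xapian

    db = xapian.Database("inex.db")  # hypothetical path to an indexed collection

    def run_query(text, weight):
        enquire = xapian.Enquire(db)
        qp = xapian.QueryParser()
        qp.set_database(db)
        enquire.set_query(qp.parse_query(text))
        enquire.set_weighting_scheme(weight)  # e.g. BM25Weight or a DFR scheme
        start = time.time()
        mset = enquire.get_mset(0, 10)        # retrieve the top 10 results
        return [m.docid for m in mset], time.time() - start

    for name, weight in [("BM25", xapian.BM25Weight()),
                         ("PL2 (DFR)", xapian.PL2Weight())]:
        docids, elapsed = run_query("information retrieval", weight)
        print("%s: %.4fs %r" % (name, elapsed, docids))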
Parth Gupta
2014-Mar-05 11:13 UTC
[Xapian-devel] Test Dataset for performance and accuracy analysis
Hi Aarsh,

Yes, it's very important to test the implemented algorithms on benchmark collections. Most of the evaluation forums (TREC, CLEF, INEX, FIRE, NTCIR) release corresponding datasets. The most suitable one for you would be an ad-hoc collection, which comprises a document collection, topics (the query set), and qrels (relevance judgements).

As these evaluation forums put a lot of effort (and money) into preparing them, the datasets are not easily and freely available. Mostly they are free for research if you register with the forum or participate in its tracks.

I see that the INEX ad-hoc collections for 2009 and 2010 are available on registration, so you can register with them, log in, and download the dataset along with the queries and qrels. The link is:

https://inex.mmci.uni-saarland.de/

Use the ad-hoc collection; it was also used for testing the Letor implementation and BM25 during GSoC 2011 (http://trac.xapian.org/wiki/GSoC2011/LTR/Notes#IREvaluationofLetorrankingscheme).

Cheers,
Parth.
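(A note on the qrels Parth mentions: TREC-style qrels files carry one judgement per line, whitespace-separated — topic id, an iteration field (usually 0), document id, relevance grade. A minimal Python sketch for loading them into per-topic relevant sets follows; the filename is hypothetical, and INEX's exact layout may differ slightly, so check the README that ships with the collection.)

    from collections import defaultdict

    def load_qrels(path):
        # Map each topic id to the set of document ids judged relevant.
        relevant = defaultdict(set)
        with open(path) as f:
            for line in f:
                topic, _iteration, docno, grade = line.split()
                if int(grade) > 0:  # grade 0 means judged non-relevant
                    relevant[topic].add(docno)
        return relevant

    qrels = load_qrels("inex-2010.qrels")  # hypothetical filename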
Aarsh Shah
2014-Mar-05 22:42 UTC
[Xapian-devel] Test Dataset for performance and accuracy analysis
Hi Parth,

I think this solves my problem. One part of my project is to build a performance test module that tests not only the speed but also the relevance of the weighting schemes, to determine whether we could use a better default weighting scheme. Gaurav has already written an evaluation module for Xapian, so I think understanding how to use it, and then feeding it the data you've suggested once I understand the structure of that data, will do the job. I will definitely come back to you if I need more help on the theory side of judging relevance.

Thank you so much for your time. :)

-Regards
-Aarsh
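(For completeness: once a ranked run and the qrels are loaded, the precision and recall figures discussed in this thread reduce to simple set arithmetic over the top-k results. The sketch below is illustrative only, and is independent of Gaurav's evaluation module, whose API is not shown in this thread; the ranked list must use the same document identifiers as the qrels.)

    def precision_recall(ranked_docids, relevant, k=10):
        # Precision and recall of the top-k results against a judged set.
        retrieved = set(ranked_docids[:k])
        hits = len(retrieved & relevant)
        precision = hits / float(k)
        recall = hits / float(len(relevant)) if relevant else 0.0
        return precision, recall

    # e.g. precision_recall(run_ids, qrels["2010001"], 10), where run_ids
    # comes from a search run and qrels from load_qrels() above; the topic
    # id "2010001" is only an example.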