Aarsh Shah
2014-Mar-04 15:46 UTC
[Xapian-devel] Test Dataset for performance and accuracy analysis
Hi Parth,

I implemented the DFR weighting schemes in Xapian as part of GSoC last year under Olly's mentorship. This year I want to work on analysing and optimising the performance of the DFR schemes and comparing them with BM25. I also want to profile the query expansion schemes and test both the relevance (precision and recall) and the speed (time taken) of the algorithms.

For this, I need a well-defined dataset: a considerable amount of textual data, query logs containing queries that can be run against it, and a set of relevant (expected) documents that the actual results can be compared with to measure the relevance of the schemes. Could you help me with this?

Thank you so much for your time.

-Regards
-Aarsh
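(For concreteness: comparing weighting schemes in Xapian comes down to swapping the Enquire object's weighting scheme and timing the match. The Python sketch below is a minimal illustration of that idea, not code from this thread; it assumes the Python bindings, a Xapian build recent enough to include the DFR schemes such as PL2Weight, and a pre-built database at the hypothetical path "inex.db".)

    import time
    import xapian

    db = xapian.Database("inex.db")  # hypothetical path to an indexed collection

    def run_query(text, weight):
        enquire = xapian.Enquire(db)
        qp = xapian.QueryParser()
        qp.set_database(db)
        enquire.set_query(qp.parse_query(text))
        enquire.set_weighting_scheme(weight)  # e.g. BM25Weight or a DFR scheme
        start = time.time()
        mset = enquire.get_mset(0, 10)        # retrieve the top 10 results
        return [m.docid for m in mset], time.time() - start

    for name, weight in [("BM25", xapian.BM25Weight()),
                         ("PL2 (DFR)", xapian.PL2Weight())]:
        docids, elapsed = run_query("information retrieval", weight)
        print("%s: %.4fs %r" % (name, elapsed, docids))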
Parth Gupta
2014-Mar-05 11:13 UTC
[Xapian-devel] Test Dataset for performance and accuracy analysis
Hi Aarsh,

Yes, it's very important to test the implemented algorithms on benchmark collections. Most of the evaluation forums (TREC, CLEF, INEX, FIRE, NTCIR) release corresponding datasets. The most suitable one for you would be an ad-hoc collection, which comprises a document collection, topics (the query set), and qrels (relevance judgements).

As these evaluation forums put a lot of effort (and money) into preparing them, the datasets are not easily and freely available. Mostly they are free for research if you register with the forum or participate in its tracks.

I see that the INEX ad-hoc collections for 2009 and 2010 are available on registration, so you can register with them, log in, and download the dataset along with the queries and qrels. The link is:

https://inex.mmci.uni-saarland.de/

Use the ad-hoc collection; it was also used for testing the Letor implementation and BM25 during GSoC 2011 (http://trac.xapian.org/wiki/GSoC2011/LTR/Notes#IREvaluationofLetorrankingscheme).

Cheers,
Parth.
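(A note on the qrels Parth mentions: TREC-style qrels files carry one judgement per line, whitespace-separated — topic id, an iteration field (usually 0), document id, relevance grade. A minimal Python sketch for loading them into per-topic relevant sets follows; the filename is hypothetical, and INEX's exact layout may differ slightly, so check the README that ships with the collection.)

    from collections import defaultdict

    def load_qrels(path):
        # Map each topic id to the set of document ids judged relevant.
        relevant = defaultdict(set)
        with open(path) as f:
            for line in f:
                topic, _iteration, docno, grade = line.split()
                if int(grade) > 0:  # grade 0 means judged non-relevant
                    relevant[topic].add(docno)
        return relevant

    qrels = load_qrels("inex-2010.qrels")  # hypothetical filename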
Aarsh Shah
2014-Mar-05 22:42 UTC
[Xapian-devel] Test Dataset for performance and accuracy analysis
Hi Parth,

I think this solves my problem. One part of my project is to build a performance test module that tests not only the speed but also the relevance of the weighting schemes, to determine whether we could use a better default weighting scheme. Gaurav has already written an evaluation module for Xapian, so I think understanding how to use it, and then feeding it the data you've suggested once I understand the structure of that data, will do the job. I will definitely come back to you if I need more help on the theory side of judging relevance.

Thank you so much for your time. :)

-Regards
-Aarsh
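(For completeness: once a ranked run and the qrels are loaded, the precision and recall figures discussed in this thread reduce to simple set arithmetic over the top-k results. The sketch below is illustrative only, and is independent of Gaurav's evaluation module, whose API is not shown in this thread; the ranked list must use the same document identifiers as the qrels.)

    def precision_recall(ranked_docids, relevant, k=10):
        # Precision and recall of the top-k results against a judged set.
        retrieved = set(ranked_docids[:k])
        hits = len(retrieved & relevant)
        precision = hits / float(k)
        recall = hits / float(len(relevant)) if relevant else 0.0
        return precision, recall

    # e.g. precision_recall(run_ids, qrels["2010001"], 10), where run_ids
    # comes from a search run and qrels from load_qrels() above; the topic
    # id "2010001" is only an example.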