Hi,

Before starting my proposal, I wanted to know what the expected output of the
Letor module is. Is it for transfer learning (i.e. you learn from one dataset
and leverage it to predict the rankings of another dataset) or is it for
supervised learning?

For instance - Xapian currently powers the Gmane search, which is by default
based on the BM25 weighting scheme. Now suppose we want to use LETOR to rank
the top k retrieved search results; let's take SVMRanker as an example. Will
it rank Gmane's search results based on the weights learned from the INEX
dataset, given that the client won't be providing any training file? I also
don't think it will perform well across two datasets with different
distributions. So how are we going to use it?

PROPOSAL -

1. Sorting out the Letor API will include -

   - Implementing SVMRanker and checking its evaluation results against the
     already generated values.

   - Implementing evaluation methods. Those methods will include MAP and
     NDCG. (Is there any other method in particular that could be implemented
     besides these two? A rough NDCG sketch is in the P.S. below.)

   - Checking the performance of ListMLE and ListNet against SVMRanker.
     (This assumes both ListMLE and ListNet have been implemented correctly,
     but we don't have any tested performance measurement for either
     algorithm, so I want to know what the course of action should be here.)

   - Implementing a rank aggregator. I've read about the Kemeny-Young method.
     Can you provide the names of the algorithms that should be implemented
     here, or what was proposed the year before last? Also, is there a way to
     check any ranker's performance (since the INEX dataset doesn't provide
     rankings)?

2. Implementing automated tests will include -

   - For testing, 20 documents and 5 queries can be picked from the INEX
     dataset, put to the test and checked against their expected outputs.

   - The implemented evaluation metrics can also be used to test the learning
     algorithms.

3. Implementing a feature selection algorithm -

   - I have a question here. Why are we planning to implement a feature
     selection algorithm when we have only 19 features? I don't think that
     will over-fit the dataset. Also, from what I have learnt, feature
     selection algorithms (like PCA in classification) are used only for time
     or space efficiency.

Please do provide some feedback so that I can improve upon it.

-Mayank
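P.S. As a rough illustration of the NDCG part of the evaluation-methods item
above, here is a minimal sketch of NDCG@k over graded relevance labels. The
function name and the plain std::vector interface are placeholders I made up
for illustration, not anything in the current letor code:

    #include <algorithm>
    #include <cmath>
    #include <functional>
    #include <vector>

    // NDCG@k for graded relevance labels given in ranked order.
    // Placeholder helper, not part of the existing xapian-letor API.
    double ndcg_at_k(const std::vector<double> & labels, size_t k) {
        size_t n = std::min(k, labels.size());
        double dcg = 0.0;
        for (size_t i = 0; i < n; ++i)
            dcg += (std::pow(2.0, labels[i]) - 1.0) / std::log2(double(i + 2));
        // Ideal DCG: the same labels sorted by decreasing relevance.
        std::vector<double> ideal(labels);
        std::sort(ideal.begin(), ideal.end(), std::greater<double>());
        double idcg = 0.0;
        for (size_t i = 0; i < n; ++i)
            idcg += (std::pow(2.0, ideal[i]) - 1.0) / std::log2(double(i + 2));
        return idcg > 0.0 ? dcg / idcg : 0.0;
    }

MAP would be computed analogously, averaging the precision at each relevant
rank over all queries.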
Well, I found one interesting graph in the paper "Feature selection for
Ranking".

[Inline image: plot of MAP against the number of selected features, from the
paper.]

The evaluation metric (MAP) roughly doubles when the number of features is
reduced from 15 to 1. That is really motivating for implementing a feature
selection algorithm. I was wondering if we could add more features to the
feature vector so that we have a considerable number of features to select
from.
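Concretely, the kind of feature selection I have in mind is something like a
simple greedy forward selection driven by the evaluation metric. A very rough
sketch follows; the evaluate_map callback (train a ranker on a feature subset
and return its MAP on a validation set) is hypothetical and just stands in
for whatever evaluation machinery the module ends up with:

    #include <functional>
    #include <set>

    // Greedy forward selection: repeatedly add the single feature that most
    // improves MAP, stopping when no remaining feature helps.
    std::set<int>
    select_features(int num_features,
                    const std::function<double(const std::set<int> &)> & evaluate_map)
    {
        std::set<int> selected;
        double best_map = 0.0;
        bool improved = true;
        while (improved) {
            improved = false;
            int best_feature = -1;
            for (int f = 0; f < num_features; ++f) {
                if (selected.count(f)) continue;
                std::set<int> candidate(selected);
                candidate.insert(f);
                double map = evaluate_map(candidate);
                if (map > best_map) {
                    best_map = map;
                    best_feature = f;
                    improved = true;
                }
            }
            if (improved) selected.insert(best_feature);
        }
        return selected;
    }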
Hi Mayank,

> Before starting my proposal, I wanted to know what the expected output of
> the Letor module is. Is it for transfer learning (i.e. you learn from one
> dataset and leverage it to predict the rankings of another dataset) or is
> it for supervised learning?
>
> For instance - Xapian currently powers the Gmane search, which is by
> default based on the BM25 weighting scheme. Now suppose we want to use
> LETOR to rank the top k retrieved search results; let's take SVMRanker as
> an example. Will it rank Gmane's search results based on the weights
> learned from the INEX dataset, given that the client won't be providing
> any training file? I also don't think it will perform well across two
> datasets with different distributions. So how are we going to use it?

The actual purpose of xapian-letor is to provide a learning-to-rank system to
users who want to perform search. That may sound naive and simple, but it is
the actual goal of the letor module. Letor being a supervised ranking
approach, it requires gold-standard labels, which unsupervised methods like
BM25 or TF-IDF do not.

From the application point of view, we provide the user a complete API which
she can deploy to rank search results. We do not provide any gold-standard
data or document collection. Hence, if the user has a document collection she
intends to search and has some gold labels for that collection, she is good
to use xapian-letor. We provide a platform which can extract features from
documents to create a training collection, learn the ranking function, and
then rank results for unseen queries once the model is trained. Of course it
is a "little" hard to obtain gold labels, but research on clickthrough data
is providing means to obtain some automatically.
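Roughly, the flow we are aiming for from the user's side is the one sketched
below. To be clear, the ToyRanker class is a stand-in I made up for whatever
ranker class the letor API exposes, not the actual xapian-letor API; only the
ordinary Xapian retrieval calls are real:

    #include <xapian.h>
    #include <iostream>
    #include <string>
    #include <vector>

    // Stand-in ranker: a real implementation would extract feature vectors,
    // train a model (e.g. SVMRanker) on the labelled data, and score unseen
    // documents with the learnt model.
    struct ToyRanker {
        void train(const std::string & /*training_file*/) { /* learn model */ }
        std::vector<Xapian::docid> rerank(const Xapian::MSet & mset) const {
            std::vector<Xapian::docid> order;
            for (Xapian::MSetIterator i = mset.begin(); i != mset.end(); ++i)
                order.push_back(*i);   // real code would sort by learnt scores
            return order;
        }
    };

    int main() {
        // 1. Ordinary Xapian retrieval gives a BM25-ranked top-k.
        Xapian::Database db("/path/to/users/collection");
        Xapian::Enquire enquire(db);
        Xapian::QueryParser parser;
        parser.set_database(db);
        enquire.set_query(parser.parse_query("example query"));
        Xapian::MSet topk = enquire.get_mset(0, 100);

        // 2. Train on the user's own gold-labelled data (not on INEX).
        ToyRanker ranker;
        ranker.train("/path/to/users/training_file.txt");

        // 3. Re-rank the top-k with the learnt model.
        std::vector<Xapian::docid> reranked = ranker.rerank(topk);
        for (size_t i = 0; i < reranked.size(); ++i)
            std::cout << reranked[i] << '\n';
    }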
> PROPOSAL -
>
> 1. Sorting out the Letor API will include -
>
>    - Implementing SVMRanker and checking its evaluation results against
>      the already generated values.
>
>    - Implementing evaluation methods. Those methods will include MAP and
>      NDCG. (Is there any other method in particular that could be
>      implemented besides these two?)

The most common are these two. While implementing them you will also use
precision and recall.

>    - Checking the performance of ListMLE and ListNet against SVMRanker.
>      (This assumes both ListMLE and ListNet have been implemented
>      correctly, but we don't have any tested performance measurement for
>      either algorithm, so I want to know what the course of action should
>      be here.)

We need to check how ListMLE and ListNet perform and, if something is wrong,
debug them. The best approach is to use a common evaluation environment for
all three rankers and check/correct.

>    - Implementing a rank aggregator. I've read about the Kemeny-Young
>      method. Can you provide the names of the algorithms that should be
>      implemented here, or what was proposed the year before last? Also, is
>      there a way to check any ranker's performance (since the INEX dataset
>      doesn't provide rankings)?

I am not sure whether we should include rank aggregation or not, but one
paper to refer to would be
http://www.cs.toronto.edu/~zemel/documents/cikm2012_paper.pdf

> 2. Implementing automated tests will include -
>
>    - For testing, 20 documents and 5 queries can be picked from the INEX
>      dataset, put to the test and checked against their expected outputs.
>
>    - The implemented evaluation metrics can also be used to test the
>      learning algorithms.

I think last year Gaurav Arora (IRC nick: samuaelharden) was handling some
evaluation, but I am not sure of its state. You can check whether it can be
used for letor, in terms of passing a Letor::RankList as a parameter and
receiving a MAP or NDCG value.

> 3. Implementing a feature selection algorithm -
>
>    - I have a question here. Why are we planning to implement a feature
>      selection algorithm when we have only 19 features? I don't think that
>      will over-fit the dataset. Also, from what I have learnt, feature
>      selection algorithms (like PCA in classification) are used only for
>      time or space efficiency.

Feature selection is a utility for anyone who wants to use it. xapian-letor
can also operate on data outside the limit of the currently implemented 19
features. These 19 features are the ones we can extract, but if the user
already has a training file with 300 features, she should be able to train
the letor model over that file and, when she wants to rank a document, to
provide a similar feature vector; in between, a feature selection algorithm
can help. Feature selection algorithms have really been shown to
significantly outperform the full feature set. See both feature selection
references in the resources section of the Project Ideas page.

This time we want to make sure that adding more features becomes very easy
for anybody. For example, a new feature could be the term frequency of the
query terms in the URL, which would become the 20th feature. The API should
be very flexible for this extension (a rough sketch of what that could look
like is in the P.S. below).

Cheers,
Parth.
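P.S. To make the "easy to add a new feature" point concrete, here is purely a
design sketch of what a pluggable feature interface could look like; nothing
like this exists in the module yet, and the document value slot used for the
URL is an assumption for illustration only:

    #include <xapian.h>
    #include <string>

    // Hypothetical interface: each feature computes one value per
    // (query, document) pair.
    class Feature {
      public:
        virtual ~Feature() { }
        virtual double value(const Xapian::Query & query,
                             const Xapian::Document & doc) const = 0;
        virtual std::string name() const = 0;
    };

    // Example of the proposed 20th feature: how many query terms occur in
    // the document's URL (assumed here to be stored in value slot 0).
    class QueryTermsInUrlFeature : public Feature {
      public:
        double value(const Xapian::Query & query,
                     const Xapian::Document & doc) const {
            const std::string url = doc.get_value(0);
            double count = 0;
            for (Xapian::TermIterator t = query.get_terms_begin();
                 t != query.get_terms_end(); ++t) {
                if (url.find(*t) != std::string::npos) ++count;
            }
            return count;
        }
        std::string name() const { return "query_terms_in_url"; }
    };

A new feature would then just be one more subclass, and the feature
extraction code would loop over a list of registered Feature objects instead
of hard-coding the nineteen current calculations.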
On Thu, Mar 13, 2014 at 03:17:34AM +0530, Mayank Chaudhary wrote:
> The evaluation metric (MAP) roughly doubles when the number of features is
> reduced from 15 to 1. That is really motivating for implementing a feature
> selection algorithm. I was wondering if we could add more features to the
> feature vector so that we have a considerable number of features to select
> from.

I'd agree with Parth that making it easy to add support for new features
would be very useful. If there's time, you could also actually add some more
features (so perhaps that's a good "optional goal" for your proposal), but I
think that's less important than getting the module to a state where users
can easily start to use it.

Cheers,
    Olly