Rishabh Mehrotra
2012-Apr-01 11:30 UTC
[Xapian-devel] [GSoC2012] Learning to Rank: few thoughts/issues
Hello, I would like to work with Orange as part of GSoC 2012(and continue henceforth). Apologies for joining in a bit late- i was waiting to get a proper grasp of things before discussing it here. Currently I am a Masters students in Mathematics with my bachelors in Computer Science[integrated dual degree]. Over the last year and a half, I have worked on a few ML projects and have a couple of publications(including one at an ACL'11<http://www.acl2011.org/>workshop). Last year at Machine Learning Summer School[MLSS<http://mlss2011.comp.nus.edu.sg/index.php?n=Site.Speakers>] at NUS, I attended Hang Li<http://research.microsoft.com/en-us/people/hangli/>(MSR)'s tutorial on Learning to Rank. I have discussed a few things with him(over mail) about feature extraction for LTR algorithms. Over the last week I have been following the mailing list discussions here and researching a bit about the issues myself. I wanted to discuss about a few issues/thoughts: *Doubt1:* *Feature Extraction/Selection:* The various datasets listed on MSR's LETOR have a limited set of features. Current implementation in xapian's LETOR has 5 features[tf,idf,doc_len,coll_tf,coll_len]. While algorithms for learning ranking models have been intensively studied, this is not the case for feature selection, despite of its importance. In a paper presented at SIGIR'07 [Tier1 in IR domain], the authors have highlighted the effectiveness of feature selection methods for ranking tasks.[link<http://research.microsoft.com/en-us/people/tyliu/fsr.pdf>] I believe that apart from the traditional/cliched IR features, we should*incorporate new features * to improve the performance of the LETOR module. *Using unlabeled data:* Over the last 3-4 years a lot of papers have identified the importance of using unlabeled data to assist the task at hand by using it during feature extraction stage. Andrew Ng proposed a Self-Taught learning framework[ICML'07 paper<http://ai.stanford.edu/~hllee/icml07-selftaughtlearning.pdf>] wherein they make use of unlabeled data to improve performance. A very recent paper at ICML'11<http://eprints.pascal-network.org/archive/00008597/01/342_icmlpaper.pdf>used the advantage of feature learning using unlabeled data and beat the state-of-the-art in sentiment classification. Combining the above two points, I suggest an approach which uses features learnt from data in an unsupervised fashion "*in addition to*" the commonly used features. *Please note:* all this is in addition to the traditional features and finally we would be using *listwise/pairwise approaches*[ListMLE, et cetera] to train our models on the new set of features. Please let me know if this sounds good. *Doubt2:* *Rank Aggregation:* Now that Xappian will have >1 Learning to rank algorithms, we should look into some kind of rank aggregation as well: combining outputs by various algorithms to get a final rank ordering for results. I went though a ECML'07 paper on unsupervised method for the same[link<http://l2r.cs.uiuc.edu/~danr/Papers/KlementievRoSm07.pdf>]. I haven't yet completely understood their approach but will do so by the end of day. *Modularity:* Developing such modules in a modular fashion such that its not necessary to use all of them all the times, would be good. Whenever the user feels that in addition to basic features, he/she could use additional features, the feature extraction module could be plugged in. Same for rank aggregation. *Relevant Background:* I have worked on few research oriented projects in Machine Learning, but most of them involved coding in Matlab/Java. More details about me: [link<http://www.rishabhmehrotra.com/index.htm> ]. I have been working on a project on Topic Modeling(using Latent Dirichlet Allocation) for Tweets. Link <http://code.google.com/p/tweettrends/> of the code on Google code. Also, I am involved in a collage project on building *focused crawler *& extending it to something like NELL <http://rtw.ml.cmu.edu/rtw/><far-fetched dream as of now :) >.[Google code link<http://code.google.com/p/bits-crawler/source/browse/> ] Please let me know how you feel about the above mentioned points [and/or if I am way off the track]. Best, Rishabh. -------------- next part -------------- An HTML attachment was scrubbed... URL: <http://lists.xapian.org/pipermail/xapian-devel/attachments/20120401/dec53ec2/attachment.html>
Parth Gupta
2012-Apr-02 10:22 UTC
[Xapian-devel] [GSoC2012] Learning to Rank: few thoughts/issues
Hello Rishabh, Good to hear from you. Its never late to jump-in for GSoC. *Doubt1:*> > *Feature Extraction/Selection:* > The various datasets listed on MSR's LETOR have a limited set of features. > Current implementation in xapian's LETOR has 5 > features[tf,idf,doc_len,coll_tf,coll_len]. While algorithms for learning > ranking models have been intensively studied, this is not the case for > feature selection, despite of its importance. In a paper presented at > SIGIR'07 [Tier1 in IR domain], the authors have highlighted the > effectiveness of feature selection methods for ranking tasks.[link<http://research.microsoft.com/en-us/people/tyliu/fsr.pdf>] > I believe that apart from the traditional/cliched IR features, we should*incorporate new features > * to improve the performance of the LETOR module. > >There is no point denying the fact that there is a need for more features. If you have noticed on the GSoC idea page of Letor, it says "The project can also include some work on the features, like adding support for more features, selecting a subset of features, etc." Now the point comes, which features you want to incorporate. The Letor datasets are growing enormously in terms of number of features [Letor MSR 46 ->136 , Yahoo Dataset 700]. It would make sense to incorporate those features which can be tracked and suits the environment. More over majority of the features dwell around the IR measures like bm25, TF, IDF, LM and different combination of them for different part of the document. Some of the other features of Letor Datasets are number of outgoing links, number of incoming links, page rank, number of children [1,2]. These features are valid and available only in the linked data and moreover, straight forward to compute. Yahoo dataset does not even declare the features because of the proprietary issues. But I think it also includes some features like the age of the page, number of clicks on it, total time spent, and so on. [1] http://research.microsoft.com/en-us/um/beijing/projects/letor/LETOR4.0/Data/Features_in_LETOR4.pdf [2] http://research.microsoft.com/en-us/projects/mslr/feature.aspx> > *Using unlabeled data:* > Over the last 3-4 years a lot of papers have identified the importance of > using unlabeled data to assist the task at hand by using it during feature > extraction stage. Andrew Ng proposed a Self-Taught learning > framework[ICML'07 paper<http://ai.stanford.edu/%7Ehllee/icml07-selftaughtlearning.pdf>] > wherein they make use of unlabeled data to improve performance. A very > recent paper at ICML'11<http://eprints.pascal-network.org/archive/00008597/01/342_icmlpaper.pdf>used the advantage of feature learning using unlabeled data and beat the > state-of-the-art in sentiment classification. > > Combining the above two points, I suggest an approach which uses features > learnt from data in an unsupervised fashion "*in addition to*" the > commonly used features. > *Please note:* all this is in addition to the traditional features and > finally we would be using *listwise/pairwise approaches*[ListMLE, et > cetera] to train our models on the new set of features. Please let me know > if this sounds good. > >This phenomenon, Semi-supervised ranking, is indeed interesting. If you want to incorporate it, feel free to discuss the plan.> *Doubt2:* > > *Rank Aggregation:* > Now that Xappian will have >1 Learning to rank algorithms, we should look > into some kind of rank aggregation as well: combining outputs by various > algorithms to get a final rank ordering for results. I went though a > ECML'07 paper on unsupervised method for the same[link<http://l2r.cs.uiuc.edu/%7Edanr/Papers/KlementievRoSm07.pdf>]. > I haven't yet completely understood their approach but will do so by the > end of day. > >Rank Aggregation, is another LTR approach with a set of ranked lists at hand for the query. At the moment Xapian can have 2 ranked list, BM25 and SVM based LTR scheme. I think these techniques will produce better results with the input of more number of ranked list than Xapian can offer at the moment. But it would be interesting to explore after some more ranking schemes incorporation.> > *Modularity:* > Developing such modules in a modular fashion such that its not necessary > to use all of them all the times, would be good. Whenever the user feels > that in addition to basic features, he/she could use additional features, > the feature extraction module could be plugged in. Same for rank > aggregation. >Agreed and that in fact, that will be the goal. Best, Parth.> *Relevant Background:* > I have worked on few research oriented projects in Machine Learning, but > most of them involved coding in Matlab/Java. More details about me: [link<http://www.rishabhmehrotra.com/index.htm> > ]. > I have been working on a project on Topic Modeling(using Latent Dirichlet > Allocation) for Tweets. Link <http://code.google.com/p/tweettrends/> of > the code on Google code. Also, I am involved in a collage project on > building *focused crawler *& extending it to something like NELL<http://rtw.ml.cmu.edu/rtw/><far-fetched > dream as of now :) >.[Google code link<http://code.google.com/p/bits-crawler/source/browse/> > ] > > Please let me know how you feel about the above mentioned points [and/or > if I am way off the track]. > > Best, > Rishabh. > > _______________________________________________ > Xapian-devel mailing list > Xapian-devel at lists.xapian.org > http://lists.xapian.org/mailman/listinfo/xapian-devel > >-------------- next part -------------- An HTML attachment was scrubbed... URL: <http://lists.xapian.org/pipermail/xapian-devel/attachments/20120402/a44c1540/attachment.html>