Hello,

I'm Ayush from New Delhi, India. I am interested in the Letor Stabilisation project for GSoC. I have a good background in machine learning. Sorry for getting in so late; university exams were holding me back. I'll try to cover as much as I can in the coming week.

I am following the plan of attack suggested on the project page. These are the things I have completed:

1. Getting the current master branch building cleanly.
2. Going through all the resources and papers mentioned on the project page.
3. Generating lcov test coverage reports.
4. Going through the code in current master of xapian-letor and understanding its functionality.

These are the things I am currently working on:

1. Modifying xapian-letor/bin/questletor.cc to use and test the core features and API of letor. The current version of questletor.cc has a lot of unusable and broken functions and is custom-made for training with the INEX 2010 dataset. The intention is to make it usable with a user-provided database. Currently I am using xapian-docsprint/data/100-objects-v1.csv as my database, with some manually written queries and qrels, to make things work.
2. Going through v-hasu's GSoC 2014 code to understand the extra functionality he added, and planning how to introduce code from his branch.

In summary, the approach I will follow is:

1. Creating a code example that lets the user use 100-objects-v1.csv as the database and use the Letor features and API to make queries over it. Documenting how to run this example.
2. Introducing features from the 2014 projects, adding them to the above example, and documenting them.
3. Writing API and unit tests.

I have some questions:

1. Is the procedure I mentioned above the right way to go about it? What are the essential portions (in terms of code) that I should complete before submitting the proposal?
2. How can I create a test harness for xapian-letor similar to the one in xapian-core and start writing tests? Tests seem somewhat overwhelming to me at the moment; it would be helpful if I could get some assistance on how to go about it.
3. How important is writing new features for this project (for instance, implementing LambdaMART ranking)? Should I focus on them as well in my proposal?

Thanks,
Ayush
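For readers following along, here is a minimal sketch of how the manually written qrels mentioned above might be loaded for experiments. It assumes the judgements are kept in the TREC-style four-column layout (query id, iteration, document id, relevance); that layout, the function name parse_qrels and the sample ids are assumptions for illustration, not something stated in this thread.

```python
from collections import defaultdict


def parse_qrels(lines):
    """Parse TREC-style qrels lines: '<query_id> <iteration> <doc_id> <relevance>'.

    Returns {query_id: {doc_id: relevance}}. The four-column layout is an
    assumption; adjust it to whatever format your qrels file really uses.
    """
    qrels = defaultdict(dict)
    for lineno, line in enumerate(lines, start=1):
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comments
        parts = line.split()
        if len(parts) != 4:
            raise ValueError(f"line {lineno}: expected 4 fields, got {len(parts)}")
        qid, _iteration, docid, rel = parts
        qrels[qid][docid] = int(rel)
    return dict(qrels)


# Hypothetical judgements for two hand-written queries.
sample = """\
q1 Q0 12 1
q1 Q0 47 0
q2 Q0 3 1
"""
judgements = parse_qrels(sample.splitlines())
print(judgements)
```

Keeping the judgements in a plain dictionary like this makes it easy to check by hand that every query has at least one relevant document before any training is attempted.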
On Sun, Mar 20, 2016 at 05:31:37PM +0530, Ayush Tomar wrote:
> I'm Ayush from New Delhi, India. I am interested in Letor Stabilisation
> project for GSoC. I have a good background in machine learning. Sorry for
> getting in so late, university exams were holding me back. I'll try to
> cover as much as I can in the coming week.

Hi, Ayush. Welcome to Xapian!

> 1. Modifying xapian-letor/bin/questletor.cc to use and test core features
> and API of letor. The current version of questletor.cc has a lot of
> unusable and broken functions and is custom made for training with INEX
> 2010 dataset. The intention is to make it usable for a user provided
> database. Currently I am using xapian-docsprint/data/100-objects-v1.csv as
> my database and some manually written queries and qrels to make things
> work.

That's helpful; I haven't looked at questletor in a while. I'm not surprised the master version doesn't work, because (as noted in the project) there's code that we couldn't merge for licensing reasons.

Note that where the project talks about tests, we mean automated tests, probably unit tests. It's worth looking at how xapian-core does these, because we'd expect a similar approach for xapian-letor. (I think you're already clear on that, but I wanted to make sure!)

> 2. Going through v-hasu's GSoC 2014 code to understand extra
> functionalities added by him and planning how to introduce code from his
> branch.

Good.

> 1. Creating a code example that lets the user use 100-objects-v1.csv as the
> database and use Letor features and API to make queries over it.
> Documenting how to make this example run.

Note again that master probably won't be sufficient to do this. The missing functionality (i.e. the unmerged work) was rewritten on v-hasu's (Hanxiao Sun) branch, so it can be pulled from there to form the base.

> 3. Writing API and unit tests

Note that, as the project description states, these should be done alongside the integration work, rather than considered separately.

> I have some question:
>
> 1. Is the procedure I mentioned above the right way to go about it? What
> are the essential portions (in terms of code) that I should complete before
> submitting the proposal?

It's not essential to complete any code ahead of the proposal, and as you have only a week now to do the proposal, that needs to be your focus. Working with the code, however, is important for understanding what work needs to be done (and so will inform your proposal). So it's not necessary to be able to submit pull requests yet, but the work you've been doing in getting familiar with what code is there will form the basis of your proposal.

> 2. How can I create the test harness for xapian-letor similar to
> xapian-core and start writing tests? Tests seem somewhat overwhelming to me
> at the moment, it would be helpful if I could get some assistance on how to
> go about it.

You'll need to copy the test harness. What I'd do is to copy the whole of the xapian-core/tests directory, then cut out all the actual tests. What's left should be the harness and supporting code. (You'll need to write some more support to

> 3. How important is writing new features for this project (for instance
> implementing LambdaMART ranking)? Should I focus on them as well in my
> proposal?

Not at all. There's more than enough work in stabilising and integrating previous work, writing tests and documentation, and creating a fully-working system suitable for general use. If you were to integrate all of v-hasu's branch and get that merged, then there's VcamX's (Jiarong Wei) work to look at from 2014, although that would require some more planning at the time (I wouldn't plan for that in your proposal).

J

-- 
James Aylett, occasional trouble-maker
xapian.org
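The "copy the tests directory, then cut out the actual tests" step described above could be sketched roughly as follows. This is only an illustration under stated assumptions: the helper name copy_harness is hypothetical, and the idea that the test cases to remove can be identified by an api_*.cc filename pattern is a guess that should be checked against the real xapian-core/tests directory. The demo runs against placeholder files in a temporary directory, not a real checkout, and it does not touch build files, which would also need editing by hand.

```python
import fnmatch
import shutil
import tempfile
from pathlib import Path


def copy_harness(src, dst, test_patterns=("api_*.cc",)):
    """Copy the tree at src to dst, then delete files that look like test cases.

    test_patterns is an assumption about how test-case files are named.
    Returns the removed paths (relative to dst), sorted.
    """
    src, dst = Path(src), Path(dst)
    shutil.copytree(src, dst)
    removed = []
    # sorted() materialises the listing before anything is deleted.
    for path in sorted(dst.rglob("*")):
        if path.is_file() and any(
            fnmatch.fnmatch(path.name, pattern) for pattern in test_patterns
        ):
            path.unlink()
            removed.append(path.relative_to(dst).as_posix())
    return removed


# Demo on a throwaway directory with placeholder files.
with tempfile.TemporaryDirectory() as tmp:
    core_tests = Path(tmp) / "xapian-core" / "tests"
    (core_tests / "harness").mkdir(parents=True)
    for name in ("api_db.cc", "api_query.cc", "apitest.cc", "harness/testsuite.cc"):
        (core_tests / name).write_text("// placeholder\n")

    letor_tests = Path(tmp) / "xapian-letor" / "tests"
    removed = copy_harness(core_tests, letor_tests)
    kept = sorted(
        p.relative_to(letor_tests).as_posix()
        for p in letor_tests.rglob("*")
        if p.is_file()
    )
    print("removed:", removed)
    print("kept:", kept)
```

Returning the list of removed files makes it easy to review exactly what was cut before committing the new tests directory.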
Hi Ayush,

On top of what James has said, I would recommend focusing first on VcamX's branch, as he was working on API streamlining while v-hasu was implementing additional ranking algorithms. So have a look at it and realign your thoughts while working on the proposal. He already tried to refactor questletor.cc into more independent tasks, such as letor-prepare.cc, letor-train.cc etc.

I have had a go at merging VcamX's master with xapian master, and the result is here: https://github.com/parthg/xapian

Most of the conflicts are resolved, except for the "MSet"-related parts in enquire.h.

You can play with it if you get time; it would definitely give you more insight into the current code-base.

Cheers
Parth

On Sun, Mar 20, 2016 at 7:32 PM, James Aylett <james-xapian at tartarus.org> wrote:
> > 3. How important is writing new features for this project (for instance
> > implementing LambdaMART ranking)? Should I focus on them as well in my
> > proposal?
>
> Not at all. There's more than enough work in stabilising and
> integrating previous work, writing tests and documentation, and
> creating a fully-working system suitable for general use. If you were
> to integrate all of v-hasu's branch and get that merged, then there's
> VcamX's (Jiarong Wei) work to look at from 2014, although that would
> require some more planning at the time (I wouldn't plan for that in
> your proposal).
>
> J
>
> --
> James Aylett, occasional trouble-maker
> xapian.org