thr3ads.net - Xapian devel - GSoC 2016 Letor Stabilisation [Mar 2016]


Ayush Tomar

2016-Mar-20 12:01 UTC

GSoC 2016 Letor Stabilisation

Hello,

I'm Ayush from New Delhi, India. I am interested in the Letor Stabilisation
project for GSoC. I have a good background in machine learning. Sorry for
getting in so late; university exams were holding me back. I'll try to
cover as much as I can in the coming week.

I am following the plan of attack suggested on the project page. These are
the things I have completed so far:

1. Getting current master branch building cleanly.
2. Going through all resources and papers mentioned on the project page.
3. Generating lcov test coverage reports.
4. Going through code in current master of xapian-letor and understanding
all functionalities.

These are the things I am currently working on:

1. Modifying xapian-letor/bin/questletor.cc to use and test the core features
and API of Letor. The current version of questletor.cc has a lot of
unusable and broken functions and is custom-made for training with the INEX
2010 dataset. The intention is to make it usable with a user-provided
database. Currently I am using xapian-docsprint/data/100-objects-v1.csv as
my database, with some manually written queries and qrels to make things
work.
2. Going through v-hasu's GSoC 2014 code to understand the extra
functionality he added, and planning how to introduce code from his
branch.
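
As an illustration of the "manually written queries and qrels" mentioned in item 1, here is a minimal Python sketch of loading and sanity-checking such files. The formats are assumptions, not something stated in this thread: one query per line as `<qid> '<query text>'`, and TREC-style qrel lines as `<qid> Q0 <docid> <relevance>`. The function names, sample queries and document ids are all hypothetical.

```python
from collections import defaultdict


def parse_queries(lines):
    """Parse lines of the assumed form: <qid> '<query text>'."""
    queries = {}
    for line in lines:
        line = line.strip()
        if not line:
            continue
        # Split on the first space only; the rest is the quoted query text.
        qid, _, text = line.partition(" ")
        queries[qid] = text.strip().strip("'")
    return queries


def parse_qrels(lines):
    """Parse assumed TREC-style lines: <qid> Q0 <docid> <relevance>."""
    qrels = defaultdict(dict)
    for line in lines:
        fields = line.split()
        if not fields:
            continue
        if len(fields) != 4:
            raise ValueError(f"malformed qrel line: {line!r}")
        qid, _iteration, docid, rel = fields
        qrels[qid][docid] = int(rel)
    return dict(qrels)


def unjudged_queries(queries, qrels):
    """Return the ids of queries that have no relevance judgements at all."""
    return sorted(qid for qid in queries if qid not in qrels)


# Hypothetical sample data; real files would be read with open(...).
sample_queries = ["1 'pocket watch'", "2 'steam engine'", "3 'sundial'"]
sample_qrels = ["1 Q0 1974-100 1", "1 Q0 1985-438 0", "2 Q0 1953-404 1"]

queries = parse_queries(sample_queries)
qrels = parse_qrels(sample_qrels)
unjudged = unjudged_queries(queries, qrels)
```

A check like `unjudged_queries` is useful with hand-written data, since a query without any judgements contributes nothing to training.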

In summary, the approach I will follow is going to be:

1. Creating a code example that lets the user use 100-objects-v1.csv as the
database and use the Letor features and API to make queries over it, and
documenting how to run this example.
2. Introducing features from the 2014 projects, adding them to the above
example, and documenting them.
3. Writing API and unit tests.

I have some questions:

1. Is the procedure I mentioned above the right way to go about it? What
are the essential portions (in terms of code) that I should complete before
submitting the proposal?
2. How can I create a test harness for xapian-letor similar to the one in
xapian-core and start writing tests? Tests seem somewhat overwhelming to me
at the moment; it would be helpful if I could get some assistance on how to
go about it.
3. How important is writing new features for this project (for instance
implementing LambdaMART ranking)? Should I focus on them as well in my
proposal?

Thanks,
Ayush

James Aylett

2016-Mar-20 14:02 UTC


On Sun, Mar 20, 2016 at 05:31:37PM +0530, Ayush Tomar wrote:
> I'm Ayush from New Delhi, India. I am interested in the Letor Stabilisation
> project for GSoC. I have a good background in machine learning. Sorry for
> getting in so late; university exams were holding me back. I'll try to
> cover as much as I can in the coming week.
Hi, Ayush. Welcome to Xapian!
> 1. Modifying xapian-letor/bin/questletor.cc to use and test the core features
> and API of Letor. The current version of questletor.cc has a lot of
> unusable and broken functions and is custom-made for training with the INEX
> 2010 dataset. The intention is to make it usable with a user-provided
> database. Currently I am using xapian-docsprint/data/100-objects-v1.csv as
> my database, with some manually written queries and qrels to make things
> work.
That's helpful; I haven't looked at questletor in a while. I'm not
surprised the master version doesn't work, because (as noted in the
project) there's code that we couldn't merge for licensing reasons.

Note that where the project talks about tests, we mean automated
tests, probably unit tests. It's worth looking at how xapian-core does
these, because we'd expect a similar approach for xapian-letor. (I
think you're already clear on that, but I wanted to make sure!)
> 2. Going through v-hasu's GSoC 2014 code to understand extra
> functionalities added by him and planning how to introduce code from his
> branch.
Good.
> 1. Creating a code example that lets the user use 100-objects-v1.csv as the
> database and use Letor features and API to make queries over it.
> Documenting how to make this example run.
Note again that master probably won't be sufficient to do this. The
missing functionality (i.e. the unmerged work) was rewritten on v-hasu's
(Hanxiao Sun) branch, so it can be pulled from there to form the base.
> 3. Writing API and unit tests
Note, as the project description states, that these should be done
alongside the integration work, rather than considered separately.
> I have some questions:
> 
> 1. Is the procedure I mentioned above the right way to go about it? What
> are the essential portions (in terms of code) that I should complete before
> submitting the proposal?
It's not essential to complete any code ahead of the proposal, and as
you have only a week now to do the proposal, that needs to be your
focus. Working with the code, however, is important to understand what
work needs to be done (and so will inform your proposal). So it's not
necessary to be able to submit pull requests yet, but the work you've
been doing in getting familiar with what code is there will form the
basis of your proposal.
> 2. How can I create a test harness for xapian-letor similar to the one in
> xapian-core and start writing tests? Tests seem somewhat overwhelming to me
> at the moment; it would be helpful if I could get some assistance on how to
> go about it.
You'll need to copy the test harness. What I'd do is to copy the whole
of the xapian-core/tests directory, then cut out all the actual
tests. What's left should be the harness and supporting code. (You'll
need to write some more support to 
> 3. How important is writing new features for this project (for instance
> implementing LambdaMART ranking)? Should I focus on them as well in my
> proposal?
Not at all. There's more than enough work in stabilising and
integrating previous work, writing tests and documentation, and
creating a fully-working system suitable for general use. If you were
to integrate all of v-hasu's branch and get that merged, then there's
VcamX's (Jiarong Wei) work to look at from 2014, although that would
require some more planning at the time (I wouldn't plan for that in
your proposal).

J

-- 
  James Aylett, occasional trouble-maker
  xapian.org

Parth Gupta

2016-Mar-21 17:59 UTC


Hi Ayush,

On top of what James has said, I would recommend focusing first on
VcamX's branch, as he was working on API streamlining while v-hasu was
implementing additional ranking algorithms. So have a look at it and
realign your thoughts while working on the proposal. He has already tried to
refactor questletor.cc into more independent tasks such as
letor-prepare.cc, letor-train.cc etc.

I have had a go at merging VcamX's master with Xapian master, and the
result is here: https://github.com/parthg/xapian

Most of the conflicts are resolved, except for the "MSet"-related parts in
enquire.h.

You can play with it if you get time; it would definitely give you more
insight into the current code-base.

Cheers
Parth



