Hi all,

Thank you for giving me the opportunity to work with Xapian :) I am Jiarong Wei, a third-year undergraduate student at Zhejiang University, China. In GSoC 2014 I will be working on the Letor module with Hanxiao Sun.

Here are some questions I encountered these days:

1. In letor.cc we have two parts: the training part and the ranking part. I'll use SVMRanker as an example. The training part basically uses the libsvm library and training data to train a model, then saves the model file. The ranking part calculates a score for each document in the search results (MSet) using the trained model file. My question is: for each of our three rankers -- 1) SVMRanker, 2) ListMLE, 3) ListNet -- do we need three different training parts? (The ranking part has the same form for each of them, I think.) I'm not sure whether the parameters for these three rankers are the same or not (I guess they're different). In my understanding, letor.cc basically just passes parameters to the ranker; it's the ranker that actually does the training and the score calculation. So if we can generalise the form of the training part, we don't need functions like prepare_training_data_for_svm, prepare_training_data_for_listwise etc.; we just need a single prepare_training_data. (We can benefit from inheritance of the ranker in the training part just as in the ranking part -- I've put a rough sketch of the kind of interface I mean at the end of this mail.)

2. There is one thing I have to confirm: once we have the trained model (like the model file of SVMRanker), we won't train that model again in general. (The behaviour of questletor.cc under bin/ confuses me.)

3. Since RankList will be removed, according to the meeting last week, its related information will be stored under MSet::Internal. My plan is to create a new class under MSet::Internal. That class will hold two kinds of feature vectors: a normalised one and an unnormalised one. Since it's in MSet::Internal, there is a wrapper class outside it, I think, so it also needs to provide corresponding APIs in that wrapper class. Also, the ranker will use MSet instead of RankList. Do you have any suggestions for this part?

4. For FeatureVector, I think it could be discarded, since it just stores the feature vector of each document; that information will be stored in the new class under MSet::Internal mentioned in 3.

5. For Feature (letor_feature.cc), I think it could be a static class. It mainly focuses on the calculation of the different features. For this part I'm trying to figure out a better way to implement it. In the meeting last week, Olly and Parth suggested using a dispatching function to calculate the different kinds of features, because different features, like query-related features and document features, use different parameters in their calculation. With that approach we would have to write every calculation method in the same class, which makes it a little hard to extend with more features. If a user wants to use his own feature, he needs to modify our source code instead of just adding his own feature-calculation class and telling the Letor module to use it. I just think that's not very convenient for extending features. In GSoC 2014 I also need to implement a feature selection algorithm, so I think this part -- the extensibility of features -- is kind of important.

6. FeatureManager will set the context for feature calculation: setting the Database, setting the query, and saying which features we want. It provides some basic information like term frequency and inverse document frequency etc. It will also have a function update_mset to attach the feature information to the MSet.
7. For feature selection, I don't know when to apply this selection. We will provide the features we want to use to FeatureManager. So will the feature selection provide information like "this feature is better, so it gets a larger weight"? Or will the algorithm select a subset of the features we provide and use that subset to generate the feature vectors?

8. Do we have documentation about unit tests? That's also something Hanxiao is looking for.

9. For automated tests, my idea is to use some data to test the functionality of the Letor module. It will also cover different configurations, like using different rankers. I think I need some help with this part. Can someone give me some advice?

Thanks for your help :)

Jiarong Wei
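P.S. Here is the rough sketch I mentioned under question 1. All the class and method names below are invented just to illustrate the idea; they are not the existing xapian-letor API:

    // Rough sketch only: names are invented for illustration, not the
    // current xapian-letor classes.
    #include <string>
    #include <vector>

    class Ranker {
      public:
        virtual ~Ranker() { }

        // Each concrete ranker trains its own kind of model from the
        // common training data file and writes it to model_file.
        virtual void train(const std::string & training_file,
                           const std::string & model_file) = 0;

        // Each concrete ranker scores one document's feature vector
        // using its trained model.
        virtual double score(const std::vector<double> & feature_values,
                             const std::string & model_file) const = 0;
    };

    // SVMRanker (using libsvm internally), ListMLE and ListNet would
    // each subclass Ranker; definitions are omitted from this sketch.
    class SVMRanker : public Ranker {
      public:
        void train(const std::string & training_file,
                   const std::string & model_file);
        double score(const std::vector<double> & feature_values,
                     const std::string & model_file) const;
    };

The point is just that training, like ranking, could go through the same virtual interface, so letor.cc only needs one prepare_training_data and can hold a Ranker pointer without knowing which concrete ranker it is.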
Hi Jiarong, and welcome.

For future reference (both for you, and for our other GSoC students), it's best not to batch up communications, but to ask individual questions like these as they come up. I can often respond to a short email straight away; it's taken me a while to find time to sit down and respond to this one.

Also, don't forget to update http://trac.xapian.org/wiki/GSoC2014/Learning%20to%20Rank%20Jiarong%20Wei/Journal each day to say how you're getting on: I'm checking it daily but have seen no updates yet. Remember that we can only help you based on what you tell us, and what code you push. Don't be reluctant to push work-in-progress code to github; it's often easier to discuss problems based around some code you've tried making, even if that code doesn't work or is only a sketch of an idea. Try and be present on IRC when you're working; asking questions there as they come up can be helpful.

On 21 May 2014 19:11, Jiarong Wei <vcamx3 at gmail.com> wrote:

> 1. In letor.cc, we have two parts of functions: the training part and the
> ranking part. [...] So if we can generalise the form of the training part,
> we don't need functions like prepare_training_data_for_svm,
> prepare_training_data_for_listwise etc. We just need prepare_training_data
> instead.

In general, I think we need a different training part for each ranker. There may be some similarities in these existing rankers, and inheritance would be a sensible way to avoid duplicating code if so, but we'd like to have a framework which we can extend to a completely different type of ranker in future.

> 2. There is one thing I have to confirm: once we have the trained model
> (like the model file of SVMRanker), we won't train that model again in
> general. (The behaviour of questletor.cc under bin/ confuses me.)

I'm not familiar with the behaviour of questletor, but I suppose it's reasonable to assume that we don't update models after initial creation. It would be nice to be able to do so, but I think many training algorithms aren't updatable. I feel I may be misunderstanding your question here, though. Parth: any comment to add?
> 3. Since RankList will be removed, according to the meeting last week, its
> related information will be stored under MSet::Internal. My plan is to
> create a new class under MSet::Internal. [...] Also, the ranker will use
> MSet instead of RankList. Do you have any suggestions for this part?

This sounds like a reasonable approach, and like something you could implement very soon; it's sufficiently standalone that we could try to get it merged to master on its own.

> 4. For FeatureVector, I think it could be discarded, since it just stores
> the feature vector of each document; that information will be stored in the
> new class under MSet::Internal mentioned in 3.

Sounds right to me.

> 5. For Feature (letor_feature.cc), I think it could be a static class.
> [...] If a user wants to use his own feature, he needs to modify our source
> code instead of just adding his own feature-calculation class and telling
> the Letor module to use it. I just think that's not very convenient for
> extending features.

I can't remember the details of this, but what you're suggesting sounds on the right lines. We certainly want to design for easy extensibility. (I've put a rough sketch of what I understand you to be proposing at the end of this mail.)

> 6. FeatureManager will set the context for feature calculation: setting
> the Database, setting the query, and saying which features we want. [...]

Again, sounds plausible.

> 7. For feature selection, I don't know when to apply this selection. [...]
> Or will the algorithm select a subset of the features we provide and use
> that subset to generate the feature vectors?

I'd expect the feature selection to select a subset of features, but it's also very good for it to be able to return information that a human can check over, to see if it's making plausible decisions.

> 8. Do we have documentation about unit tests? That's also something
> Hanxiao is looking for.

We don't have many unit tests; there is xapian-core/tests/internaltest.cc which runs some tests that could be considered unit tests. Mostly, our tests are what might be considered integration tests (i.e. the apitest). The tests were set up before many of the modern testing conventions became commonplace; it would be interesting to have a wider discussion about how we could make it easier to implement unit tests.

> 9. For automated tests, my idea is to use some data to test the
> functionality of the Letor module. It will also cover different
> configurations, like using different rankers. I think I need some help with
> this part. Can someone give me some advice?

I'm not sure what advice you need; Parth - any ideas here?
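Here's the sketch I mentioned under point 5. The names are purely illustrative -- this isn't a committed design, just my reading of what you're suggesting for user-extensible features:

    // Purely illustrative, not a committed design: a user-extensible
    // feature subclasses an abstract Feature class instead of being
    // another case in one big dispatch function.
    #include <xapian.h>

    class Feature {
      public:
        virtual ~Feature() { }

        // Compute this feature's value for one (query, document) pair.
        // The FeatureManager would pass in whatever context (database
        // statistics, query terms, etc.) features need.
        virtual double compute(const Xapian::Query & query,
                               const Xapian::Document & doc) const = 0;
    };

    // A user's own feature then lives entirely in their code and gets
    // registered with the FeatureManager:
    class MyFeature : public Feature {
      public:
        double compute(const Xapian::Query & query,
                       const Xapian::Document & doc) const {
            // The feature author's own calculation goes here; as a
            // trivial stand-in, use the length of the stored document
            // data.
            (void)query;
            return static_cast<double>(doc.get_data().size());
        }
    };

Whether that's better than a single dispatch function probably depends on how much shared state the built-in features need; it's worth exploring on a branch.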
Hi Jiarong,

>> 1. In letor.cc, we have two parts of functions: the training part and the
>> ranking part. [...] So if we can generalise the form of the training part,
>> we don't need functions like prepare_training_data_for_svm,
>> prepare_training_data_for_listwise etc. We just need prepare_training_data
>> instead.
>
> In general, I think we need a different training part for each ranker.
> There may be some similarities in these existing rankers, and inheritance
> would be a sensible way to avoid duplicating code if so, but we'd like to
> have a framework which we can extend to a completely different type of
> ranker in future.

Ideally, we decided to have only a single method like prepare_training_file, and it would be the responsibility of the Rankers to interpret the data the way they want; for example, pairwise approaches need pairs, and so on. The data format we have decided on is the standard one commonly used in the Letor community. This example is taken from the SVM-rank page (http://www.cs.cornell.edu/people/tj/svm_light/svm_rank.html):

    <line>    .=. <target> qid:<qid> <feature>:<value> <feature>:<value> ... <feature>:<value> # <info>
    <target>  .=. <float>
    <qid>     .=. <positive integer>
    <feature> .=. <positive integer>
    <value>   .=. <float>
    <info>    .=. <string>

So at this moment I would say: please focus on only one method and remove the others. This should also be communicated to Hanxiao.

>> 2. There is one thing I have to confirm: once we have the trained model
>> (like the model file of SVMRanker), we won't train that model again in
>> general. (The behaviour of questletor.cc under bin/ confuses me.)
>
> I'm not familiar with the behaviour of questletor, but I suppose it's
> reasonable to assume that we don't update models after initial creation.
> It would be nice to be able to do so, but I think many training algorithms
> aren't updatable. I feel I may be misunderstanding your question here,
> though. Parth: any comment to add?

questletor is just an example of how the code works. Once the model is trained, you don't need to retrain it unless you really want to. So, to make the intent clearer, you could add a condition in questletor so that it only trains when the model does not exist yet.
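Something like this is all I mean -- just a sketch, and the train()/rank() calls below are placeholders for whatever questletor actually calls, not the real xapian-letor API:

    // Sketch only: train the model just once, when no model file exists
    // yet, then rank with it.
    #include <sys/stat.h>
    #include <string>

    static bool
    model_file_exists(const std::string & path)
    {
        struct stat sb;
        return stat(path.c_str(), &sb) == 0;
    }

    // ... in questletor, after setting up the ranker, query and mset ...
    //
    //   if (!model_file_exists(model_file)) {
    //       ranker.train(training_file, model_file);   // placeholder call
    //   }
    //   ranker.rank(mset, model_file);                  // placeholder call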
>> 3. Since RankList will be removed, according to the meeting last week, its
>> related information will be stored under MSet::Internal. My plan is to
>> create a new class under MSet::Internal. That class will hold two kinds of
>> feature vectors: a normalised one and an unnormalised one. [...] Also, the
>> ranker will use MSet instead of RankList. Do you have any suggestions for
>> this part?
>
> This sounds like a reasonable approach, and like something you could
> implement very soon; it's sufficiently standalone that we could try to get
> it merged to master on its own.

I am not sure you really need to store the normalised feature vector; just a method to normalise should do the job (a rough sketch of the kind of query-level normalisation method I mean is at the end of this mail). We should definitely consult Hanxiao if he reaches a point where he sees that storing a normalised version would help in some way. By the way, which type of normalisation method are you talking about? If you are referring to QueryLevelNorm (http://trac.xapian.org/wiki/GSoC2011/LTR/Notes#QueryLevelNorm) then that is the standard, and your feature vector would already be like that. Do you mean to normalise it further?

>> 4. For FeatureVector, I think it could be discarded, since it just stores
>> the feature vector of each document; that information will be stored in
>> the new class under MSet::Internal mentioned in 3.
>
> Sounds right to me.

Okay, sounds fair, but please also store the additional information such as score and label, as the FeatureVector class currently does.

>> 6. FeatureManager will set the context for feature calculation: setting
>> the Database, setting the query, and saying which features we want. [...]
>
> Again, sounds plausible.

By the way, in the end we decided not to categorise features by type (document-dependent, query-dependent, etc.), but we agreed to give the user the power to select a subset of features, maybe in the form of a list<Integer> or something similar.

>> 7. For feature selection, I don't know when to apply this selection. [...]
>> Or will the algorithm select a subset of the features we provide and use
>> that subset to generate the feature vectors?
>
> I'd expect the feature selection to select a subset of features, but it's
> also very good for it to be able to return information that a human can
> check over, to see if it's making plausible decisions.

Both of the feature selection algorithms mentioned on the Letor project ideas page are subset-selection based. Selection is a one-time step and happens before training. These algorithms give each feature a score indicating how important it is, and the user then selects the top N features based on the heuristics presented in the corresponding paper and the computational power at their disposal.

>> 9. For automated tests, my idea is to use some data to test the
>> functionality of the Letor module. It will also cover different
>> configurations, like using different rankers. I think I need some help
>> with this part. Can someone give me some advice?
>
> I'm not sure what advice you need; Parth - any ideas here?

The tests for xapian-letor would mainly focus on the features and the rankers. So what you can do is use a small test collection with a few documents, and check whether the calculated features are correct, whether the ranking produced by each ranker is acceptable, and so on.
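For the normalisation point above, this is roughly the kind of method I mean. It's a sketch only: it divides each feature by its per-query maximum (assuming non-negative feature values); whether you do that or full min-max scaling, and whether it lives on FeatureManager or on the new class under MSet::Internal, is up to you:

    // Sketch of a query-level normalisation method: for one query's
    // result set, divide each feature value by the maximum value that
    // feature takes across the documents returned for that query.
    // Assumes feature values are non-negative.
    #include <cstddef>
    #include <vector>

    void
    query_level_normalise(std::vector<std::vector<double> > & fvecs)
    {
        if (fvecs.empty()) return;
        size_t num_features = fvecs[0].size();
        for (size_t f = 0; f < num_features; ++f) {
            double max_val = 0.0;
            for (size_t d = 0; d != fvecs.size(); ++d)
                if (fvecs[d][f] > max_val) max_val = fvecs[d][f];
            if (max_val > 0.0) {
                for (size_t d = 0; d != fvecs.size(); ++d)
                    fvecs[d][f] /= max_val;
            }
        }
    }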
Cheers,
Parth.