thr3ads.net - Xapian devel - xapian-letor: FeatureVector discussion [Jun 2016]

Ayush Tomar

2016-Jun-29 12:28 UTC

xapian-letor: FeatureVector discussion

>
>
>
> The approach I was thinking would look something like this:
>
>  * instead of Features, which is really a namespace implemented as a
>    class, we separate out the calculation of the different features
>    into distinct subclasses of Feature, whose only job is to calculate
>    a single feature. Currently the FeatureManager calls these (via
>    FeatureManager::Internal::transform) with the correct arguments,
>    things like document statistics or tf or idf caches. This is
>    analogous to how Weight objects can request various statistics, and
>    the Enquire process then makes them available. So we can do it in a
>    similar way (Feature declares that it needs tf and doclen, for
>    instance, and FeatureManager can make sure they're available to the
>    Feature when it's building a FeatureVector for a given document).

Yes. Features can get their own subdirectory with each Feature subclass
having its own implementation. We can have FeatureManager do all the
feature handling corresponding to a query. FeatureManager can have
vector<Features*> FeatureList, which initialises each Feature sub-class
object mentioned in the FeatureList (supplied at the time of
training/ranking or as a default set).
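
To make the idea concrete, here is a minimal sketch of one-feature-per-subclass, with each feature declaring the statistics it needs. This is only an illustration, not xapian-letor code: `StatsNeeded`, `DocStats`, `TfFeature`, `NormalisedTfFeature` and `collect_stats_needed()` are hypothetical names, and `DocStats` stands in for whatever statistics the managing class would actually supply.

```cpp
#include <memory>
#include <string>
#include <vector>

// Hypothetical bit flags a Feature uses to declare which statistics it needs.
enum StatsNeeded : unsigned { NEED_TF = 1u << 0, NEED_DOCLEN = 1u << 1 };

// Hypothetical per-document statistics prepared by the managing class.
struct DocStats {
    double tf = 0.0;      // summed term frequency of the query terms
    double doclen = 0.0;  // document length
};

// Each subclass computes exactly one feature value.
class Feature {
  public:
    virtual ~Feature() = default;
    virtual std::string name() const = 0;
    virtual unsigned stats_needed() const = 0;
    virtual double get_value(const DocStats& stats) const = 0;
};

class TfFeature : public Feature {
  public:
    std::string name() const override { return "tf"; }
    unsigned stats_needed() const override { return NEED_TF; }
    double get_value(const DocStats& s) const override { return s.tf; }
};

class NormalisedTfFeature : public Feature {
  public:
    std::string name() const override { return "tf/doclen"; }
    unsigned stats_needed() const override { return NEED_TF | NEED_DOCLEN; }
    double get_value(const DocStats& s) const override {
        return s.doclen > 0.0 ? s.tf / s.doclen : 0.0;
    }
};

// The manager ORs together the needs of all configured features, so it
// knows which statistics to gather before building a FeatureVector.
unsigned collect_stats_needed(const std::vector<std::unique_ptr<Feature>>& fs) {
    unsigned needs = 0;
    for (const auto& f : fs) needs |= f->stats_needed();
    return needs;
}
```

Holding the features through owning pointers (rather than by value) is one way to keep the subclasses polymorphic inside a vector; other ownership schemes would work as well.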

At present, letor is mostly centred around RankList (for both training and
ranking), whereas RankList is just a vector of FeatureVectors corresponding
to a qid. Having RankList in ranking has no meaning since qid isn't
required once the training part is over. (letor_rank(*) method in
letor_internal.cc supplies a junk qid to the RankList while performing the
ranking, which points out that the RankList approach isn't quite correct).

Hence, RankList can be completely eliminated and instead we can have
FeatureVector work on top of FeatureManager directly. Am I right?
>
>  * letor itself (during scoring) operates on FeatureVectors,
>    representing Documents, and uses this to rerank an MSet; it does
>    something similar during preparation of its training data. So how
>    the FeatureVector is calculated just needs to be done the same
>    in both situations.
>
Yes. At present, Xapian::RankList create_rank_list(const Xapian::MSet &
mset, std::string & qid, bool train), defined in FeatureManager, does the
job of preparing the FeatureVectors for a query. When preparing the
training file, FeatureVector calculation can be done while parsing the
query and qrel files, independently of RankList. This eliminates the need
to maintain a global qrel store (map<string, map<string, int> > qrel; in
FeatureManager::Internal), and with it the need for the load_relevance(*)
and getlabel(*) functions. The score in FeatureVector is
simply the label, and fvals will be returned by FeatureManager (by using
feature values obtained from each of the Feature sub-class).

While ranking, FeatureVector fvals will be computed similarly by
FeatureManager, while the score gets assigned later at the time of ranking.

>  * when configuring the letor system either for training or for
>    reranking, we construct a FeatureList(*) (which is basically a
>    vector<Feature>), which we can later ask to generate a
>    FeatureVector for a given document. (This splits some of the
>    functionality of FeatureManager, but makes it more clear what each
>    piece does.)
>  * if you just construct a FeatureList, you'll get whatever the
>    defaults should be. If you want to set your own features, you do
>    that at construction time. That can include custom features, which
>    wouldn't be possible under the enum model without editing
>    xapian-letor and rebuilding it, which isn't friendly to
>    developers.
>
>  * Features becomes FeatureList, but with some functionality from
>    FeatureManager. It's responsible for turning a Document into a
>    FeatureVector, for the letor system to operate on.
>
FeatureList can tell the vector<Features*> FeatureList object in
FeatureManager as to what Features sub-classes to initialize. A
vector<double> fval(*) function in FeatureManager can operate over
vector<Features*> FeatureList to return fvals to the FeatureVector. Maybe
your meaning of FeatureList is something different. Can you please explain?

>
>  * Ranker should really be responsible for doing most of the work
>    currently done by Letor. (Preparing training files, training the
>    ranking algorithm &c.)
>
Preparing training file is limited to FeatureVector calculation only. Would
there be a specific reason to include it in ranker?
>
>  * The rest of FeatureManager is really utilities (which can be
>    functions in the Xapian::Letor namespace, or methods on whichever
>    class makes sense). For instance, load_relevance() has nothing to
>    do with features; it's part of the training stage. (It's also on
>    FeatureVector, with effectively the same implementation.)
>
Yes. These functions are not defined in the right places. If we decide to
eliminate RankList, most of these methods will have no meaning and
can therefore be removed.
>
>  * RankList is mostly a list of FeatureVectors, ie it's close to the
>    thing we care about at the end. The final output we want is
>    actually a ranked list of Documents, but this is almost the same
>    thing.
>
> (*) FeatureList isn't a great name, but I didn't want to adopt
> anything too close to what we have, as it'd be confusing.
>
As I have asked above, I think there is a misunderstanding between my
interpretation of FeatureList and yours. Please correct me where I am wrong.
>
> We shouldn't need serialisation of FeatureList or Feature, because
> this stuff doesn't have to persist, just be consistent, which is an
> issue very similar to getting prefixes right. Either a higher-level
> application has configuration, or it's in shared code between the
> indexing/training and querying/reranking parts of the system.
>
I understand this now. It's the user's job to make it consistent. Thanks for
clarifying this.

> J
> --
>   James Aylett, occasional trouble-maker
>   xapian.org
>
>

-- 
----------------------------------------------------------------------------
Kind Regards,
Ayush Tomar | My Webpage <http://ayshtmr.xyz> | LinkedIn
<https://in.linkedin.com/in/ayushtomar>

James Aylett

2016-Jun-29 18:47 UTC

xapian-letor: FeatureVector discussion

On Wed, Jun 29, 2016 at 05:58:17PM +0530, Ayush Tomar wrote:
> At present, letor is mostly centred around RankList (for both training and
> ranking), whereas RankList is just a vector of FeatureVectors corresponding
> to a qid. Having RankList in ranking has no meaning since qid isn't
> required once the training part is over. (letor_rank(*) method in
> letor_internal.cc supplies a junk qid to the RankList while performing the
> ranking, which points out that the RankList approach isn't quite
> correct).
Ah, I'd missed or forgotten that last detail. Getting rid of RankList
for the output is therefore probably a good idea; in that case, we
should return another MSet. (Or something that looks and behaves like
one.)
> Hence, RankList can be completely eliminated and instead we can have
> FeatureVector work on top of FeatureManager directly. Am I right?
I think we'll end up with FeatureVector, Feature (and its subclasses),
and possibly one other class which might be called FeatureManager but
which would be different to the current one in its
responsibilities. (This is why I gave it a different name of
FeatureList for the time being. That's probably less confusing than
calling it 'Jeff' ;-)
> The score in FeatureVector is simply the label, and fvals will be
> returned by FeatureManager (by using feature values obtained from
> each of the Feature sub-class).
Again, this is FeatureList not FeatureManager. (The thing that makes
fvals, except that it'll actually just make a FeatureVector for that
Document(*). During preparation this doesn't matter, but a more direct
connection during re-ranking of the MSet should make it easier to
return something like an MSet, with the same ease of access to the
Document object again.)

(*) in the context of the relevant Query
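
As a rough illustration of why keeping the document reference next to the feature values helps, here is a sketch of re-ranking that returns items still carrying their document id. Everything here is an assumption made for the example: `RankedItem`, `linear_score()` and `rerank()` are hypothetical names, and the linear scorer merely stands in for whatever trained ranker is in use.

```cpp
#include <algorithm>
#include <cstddef>
#include <vector>

// Hypothetical result item: the document id travels with the feature
// values, so the caller can get back to the document after re-ranking.
struct RankedItem {
    unsigned docid;
    std::vector<double> fvals;
    double score;
};

// Stand-in ranker: score = dot(weights, fvals).
double linear_score(const std::vector<double>& weights,
                    const std::vector<double>& fvals) {
    double s = 0.0;
    std::size_t n = std::min(weights.size(), fvals.size());
    for (std::size_t i = 0; i < n; ++i) s += weights[i] * fvals[i];
    return s;
}

// Score every item, then order by descending score. stable_sort keeps the
// original (first-pass) order for items whose scores tie.
std::vector<RankedItem> rerank(std::vector<RankedItem> items,
                               const std::vector<double>& weights) {
    for (auto& item : items) item.score = linear_score(weights, item.fvals);
    std::stable_sort(items.begin(), items.end(),
                     [](const RankedItem& a, const RankedItem& b) {
                         return a.score > b.score;
                     });
    return items;
}
```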
> >  * Features becomes FeatureList, but with some functionality from
> >    FeatureManager. It's responsible for turning a Document into a
> >    FeatureVector, for the letor system to operate on.
> 
> FeatureList can tell the vector<Features*> FeatureList object in
> FeatureManager as to what Features sub-classes to initialize.
Or it can just have a vector<Feature> (or Feature&) that it uses
directly; the FeatureList constructor will either initialise this with
a default set of Feature objects, or take an iterator over them or
something (we could have `add_feature(Feature&)` too). (Note that
`Features`, with an 's', is a utility namespace at the moment which we
should try to get rid of.)
> A vector<double> fval(*) function in FeatureManager can operate over
> vector<Features*> FeatureList to return fvals to the
> FeatureVector. Maybe your meaning of FeatureList is something
> different. Can you please explain?
I was thinking more like:

Document doc; // we have one of these already
FeatureList flist = FeatureList(); // default Feature choice
FeatureVector fvec = flist.create_fvec(doc);

So we don't make a FeatureVector and then poke things into it, we just
return one that represents a particular Document. The responsibility
for making a FeatureVector out of a Document lies with the thing that
knows which Feature objects to use, and in which order, which is the
FeatureList.
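
The three-line usage above could be backed by something like the following sketch. It is only one possible shape, under stated assumptions: `Doc` is a hypothetical stand-in for Xapian::Document plus the query context, each feature is modelled as a plain `std::function` for brevity rather than as a class hierarchy, and the two default features are arbitrary placeholders.

```cpp
#include <functional>
#include <utility>
#include <vector>

// Hypothetical stand-in for a document in the context of a query.
struct Doc {
    double tf;
    double doclen;
};

// For brevity, one feature = one function from a document to a value.
using Feature = std::function<double(const Doc&)>;

struct FeatureVector {
    std::vector<double> fvals;  // feature values, in FeatureList order
    double label = 0.0;         // relevance label, set when training
    double score = 0.0;         // assigned later by the ranker
};

class FeatureList {
    std::vector<Feature> features;

  public:
    // Default construction installs a default feature set (placeholders).
    FeatureList() {
        features.push_back([](const Doc& d) { return d.tf; });
        features.push_back([](const Doc& d) { return d.doclen; });
    }

    // Caller-supplied features, including custom ones, replace the defaults.
    explicit FeatureList(std::vector<Feature> fs) : features(std::move(fs)) {}

    void add_feature(Feature f) { features.push_back(std::move(f)); }

    // Build a complete FeatureVector for one document; callers never poke
    // individual values into it.
    FeatureVector create_fvec(const Doc& doc) const {
        FeatureVector fvec;
        fvec.fvals.reserve(features.size());
        for (const auto& f : features) fvec.fvals.push_back(f(doc));
        return fvec;
    }
};
```

Because the list fixes both the choice and the order of features, using the same FeatureList configuration for training and for re-ranking is what keeps the two consistent.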
> >  * Ranker should really be responsible for doing most of the work
> >    currently done by Letor. (Preparing training files, training the
> >    ranking algorithm &c.)
> 
> Preparing training file is limited to FeatureVector calculation
> only. Would there be a specific reason to include it in ranker?
It feels to me slightly closer to that than anything
else. Alternatively, it could just live in the Xapian::Letor namespace
as a utility function.
> > We shouldn't need serialisation of FeatureList or Feature, because
> > this stuff doesn't have to persist, just be consistent, which is an
> > issue very similar to getting prefixes right. Either a higher-level
> > application has configuration, or it's in shared code between the
> > indexing/training and querying/reranking parts of the system.
> 
> I understand this now. It's the user's job to make it
> consistent. Thanks for clarifying this.
Note that if an application or library wants to serialise this
configuration into the Xapian database, it can use Database metadata
so it'll be carried around with the rest of the db. (Of course,
chances are the trained data file for the SVM or whatever won't be
stored in the same way, unless you also stuff them into metadata. The
wisdom of doing this I don't know; Olly may have an opinion.)
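
For an application that does want to carry its feature configuration in the database, the metadata value is just a string, so the feature names need flattening into one. The sketch below shows one simple encoding (one name per line, in order); the helper names are hypothetical and it assumes feature names never contain newlines. With Xapian the resulting string could then be passed to WritableDatabase::set_metadata() and read back with Database::get_metadata(), under a key of the application's choosing.

```cpp
#include <sstream>
#include <string>
#include <vector>

// Join feature names into a single value, one name per line, preserving
// order (the order defines the feature ids).
std::string serialise_feature_names(const std::vector<std::string>& names) {
    std::string out;
    for (const auto& n : names) {
        out += n;
        out += '\n';
    }
    return out;
}

// Split the stored value back into the ordered list of names.
std::vector<std::string> unserialise_feature_names(const std::string& value) {
    std::vector<std::string> names;
    std::istringstream in(value);
    std::string line;
    while (std::getline(in, line)) {
        if (!line.empty()) names.push_back(line);
    }
    return names;
}
```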

J

-- 
  James Aylett, occasional trouble-maker
  xapian.org

Parth Gupta

2016-Jul-02 07:34 UTC

xapian-letor: FeatureVector discussion

Making Feature an abstract class is indeed a better approach than
serialising features using an enum. Now I get it better. All the other bits
look fine to me. One way to handle serialising the features for the output
files is to use two separate files: i) the training file, in the standard
letor format with feature ids starting from 1 and increasing; and ii) a
feature properties file mapping those ids to the actual Feature
sub-classes. This might be less resource-intensive, and involve fewer
dependencies, than going to the database.

Cheers
Parth
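
A minimal sketch of the two-file idea, assuming the usual `<label> qid:<qid> <id>:<value> ...` line layout for the training file. The `<id>=<name>` layout of the properties file and both function names are made up for this example; nothing here is an existing xapian-letor format.

```cpp
#include <cstddef>
#include <sstream>
#include <string>
#include <vector>

// One line of the training file: "<label> qid:<qid> 1:<v1> 2:<v2> ...".
std::string training_line(int label, const std::string& qid,
                          const std::vector<double>& fvals) {
    std::ostringstream out;
    out << label << " qid:" << qid;
    for (std::size_t i = 0; i < fvals.size(); ++i)
        out << ' ' << (i + 1) << ':' << fvals[i];  // feature ids start at 1
    return out.str();
}

// Properties file: one "<id>=<feature name>" entry per line, using the
// same 1-based ids as the training file.
std::string properties_file(const std::vector<std::string>& names) {
    std::ostringstream out;
    for (std::size_t i = 0; i < names.size(); ++i)
        out << (i + 1) << '=' << names[i] << '\n';
    return out.str();
}
```

The training file stays readable by any tool that expects the standard format, while the properties file records which feature each numeric id referred to at training time.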
