> The approach I was thinking would look something like this:
>
> * instead of Features, which is really a namespace implemented as a
>   class, we separate out the calculation of the different features
>   into distinct subclasses of Feature, whose only job is to calculate
>   a single feature. Currently the FeatureManager calls these (via
>   FeatureManager::Internal::transform) with the correct arguments,
>   things like document statistics or tf or idf caches. This is
>   analogous to how Weight objects can request various statistics, and
>   the Enquire process then makes them available. So we can do it in a
>   similar way (Feature declares that it needs tf and doclen, for
>   instance, and FeatureManager can make sure they're available to the
>   Feature when it's building a FeatureVector for a given document).

Yes. Features can get their own subdirectory, with each Feature subclass
having its own implementation. We can have FeatureManager do all the
feature handling corresponding to a query. FeatureManager can hold a
vector<Features*> FeatureList and initialise each Feature subclass
object mentioned in it (supplied at the time of training/ranking, or a
default set).

At present, letor is mostly centred around RankList (for both training and
ranking), whereas RankList is just a vector of FeatureVectors corresponding
to a qid. Having RankList in ranking has no meaning since qid isn't
required once the training part is over. (letor_rank(*) method in
letor_internal.cc supplies a junk qid to the RankList while performing the
ranking, which points out that the RankList approach isn't quite correct).

Hence, RankList can be completely eliminated and instead we can have
FeatureVector work on top of FeatureManager directly. Am I right?

> * letor itself (during scoring) operates on FeatureVectors,
>   representing Documents, and uses this to rerank an MSet; it does
>   something similar during preparation of its training data.
>   So how the FeatureVector is calculated just needs to be done the
>   same in both situations.

Yes. At present,
Xapian::RankList create_rank_list(const Xapian::MSet & mset, std::string & qid, bool train)
defined in FeatureManager does the job of preparing the FeatureVectors
for a query. When preparing the training file, FeatureVector calculation
can be done while parsing the query and qrel files, independently of
RankList. This removes the need to maintain a global qrel store
(map<string, map<string, int> > qrel; in FeatureManager::Internal), and
with it the need for the load_relevance(*) and getlabel(*) functions.

The score in FeatureVector is simply the label, and fvals will be
returned by FeatureManager (by using feature values obtained from
each of the Feature sub-class). While ranking, FeatureVector fvals will
be computed in the same way by FeatureManager, and the score gets
assigned later, at the time of ranking.

> * when configuring the letor system either for training or for
>   reranking, we construct a FeatureList(*) (which is basically a
>   vector<Feature>), which we can later ask to generate a
>   FeatureVector for a given document. (This splits some of the
>   functionality of FeatureManager, but makes it more clear what each
>   piece does.)
>
> * if you just construct a FeatureList, you'll get whatever the
>   defaults should be. If you want to set your own features, you do
>   that at construction time. That can include custom features, which
>   wouldn't be possible under the enum model without editing
>   xapian-letor and rebuilding it, which isn't friendly to
>   developers.
>
> * Features becomes FeatureList, but with some functionality from
>   FeatureManager. It's responsible for turning a Document into a
>   FeatureVector, for the letor system to operate on.

FeatureList can tell the vector<Features*> FeatureList object in
FeatureManager as to what Features sub-classes to initialize.
A vector<double> fval(*) function in FeatureManager can operate over
vector<Features*> FeatureList to return fvals to the
FeatureVector. Maybe your meaning of FeatureList is something
different. Can you please explain?

> * Ranker should really be responsible for doing most of the work
>   currently done by Letor. (Preparing training files, training the
>   ranking algorithm &c.)

Preparing the training file is limited to FeatureVector calculation
only. Would there be a specific reason to include it in Ranker?

> * The rest of FeatureManager is really utilities (which can be
>   functions in the Xapian::Letor namespace, or methods on whichever
>   class makes sense). For instance, load_relevance() has nothing to
>   do with features; it's part of the training stage. (It's also on
>   FeatureVector, with effectively the same implementation.)

Yes. These functions are not defined in the right places. If we decide
to eliminate RankList, most of these methods will have no meaning and
will therefore be removed.

> * RankList is mostly a list of FeatureVectors, ie it's close to the
>   thing we care about at the end. The final output we want is
>   actually a ranked list of Documents, but this is almost the same
>   thing.
>
> (*) FeatureList isn't a great name, but I didn't want to adopt
> anything too close to what we have, as it'd be confusing.

As I asked above, I think my interpretation of FeatureList differs from
yours. Please correct me where I am wrong.

> We shouldn't need serialisation of FeatureList or Feature, because
> this stuff doesn't have to persist, just be consistent, which is an
> issue very similar to getting prefixes right. Either a higher-level
> application has configuration, or it's in shared code between the
> indexing/training and querying/reranking parts of the system.

I understand this now. It's the user's job to make it consistent.
Thanks for clarifying this.
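
The "Feature declares that it needs tf and doclen" idea quoted above could be sketched roughly as follows. This is only an illustration under assumptions: `Stat`, `DocStats`, `needed_stats()`, `get_value()`, `NormalisedTfFeature` and `stats_required()` are hypothetical names invented here, not the existing xapian-letor API.

```cpp
#include <cassert>
#include <map>
#include <memory>
#include <string>
#include <vector>

// Hypothetical flags for the statistics a Feature may ask for.
enum Stat : unsigned { TERM_FREQUENCY = 1, DOC_LENGTH = 2, INVERSE_DOC_FREQUENCY = 4 };

// Hypothetical bundle of per-document statistics supplied by the caller.
struct DocStats {
    std::map<std::string, double> tf;  // term -> frequency within the document
    double doclen = 0.0;               // document length
};

// Each subclass computes exactly one feature value.
class Feature {
  public:
    virtual ~Feature() {}
    // Bitmask of Stat values this feature needs before it can be computed.
    virtual unsigned needed_stats() const = 0;
    virtual double get_value(const std::vector<std::string>& query_terms,
                             const DocStats& stats) const = 0;
};

// Example feature: sum of length-normalised term frequencies of the query terms.
class NormalisedTfFeature : public Feature {
  public:
    unsigned needed_stats() const override { return TERM_FREQUENCY | DOC_LENGTH; }

    double get_value(const std::vector<std::string>& query_terms,
                     const DocStats& stats) const override {
        if (stats.doclen <= 0.0) return 0.0;
        double sum = 0.0;
        for (const std::string& term : query_terms) {
            auto it = stats.tf.find(term);
            if (it != stats.tf.end()) sum += it->second / stats.doclen;
        }
        return sum;
    }
};

// Union of the statistics required by a set of features, so the caller
// only gathers what is actually needed before asking for feature values.
unsigned stats_required(const std::vector<std::unique_ptr<Feature>>& features) {
    unsigned mask = 0;
    for (const auto& f : features) mask |= f->needed_stats();
    return mask;
}
```

Whatever ends up owning the features would call something like `stats_required()` once, gather only those statistics, and then ask each Feature for its value.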
> J
>
> --
> James Aylett, occasional trouble-maker
> xapian.org

--
Kind Regards,
Ayush Tomar | My Webpage <http://ayshtmr.xyz> | LinkedIn <https://in.linkedin.com/in/ayushtomar>
On Wed, Jun 29, 2016 at 05:58:17PM +0530, Ayush Tomar wrote:

> At present, letor is mostly centred around RankList (for both training and
> ranking), whereas RankList is just a vector of FeatureVectors corresponding
> to a qid. Having RankList in ranking has no meaning since qid isn't
> required once the training part is over. (letor_rank(*) method in
> letor_internal.cc supplies a junk qid to the RankList while performing the
> ranking, which points out that the RankList approach isn't quite correct).

Ah, I'd missed or forgotten that last detail. Getting rid of RankList
for the output is therefore probably a good idea; in that case, we
should return another MSet. (Or something that looks and behaves like
one.)

> Hence, RankList can be completely eliminated and instead we can have
> FeatureVector work on top of FeatureManager directly. Am I right?

I think we'll end up with FeatureVector, Feature (and its subclasses),
and possibly one other class which might be called FeatureManager but
which would be different to the current one in its
responsibilities. (This is why I gave it a different name of
FeatureList for the time being. That's probably less confusing than
calling it 'Jeff' ;-)

> The score in FeatureVector is simply the label, and fvals will be
> returned by FeatureManager (by using feature values obtained from
> each of the Feature sub-class).

Again, this is FeatureList not FeatureManager. (The thing that makes
fvals, except that it'll actually just make a FeatureVector for that
Document(*). During preparation this doesn't matter, but a more direct
connection during re-ranking of the MSet should make it easier to
return something like an MSet, with the same ease of access to the
Document object again.)

(*) in the context of the relevant Query

> > * Features becomes FeatureList, but with some functionality from
> >   FeatureManager. It's responsible for turning a Document into a
> >   FeatureVector, for the letor system to operate on.
>
> FeatureList can tell the vector<Features*> FeatureList object in
> FeatureManager as to what Features sub-classes to initialize.

Or it can just have a vector<Feature> (or Feature&) that it uses
directly; the FeatureList constructor will either initialise this with
a default set of Feature objects, or take an iterator over them or
something (we could have `add_feature(Feature&)` too). (Note that
`Features`, with an 's', is a utility namespace at the moment which we
should try to get rid of.)

> A vector<double> fval(*) function in FeatureManager can operate over
> vector<Features*> FeatureList to return fvals to the
> FeatureVector. Maybe your meaning of FeatureList is something
> different. Can you please explain?

I was thinking more like:

    Document doc; // we have one of these already
    FeatureList flist = FeatureList(); // default Feature choice
    FeatureVector fvec = flist.create_fvec(doc);

So we don't make a FeatureVector and then poke things into it, we just
return one that represents a particular Document. The responsibility
for making a FeatureVector out of a Document is the thing that knows
which Feature objects to use, in which order, which is the FeatureList.

> > * Ranker should really be responsible for doing most of the work
> >   currently done by Letor. (Preparing training files, training the
> >   ranking algorithm &c.)
>
> Preparing training file is limited to FeatureVector calculation
> only. Would there be a specific reason to include it in ranker?

It feels to me slightly closer to that than anything
else. Alternatively, it could just live in the Xapian::Letor namespace
as a utility function.

> > We shouldn't need serialisation of FeatureList or Feature, because
> > this stuff doesn't have to persist, just be consistent, which is an
> > issue very similar to getting prefixes right. Either a higher-level
> > application has configuration, or it's in shared code between the
> > indexing/training and querying/reranking parts of the system.
>
> I understand this now. It's the user's job to make it
> consistent. Thanks for clarifying this.

Note that if an application or library wants to serialise this
configuration into the Xapian database, it can use Database metadata
so it'll be carried around with the rest of the db. (Of course,
chances are the trained data file for the SVM or whatever won't be
stored in the same way, unless you also stuff them into metadata. The
wisdom of doing this I don't know; Olly may have an opinion.)

J

--
James Aylett, occasional trouble-maker
xapian.org
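
To make the FeatureList idea from the message above a little more concrete, here is a minimal self-contained sketch. It is not the xapian-letor implementation: `Doc` is a hypothetical stand-in for a Xapian Document in the context of a query, `LengthFeature` and `WeightFeature` are made-up example features, and holding the features through `std::shared_ptr` is an assumption of this sketch (an abstract `Feature` cannot be stored by value in a `std::vector`).

```cpp
#include <cassert>
#include <memory>
#include <string>
#include <utility>
#include <vector>

// Hypothetical stand-in for a Document in the context of a query.
struct Doc {
    std::string text;
    double weight;  // e.g. the weight the document had in the original MSet
};

struct FeatureVector {
    std::vector<double> fvals;
    double score = 0.0;  // the label when training; assigned by the ranker otherwise
};

// Each subclass calculates a single feature value.
class Feature {
  public:
    virtual ~Feature() {}
    virtual double get_value(const Doc& doc) const = 0;
};

// Made-up example features, purely for illustration.
class LengthFeature : public Feature {
  public:
    double get_value(const Doc& doc) const override {
        return static_cast<double>(doc.text.size());
    }
};

class WeightFeature : public Feature {
  public:
    double get_value(const Doc& doc) const override { return doc.weight; }
};

// Knows which Feature objects to use, and in which order.
class FeatureList {
    std::vector<std::shared_ptr<Feature>> features;

  public:
    // Default construction gives the default feature choice.
    FeatureList() {
        features.push_back(std::make_shared<LengthFeature>());
        features.push_back(std::make_shared<WeightFeature>());
    }

    // Custom features are chosen at construction time...
    explicit FeatureList(std::vector<std::shared_ptr<Feature>> custom)
        : features(std::move(custom)) {}

    // ...or appended afterwards.
    void add_feature(std::shared_ptr<Feature> f) { features.push_back(std::move(f)); }

    // Returns a FeatureVector representing one document; the score is left
    // at its default for the caller (training code or ranker) to fill in.
    FeatureVector create_fvec(const Doc& doc) const {
        FeatureVector fvec;
        fvec.fvals.reserve(features.size());
        for (const auto& f : features) fvec.fvals.push_back(f->get_value(doc));
        return fvec;
    }
};
```

Because the order of the feature values follows the order of the features in the list, using the same FeatureList configuration for training and for reranking is what keeps the two consistent.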
Making Feature an abstract class is indeed a better approach than
serialising features using an enum. Now I get it better. All other bits
look fine to me.

One way to handle serialising the features into output files is to
use two separate files:

 i) the training file, which is in the standard letor format with ids
    starting from 1 and increasing; and
 ii) a feature properties file mapping those ids to the actual Feature
     sub-classes.

This might be less resource-intensive, and need fewer dependencies,
than going to databases.

Cheers
Parth

On Thu, Jun 30, 2016 at 12:17 AM, James Aylett <james-xapian at tartarus.org> wrote:
> [...]
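
The two-file idea above could look roughly like this. It is a sketch under assumptions: the training line follows the common `<label> qid:<qid> <id>:<value> ...` letor layout, while the `<id> <name>` layout of the properties file and the helper names `training_line()` and `feature_properties()` are invented here for illustration.

```cpp
#include <cassert>
#include <cstddef>
#include <sstream>
#include <string>
#include <vector>

// Builds one training-file line: "<label> qid:<qid> 1:<v1> 2:<v2> ...".
// Feature ids are just positions in the feature vector, starting from 1.
std::string training_line(int label, const std::string& qid,
                          const std::vector<double>& fvals) {
    std::ostringstream out;
    out << label << " qid:" << qid;
    for (std::size_t i = 0; i < fvals.size(); ++i)
        out << ' ' << (i + 1) << ':' << fvals[i];
    return out.str();
}

// Builds the contents of a feature properties file: one "<id> <name>"
// line per feature, using the same ids as the training file.
// (Hypothetical layout; the names would identify Feature subclasses.)
std::string feature_properties(const std::vector<std::string>& feature_names) {
    std::ostringstream out;
    for (std::size_t i = 0; i < feature_names.size(); ++i)
        out << (i + 1) << ' ' << feature_names[i] << '\n';
    return out.str();
}
```

Keeping the id-to-name mapping in a separate file means the training file stays in a form existing letor tools can read, while the reranking side can still check that it is building its features in the same order.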