Parth Gupta
2011-Jun-07 06:54 UTC
[Xapian-devel] Introduction and Discussion for Learning to Rank Framework
Hello All,

Under this year's GSoC project we are working on a "Learning to Rank" weighting scheme. It involves machine learning and is a supervised ranking scheme, unlike unsupervised schemes such as BM25. This mail is intended to discuss the framework for Learning to Rank in Xapian as a whole. I have thought of the following framework; please share your insights or any issues with it. It is also on the wiki for reference: <http://trac.xapian.org/wiki/GSoC2011/LTR/LTRFramework>.

First I will describe the structure and tentative methods of the Xapian::Letor class, and then explain the whole flow of the 'Letor' ranking; rough sketches of both are included further down.

Structure of the Xapian::Letor class methods:

The following five methods prepare the statistical information needed to generate the values of the desired features:

  tf()
  idf()
  doc_len()
  coll_tf()
  coll_len()

The following six methods calculate the individual feature values with the help of the statistical information above. The char argument ch tells the method which part of the document the feature value should be calculated for: ch = 't' means title only, 'b' body only, and 'w' the whole document.

  calculate_f1(... , char ch)
  calculate_f2(... , char ch)
  calculate_f3(... , char ch)
  calculate_f4(... , char ch)
  calculate_f5(... , char ch)
  calculate_f6(... , char ch)

The following methods are more general and use all of the above to form the bridge between Xapian and the machine learning side:

void prepare_training_file() - Prepares the training file which the machine learning (ML) algorithm - in our case SVM (Support Vector Machine) - uses to build the model. This method will be public so that users can prepare their own training file suited to their application by supplying the necessary inputs: the index (dataset/corpus), a query file and the corresponding relevance judgements file. It will generate the training file provided the input files are in a standard format, which will be made public. We will also ship a training file and model file with the standard distribution which are general enough for direct use.

void learn_model() - Trains the SVM model from the training file and writes the model to a model file, which can then be used to rank documents.

double learn_score() - Calculates the score of a document from its generated feature vector using the model file.

Flow of the program:

We will provide a quest-like utility, say questletor, which obtains the initial ranklist as an MSet. For each document we then compute the feature values. This calculation could be placed inside learn_score(), or there could be an independent method such as make_feature_vector() whose output is passed to learn_score(). learn_score() returns the Letor score of the document, and a new ranklist based on the Letor scores is returned. Methods like prepare_training_file() and learn_model() only come into the picture when a user wishes to use their own data or wants a different kind of domain-dependent learning.

I would like to mention that the training file and model file distributed with the standard release are quite general for this purpose, because the data used is Wikipedia, and the training file will be normalised at query level so that the different kinds of queries (general queries, specific queries, etc.) are treated alike. Also, only documents for which relevance judgements are available are included in the training file.
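To make the class structure more concrete, here is a rough declaration sketch of how Xapian::Letor could look. The method names are the ones listed above, but every parameter and return type is just a placeholder (the '...' argument lists are still to be decided), so please treat this as illustration rather than a final API:

    // Declaration sketch only: the method names follow the list above, but
    // every parameter and return type is a placeholder - the '...'
    // argument lists are still to be decided.
    #include <xapian.h>
    #include <map>
    #include <string>

    namespace Xapian {

    class Letor {
      public:
        // Statistical building blocks for the features.
        double tf(const std::string & term, const Xapian::Document & doc);
        double idf(const std::string & term);
        double doc_len(const Xapian::Document & doc);
        double coll_tf(const std::string & term);
        double coll_len();

        // Feature values; ch selects the document part:
        // 't' = title only, 'b' = body only, 'w' = whole document.
        double calculate_f1(const Xapian::Query & query,
                            const Xapian::Document & doc, char ch);
        double calculate_f2(const Xapian::Query & query,
                            const Xapian::Document & doc, char ch);
        // ... likewise calculate_f3() to calculate_f6(); new calculate_fN()
        // methods for extra features would be added alongside these.

        // Optional helper discussed under "Flow of the program".
        std::map<int, double> make_feature_vector(const Xapian::Query & query,
                                                  const Xapian::Document & doc);

        // Bridge between Xapian and the ML tool (SVM in our case).
        void prepare_training_file(const std::string & query_file,
                                   const std::string & qrel_file,
                                   const std::string & training_file);
        void learn_model();
        double learn_score(const std::map<int, double> & feature_vector);
    };

    }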
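A minimal sketch of the questletor flow, assuming the placeholder signatures from the interface sketch above and the optional make_feature_vector() helper:

    // Sketch of the questletor re-ranking flow; only the standard Xapian
    // calls (Database, QueryParser, Enquire, MSet) are real API here, the
    // Letor calls are the placeholders from the interface sketch.
    #include <xapian.h>
    #include <functional>
    #include <iostream>
    #include <map>
    #include <utility>

    int main(int argc, char ** argv) {
        if (argc < 3) return 1;

        Xapian::Database db(argv[1]);
        Xapian::QueryParser qp;
        qp.set_database(db);
        Xapian::Query query = qp.parse_query(argv[2]);

        Xapian::Enquire enquire(db);
        enquire.set_query(query);

        // 1. Initial ranklist from the existing weighting scheme (e.g. BM25).
        Xapian::MSet mset = enquire.get_mset(0, 100);

        Xapian::Letor letor;   // would load the shipped model file
        // New ranklist keyed by Letor score, highest first.
        std::multimap<double, Xapian::docid, std::greater<double> > reranked;

        for (Xapian::MSetIterator i = mset.begin(); i != mset.end(); ++i) {
            // 2. Build the feature vector for this (query, document) pair.
            std::map<int, double> fvec =
                letor.make_feature_vector(query, i.get_document());
            // 3. Score it with the learned SVM model.
            reranked.insert(std::make_pair(letor.learn_score(fvec), *i));
        }

        // 4. 'reranked' is the new ranklist ordered by Letor score.
        std::multimap<double, Xapian::docid, std::greater<double> >::const_iterator j;
        for (j = reranked.begin(); j != reranked.end(); ++j)
            std::cout << j->second << "\t" << j->first << "\n";
        return 0;
    }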
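For reference, by a "standard" training file format I mean the SVM-light / LETOR style layout which most Letor tools accept. The exact feature numbering is still open and the values below are made up purely for illustration:

    <relevance> qid:<query id> 1:<feature 1 value> 2:<feature 2 value> ... n:<feature n value> # <document id>

    e.g.
    1 qid:20103 1:0.130742 2:0.000000 3:0.333333 ... # 19243417
    0 qid:20103 1:0.593640 2:1.000000 3:0.000000 ... # 3647584

The relevance label on each line comes from the relevance judgements file, which is also why only judged documents can appear in the training file.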
Extendibility of the structure:

This framework can be extended in two ways: with respect to the features, and with respect to the machine learning (ML) algorithm. For the features, new methods in the style of calculate_f1() can be written and called from the place where the final feature vector is assembled. ML algorithms require the training vectors and test vectors to have the same dimension, so if we want to add or remove features a new training file needs to be generated; this is very straightforward because it only means adding or dropping calls to the corresponding calculate_fN() methods. The training file format we have selected is very standard and common to most of the available Letor tools, and if a tool demands the data in a different format we can write out the training file in each suitable format.

Regards,
Parth Gupta
http://sites.google.com/site/parthg88