Hello everyone,

I am a second-year graduate student at IIIT-Bangalore, and my interest is in the field of Information Retrieval. I have successfully compiled Xapian from source and have implemented some examples. While going through the project list, I found that the Weighting Schemes project is the one I would like to contribute to. I went through xapian-core/weight, where most of the schemes are already present, and I also went through the Bigram model, which is outside the tree and not merged yet.

Could anyone please give me a pointer to which weighting schemes are not implemented yet, so that I can start looking at them?

Regards,
Prachi Prakash
Final year Graduate Student
LinkedIn: https://www.linkedin.com/in/prachi-prakash-7b674351/
github: https://github.com/PrachiPrakash?tab=activity
On Sun, Mar 05, 2017 at 08:41:46PM +0530, prachi prakash wrote:
> Hello Everyone,

Hi prachi,

> So can Anyone of please give a pointer to which weighting schemes are not
> implemented yet so that I can start looking at it.

We don't have a list of unimplemented schemes - we're expecting students interested in adding new weighting schemes to do a bit of reading of the academic literature to find something that interests them.

The only real requirements are that it needs to be feasible to implement within the existing structure and to make a suitable scope of project for 3 months of full-time work (given your skills and experience).

Cheers,
Olly
Hi Olly,

Thanks for the quick reply. I looked more deeply into the tf-idf implementation and found that the following document length normalizations are not implemented [1]:

1) Cosine normalization
2) Sum of weights normalization
3) Fourth normalization
4) Max weight normalization

Since each normalization factor is a constant at the document level, the factors above should be stored in the backend (index) for each combination of wdf and idf weighting schemes that are already implemented.

Furthermore, I was thinking that multiplying in the document normalization factor while weighting each term could be redundant. Could we have an abstract function like get_mulextra in the Weight class that returns a term-independent document normalization factor? That factor could then be multiplied by the document's weight for the query to get the final weight (rank) of the document for that query.

Please let me know whether I am thinking in the right direction.

References:
[1] Nicola Polettini. The Vector Space Model in Information Retrieval - Term Weighting Problem. January 2004.

Regards,
Prachi Prakash
Final year Graduate Student
LinkedIn: https://www.linkedin.com/in/prachi-prakash-7b674351/
github: https://github.com/PrachiPrakash?tab=activity
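For readers following the thread, the four normalizations listed in the message above can be sketched in a few lines of Python. This is only an illustration and not Xapian code: the function name `normalization_factors` is made up, and the formulas are the usual SMART-style definitions (1/sqrt(sum of w^2), 1/sum of w, 1/sum of w^4, 1/max w), which are assumed here to match those in the cited reference. The weights are assumed to be positive.

```python
import math


def normalization_factors(weights):
    """Return the four normalization factors for one document.

    `weights` is a non-empty list of positive term weights for the
    document. Which terms contribute is left open here; that choice
    is an assumption outside this sketch.
    """
    if not weights:
        raise ValueError("at least one term weight is required")
    return {
        # Cosine: 1 / sqrt(sum of squared weights)
        "cosine": 1.0 / math.sqrt(sum(w * w for w in weights)),
        # Sum of weights: 1 / (sum of weights)
        "sum": 1.0 / sum(weights),
        # Fourth: 1 / (sum of weights raised to the fourth power)
        "fourth": 1.0 / sum(w ** 4 for w in weights),
        # Max weight: 1 / (largest weight)
        "max": 1.0 / max(weights),
    }


# Toy example with two made-up term weights.
factors = normalization_factors([3.0, 4.0])
```

With the toy weights 3 and 4, the cosine factor is 1/5 and the max weight factor is 1/4.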
On Mon, Mar 13, 2017 at 03:30:44AM +0530, prachi prakash wrote:
> Thanks for an early reply. I looked a bit deep into the tf-idf
> implementation and found that the following document length normalizations
> are not implemented [1].
>
> 1) Cosine normalization
> 2)Sum of weights normalization
> 3) Fourth Normalization
> 4) Max weight normalization

There's also the "pivoted unique" normalisation, as linked from the project idea resources list.

> All the normalization factor being a constant at the document level, for
> each combination of wdf and idf weighting scheme (that are already
> implemented) the above document normalization factors should be stored in
> the backend(index).

Unless the IDF norm is "none", these norms can't just be factored out of the equation. For example, consider SMART "bfm" - there we need the maximum value of 1/n(t) for any term in the query which occurs in the document being weighted, where n(t) is the number of different documents which term t occurs in (Xapian calls this "term frequency" but that phrase is sadly overloaded with multiple meanings in the literature). That's not a per-document constant factor you can pre-compute.

Cheers,
Olly
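The SMART "bfm" point in the reply above can be illustrated with a small Python sketch. It is not Xapian code: the function name `bfm_max_norm`, the toy document frequencies and the two example queries are all made up for illustration. It computes the maximum of 1/n(t) over the query terms that occur in the document, and shows that one and the same document gets a different value for different queries.

```python
def bfm_max_norm(query_terms, doc_terms, doc_freq):
    """Maximum of 1/n(t) over query terms that occur in the document.

    query_terms: iterable of terms in the query
    doc_terms:   set of terms occurring in the document being weighted
    doc_freq:    dict mapping term t to n(t), the number of documents
                 that t occurs in (toy values in this sketch)
    """
    matching = [t for t in query_terms if t in doc_terms]
    if not matching:
        # No query term occurs in the document, so nothing to normalise.
        return 0.0
    return max(1.0 / doc_freq[t] for t in matching)


# Made-up collection statistics and a single document.
doc_freq = {"xapian": 2, "search": 50, "index": 10}
doc_terms = {"xapian", "search", "index"}

# Same document, two different queries.
norm_q1 = bfm_max_norm(["search", "index"], doc_terms, doc_freq)
norm_q2 = bfm_max_norm(["xapian", "search"], doc_terms, doc_freq)
```

Here `norm_q1` is 1/10 and `norm_q2` is 1/2, even though the document is unchanged, so a single value stored per document at index time could not serve both queries.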