Hi everyone,
I'm a final (fourth) year university student from India pursuing a
Bachelor's degree in Engineering, majoring in Information Technology.
Information Retrieval is one of the courses I am taking this semester, so
I'm well versed in many IR concepts, such as the various retrieval
models. I'm also comfortable coding in C++, so I am interested in
applying as a GSoC student for Xapian. Previously, I interned at Autodesk,
where I worked on the Fusion 360 product, which is also written in C++. I
also have a few projects hosted on my GitHub
(https://github.com/Nishad94).
The idea that most interested me is "Project: Weighting Schemes", which is
concerned with adding support for more weighting schemes to Xapian. As
described at https://trac.xapian.org/wiki/GSoC%20Guide, I have successfully
checked out and built the code on my system. For the past few days I have
been getting familiar with the code by reading the available
documentation.
I wanted to discuss the implementation of the TF/IDF normalization schemes
described by SMART, which are not currently supported in Xapian. To get
started on the project with something small, I was thinking of a way to
implement the max-norm normalization (new-tf = tf / max-tf) for the term
frequency (tf) component. The additional statistic required to support
this is max-tf, the largest within-document frequency (wdf) of any term in
a given document. This could be supported by modifying the get_sumpart()
function of a weighting scheme to accept an additional argument for max-tf.
The responsibility of providing this argument would lie with the caller of
this function, which seems to be a PostList object. Similar to the functions
PostList::get_doclength() and PostList::get_unique_terms(), we could add a
virtual function called get_max_wdf() which returns the max-tf for the
current document in the PostList. For its implementation, the underlying
database would also need a method which provides this value for a given
Xapian::docid. This could be achieved by creating a TermList object for
that docid and iterating over all its terms to find the maximum wdf (the
result could be cached for faster access the next time it is needed).
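To make the proposal concrete, here is a minimal, self-contained sketch of
the two pieces: a cached per-document max-wdf lookup and the max-norm
formula itself. It is not Xapian code. ToyDatabase, TermFreqs, get_wdf()
and max_norm_tf() are hypothetical names made up for illustration, and a
std::map stands in for the TermList that a real backend would iterate.
Only the name get_max_wdf() comes from the proposal above.

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>
#include <map>
#include <string>
#include <unordered_map>
#include <utility>

// Hypothetical stand-ins for Xapian's docid and termcount types.
using DocId = std::uint32_t;
using TermCount = std::uint32_t;
// Term -> within-document frequency (wdf); plays the role of a TermList.
using TermFreqs = std::map<std::string, TermCount>;

// Toy in-memory database, assumed purely for illustration.
class ToyDatabase {
    std::unordered_map<DocId, TermFreqs> docs_;
    // Cache of already computed max-wdf values, keyed by document id.
    mutable std::unordered_map<DocId, TermCount> max_wdf_cache_;

  public:
    void add_document(DocId did, TermFreqs terms) {
        docs_[did] = std::move(terms);
        max_wdf_cache_.erase(did);  // Invalidate any stale cached value.
    }

    // wdf of `term` in document `did`, or 0 if either is absent.
    TermCount get_wdf(DocId did, const std::string& term) const {
        auto doc = docs_.find(did);
        if (doc == docs_.end()) return 0;
        auto entry = doc->second.find(term);
        return entry == doc->second.end() ? 0 : entry->second;
    }

    // Largest wdf of any term in document `did` (0 for an unknown or
    // empty document). Scans the term list once, then serves from cache.
    TermCount get_max_wdf(DocId did) const {
        auto cached = max_wdf_cache_.find(did);
        if (cached != max_wdf_cache_.end()) return cached->second;

        TermCount max_wdf = 0;
        auto doc = docs_.find(did);
        if (doc != docs_.end()) {
            for (const auto& entry : doc->second)
                max_wdf = std::max(max_wdf, entry.second);
        }
        max_wdf_cache_[did] = max_wdf;
        return max_wdf;
    }
};

// max-norm: new-tf = tf / max-tf. Returns 0 when max-tf is 0 so that an
// empty document does not cause a division by zero.
double max_norm_tf(TermCount wdf, TermCount max_wdf) {
    assert(wdf <= max_wdf);  // max-tf must bound every tf in the document.
    return max_wdf == 0 ? 0.0 : static_cast<double>(wdf) / max_wdf;
}
```

In this sketch the weighting function receives max-tf as an explicit
argument, mirroring the idea of passing it to get_sumpart(), while the
database owns the scan and the cache. Whether the value is better cached
at query time or stored at index time is left open here.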
I would like to know if the above is feasible, and whether I should start
implementing it. Any other input would be highly appreciated.
Thanks,
Nishad Dawkhar (IRC: ndawk94)