thr3ads.net - Xapian devel - GSoC 2016 "Project : Weighting Scheme" Intro [Mar 2016]

Nishad Dawkhar

2016-Mar-07 10:30 UTC

GSoC 2016 "Project : Weighting Scheme" Intro

Hi everyone,

I'm a final (fourth) year university student from India pursuing my
Bachelor's in Engineering, majoring in Information Technology. Information
Retrieval is one of the courses I am studying in my current semester, so
I'm well versed in many IR concepts, such as the various retrieval
models. I'm also comfortable coding in C++. Hence I am interested in
applying as a potential GSoC student for Xapian. Prior to this, I
interned at Autodesk, where I worked on the Fusion 360 product, which is also
written in C++. I also have a few projects hosted on my GitHub
(https://github.com/Nishad94).

The idea that most interested me is "Project: Weighting Schemes", which is
concerned with adding support for more weighting schemes to Xapian.
Following https://trac.xapian.org/wiki/GSoC%20Guide, I have successfully
checked out and built the code on my system. For the past few days I have
been trying to get familiar with the code by reading the available
documentation.

I wanted to discuss the implementation of the TF/IDF normalization schemes
described by SMART, which are not currently supported in Xapian. In order
to get started on the project with something small, I was thinking of a way
to implement the max-norm normalization (new-tf = tf / max-tf) for the
term frequency (tf) component. The additional statistic which would be
required in order to support this is the max-tf for a given document. This
could be supported by modifying the get_sumpart() function for a weighting
scheme to accept an additional argument for max-tf. The responsibility of
providing this argument would lie with the caller of this function, which
seems to be a PostList object. Similar to the functions
PostList::get_doclength() and PostList::get_unique_terms(), we can add a
virtual function called get_max_wdf() which returns the max-tf for the
current document in the PostList. For its implementation, the underlying db
would also need a method which provides this value for a given
Xapian::docid. This can be achieved by creating a TermList object for this
docid and then iterating over all the terms to find the maximum
within-doc-freq [It can be cached for faster access the next time it is
needed].
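
To make the idea concrete, here is a minimal, self-contained sketch of the
computation. It deliberately does not use the Xapian API: TermWdfs,
max_wdf(), max_norm_tf() and MaxWdfCache are hypothetical names, and a
std::map stands in for the TermList of a document. It only illustrates the
scan for the maximum wdf, the max-norm formula, and the caching idea.

```cpp
#include <algorithm>
#include <cassert>
#include <map>
#include <string>
#include <unordered_map>

// Hypothetical stand-in for a document's term list: term -> wdf.
using TermWdfs = std::map<std::string, unsigned>;

// Scan every term once and return the largest wdf (0 for an empty document).
unsigned max_wdf(const TermWdfs& doc) {
    unsigned best = 0;
    for (const auto& entry : doc) {
        best = std::max(best, entry.second);
    }
    return best;
}

// max-norm: new-tf = tf / max-tf, so the result lies in [0, 1].
double max_norm_tf(unsigned wdf, unsigned doc_max_wdf) {
    assert(wdf <= doc_max_wdf);
    if (doc_max_wdf == 0) return 0.0;  // empty document: avoid dividing by zero
    return static_cast<double>(wdf) / doc_max_wdf;
}

// Hypothetical per-docid cache so the scan runs only once per document.
class MaxWdfCache {
    std::unordered_map<unsigned, unsigned> cache_;

public:
    unsigned get(unsigned docid, const TermWdfs& doc) {
        auto it = cache_.find(docid);
        if (it != cache_.end()) return it->second;  // cached: no rescan
        unsigned value = max_wdf(doc);
        cache_.emplace(docid, value);
        return value;
    }
};
```

In the real code the scan would walk the document's TermList rather than a
std::map, and the cached value would have to be invalidated whenever the
document is modified.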

I would like to know if the above is feasible, and whether I should start
implementing it. Any other input would be highly appreciated.

Thanks,
Nishad Dawkhar (IRC: ndawk94)

James Aylett

2016-Mar-07 15:49 UTC

GSoC 2016 "Project : Weighting Scheme" Intro

On Mon, Mar 07, 2016 at 10:30:39AM +0000, Nishad Dawkhar wrote:
> The idea that most interested me is "Project: Weighting Schemes",
> which is concerned with adding support for more Weighting Schemes to
> Xapian. As mentioned on https://trac.xapian.org/wiki/GSoC%20Guide, I have
> successfully checked out and built the code on my system. Since the past
> few days I have been trying to get familiar with the code by reading the
> available documentation.
Hi Nishad -- welcome to Xapian! If you find that any of the documentation
is confusing or has missing pieces (and there are definitely gaps), please
do point them out so we can put them on a list, and try to get them
fixed. (If you want to suggest improvements directly, then that's
great too, but don't worry if you just spot something wrong or
confusing and don't know what needs to change.)

I think you've already talked with Olly on IRC about adding more
normalizations to our TF/IDF support, so I won't respond to that part
of your email.

J

-- 
  James Aylett, occasional trouble-maker
  xapian.org
