thr3ads.net - Xapian devel - GSoc 2017 Introduction(Weighting Schemes) [Mar 2017]

prachi prakash

2017-Mar-05 15:11 UTC

GSoc 2017 Introduction(Weighting Schemes)

Hello everyone,

I am a second-year graduate student at IIIT-Bangalore, and my interest is in
the field of Information Retrieval. I have successfully compiled Xapian
from source and have implemented some examples. While going through the
project list, I found that the Weighting Schemes project is the one I would
like to contribute to. I went through xapian-core/weight, where most of the
schemes are already present, and I also went through the Bigram model, which
is outside the tree and not merged yet.

Could anyone please point me to the weighting schemes that are not
implemented yet, so that I can start looking at them?

Regards,
Prachi Prakash
Final year Graduate Student
LinkedIn: https://www.linkedin.com/in/prachi-prakash-7b674351/
github: https://github.com/PrachiPrakash?tab=activity
Olly Betts

2017-Mar-06 21:31 UTC

GSoc 2017 Introduction(Weighting Schemes)

On Sun, Mar 05, 2017 at 08:41:46PM +0530, prachi prakash wrote:
> Hello Everyone,

Hi Prachi,

> So can anyone please give a pointer to which weighting schemes are not
> implemented yet so that I can start looking at it.

We don't have a list of unimplemented schemes - we're expecting students
interested in adding new weighting schemes to do a bit of reading of the
academic literature to find something that interests them.

The only real requirements are that it needs to be feasible to implement within
the existing structure and to make a suitable scope of project for 3 months of
full-time work (given your skills and experience).

Cheers,
    Olly

prachi prakash

2017-Mar-12 22:00 UTC

GSoc 2017 Introduction(Weighting Schemes)

Hi Olly,

Thanks for the quick reply. I looked more deeply into the tf-idf
implementation and found that the following document length normalizations
are not implemented [1]:

1) Cosine normalization
2) Sum of weights normalization
3) Fourth normalization
4) Max weight normalization

Since each of these normalization factors is a constant at the document
level, for each combination of wdf and idf weighting schemes (that are
already implemented) the above document normalization factors should be
stored in the backend (index).
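
To make the four normalizations listed above concrete, here is a minimal
Python sketch (not Xapian code) that computes each divisor from the list of
term weights of one document. The function name and the sample weights are
made up for illustration, and the definition used for the fourth
normalization (sum of fourth powers) is an assumption about the SMART
convention that should be checked against [1].

```python
import math


def normalization_divisors(weights):
    """Return the divisor each normalization would apply to a document.

    `weights` holds the unnormalized weight of every term in the document.
    """
    return {
        # Cosine: Euclidean length of the weight vector.
        "cosine": math.sqrt(sum(w * w for w in weights)),
        # Sum of weights: L1 length of the weight vector.
        "sum": sum(weights),
        # Fourth: ASSUMED here to be the sum of fourth powers.
        "fourth": sum(w ** 4 for w in weights),
        # Max weight: the largest single term weight.
        "max": max(weights),
    }


# Hypothetical document with two term weights.
divisors = normalization_divisors([3.0, 4.0])
```

Each term weight would then be divided by the chosen divisor. Note that the
divisors depend on the term weights themselves, so they are only fixed per
document if those weights do not change as the collection grows.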

Furthermore, I was thinking that multiplying by the document normalization
factor while weighting each term can be redundant. Could we have an abstract
function like get_mulextra in the Weight class which would return a
term-independent document normalization factor? This factor could then be
multiplied by the document's weight for the query to get the final weight
(rank) of the document for that query.
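
The idea can be sketched in a few lines of Python. This is only an
illustration of the arithmetic, not Xapian code: get_mulextra is the
proposed (not existing) method, and the function names and numbers below
are hypothetical. It assumes the normalization factor really is the same
for every term of the document.

```python
import math


def score_per_term(term_weights, doc_norm):
    """Apply the document normalization factor to every term weight."""
    return sum(w * doc_norm for w in term_weights)


def score_factored(term_weights, doc_norm):
    """Sum the raw term weights, then apply the factor once at the end."""
    return doc_norm * sum(term_weights)


# Hypothetical weights of the query terms matching one document.
term_weights = [1.5, 0.25, 2.0]
doc_norm = 0.2  # hypothetical term-independent normalization factor

per_term = score_per_term(term_weights, doc_norm)
factored = score_factored(term_weights, doc_norm)
same = math.isclose(per_term, factored)
```

Both functions give the same score up to floating-point rounding, which is
why a single multiplication per document would be enough in this case.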

Please let me know whether I am thinking in the right direction.

References:
[1] Nicola Polettini. The Vector Space model in Information Retrieval - Term
Weighting Problem. January 2004.

Regards,
Prachi Prakash
Final year Graduate Student
LinkedIn: https://www.linkedin.com/in/prachi-prakash-7b674351/
github: https://github.com/PrachiPrakash?tab=activity


Olly Betts

2017-Mar-21 04:30 UTC

GSoc 2017 Introduction(Weighting Schemes)

On Mon, Mar 13, 2017 at 03:30:44AM +0530, prachi prakash wrote:
> Thanks for an early reply. I looked a bit deep into the tf-idf
> implementation and found that the following document length normalizations
> are not implemented [1].
>
> 1) Cosine normalization
> 2) Sum of weights normalization
> 3) Fourth Normalization
> 4) Max weight normalization

There's also the "pivoted unique" normalisation, as linked from the project
idea resources list.

> All the normalization factor being a constant at the document level, for
> each combination of wdf and idf weighting scheme (that are already
> implemented)  the above document normalization factors should be stored in
> the backend(index).

Unless the IDF norm is "none", these norms can't just be factored out of
the equation.  For example, consider SMART "bfm" - there we need the maximum
value of 1/n(t) for any term in the query which occurs in the document
being weighted, where n(t) is the number of different documents which term t
occurs in (Xapian calls this "term frequency" but that phrase is sadly
overloaded with multiple meanings in the literature).  That's not a
per-document constant factor you can pre-compute.
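
A small Python sketch of the "bfm" case described above may help. It is not
Xapian code; the function name, terms and document counts are invented for
illustration. It computes the maximum of 1/n(t) over the query terms that
occur in the document, and shows that the same document gets a different
value for two different queries.

```python
def max_inverse_doc_count(doc_terms, query_terms, n):
    """Max of 1/n(t) over query terms t that occur in the document.

    `n` maps each term to the number of documents containing it.
    Returns None when no query term occurs in the document.
    """
    matching = [t for t in query_terms if t in doc_terms]
    if not matching:
        return None
    return max(1.0 / n[t] for t in matching)


# Hypothetical collection statistics and a single document.
n = {"xapian": 2, "search": 50, "the": 1000}
doc_terms = {"xapian", "search", "the"}

norm_q1 = max_inverse_doc_count(doc_terms, ["xapian", "search"], n)
norm_q2 = max_inverse_doc_count(doc_terms, ["search", "the"], n)
```

Here norm_q1 comes from "xapian" (1/2) and norm_q2 from "search" (1/50), so
the value depends on the query as well as the document. The n(t) values
would also change whenever documents are added to or removed from the
collection.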

Cheers,
    Olly
