aarsh shah
2013-Feb-19 17:51 UTC
[Xapian-devel] Implementing tf-idf weighting scheme in Xapian
Hello guys. I just read up on tf-idf schemes and want to implement them in Xapian (with some frequently used normalizations), as it will also give me a good feel for implementing a weighting scheme before I start working on the DFR schemes.

I read the following as references, and I think I've understood the material well enough to write the code:

1.) http://nlp.stanford.edu/IR-book/html/htmledition/document-and-query-weighting-schemes-1.html
2.) http://classes.seattleu.edu/computer_science/csse470/Madani/ABCs.html
3.) http://en.wikipedia.org/wiki/Tf%E2%80%93idf

The basic philosophy is that rare terms (terms which occur in few documents) should give a higher weight to the documents they index than terms which occur in many documents. Also, the higher a term's within-document frequency in a document, the more weight the term gives to that document.

The basic formula is W(t,d) = wdf * log(N / termfreq). However, various normalizations can be applied to both the wdf and idf parts.

The extra per-document component will be 0 here, so get_maxextra() will return 0.

Moreover, an upper bound on W(t,d) for get_maxpart() can be found easily for a particular normalization (if I have all the required statistics available). For example, if I use logarithmic normalization for the wdf (within-document frequency), then an upper bound on W(t,d) is (log(wdf_upper_bound_) + 1) * log(N / termfreq), as N (the collection size) and termfreq (the number of documents indexed by the term t) are constant for a given term t.

However, some normalizations of the wdf use the formula wdfn = wdf / max(wdf,d), where max(wdf,d) is the maximum within-document frequency of any term in the document. This statistic is not provided by the need_stat() mechanism of the Xapian::Weight class, so I don't know how to obtain it. Can someone please help me with that?
I will work on implementing weight normalization (like cosine normalization) once I am done implementing the scheme with the various wdf and idf normalizations.

Please let me know what you all think; I want to start working. And I'm sorry for being late with modifying the stemmer patch based on the feedback; I have tests going on at university.

-Regards
-Aarsh
Olly Betts
2013-Feb-19 22:28 UTC
[Xapian-devel] Implementing tf-idf weighting scheme in Xapian
On Tue, Feb 19, 2013 at 11:21:14PM +0530, aarsh shah wrote:

> The basic philosophy is that rare terms (terms which occur in a few
> documents) should be able to give a higher weight to the documents they
> index compared to terms which occur in many documents. Also, the higher
> the within document frequency in the document, the more is the weight
> given by the term to the document.
>
> The basic formula is W(t,d) = wdf * log(N/termfreq).
>
> However, various normalizations can be applied to both wdf and idf.

Both the original probabilistic formula and BM25 actually fit in this pattern too (aside from the per-document component in BM25).

> The extra per document component will be 0 here and so get_maxextra() will
> return 0.

Indeed.

> Moreover, an upper bound on W(t,d) for get_maxpart() can be found out
> easily for a particular normalization (if I have all the required metrics
> available).
>
> For eg: If I am using logarithmic normalization for the wdf (within
> document frequency), then an upper bound on W(t,d) will be
> (log(wdf_upperbound_)+1)*log(N/termfreq) as N (collection size) and
> termfreq (number of documents indexed by the term t) will remain constant
> for a given term t.

Yes.

> However, some normalizations for the wdf include the formula wdfn = wdf /
> max(wdf,d) where max(wdf,d) is the maximum within document frequency of
> any term in the document. This metric is not provided by the need_stat()
> function of the Xapian::Weight class and so I don't know how to procure
> it. Please can someone help me with that?

We don't currently store that, and you can't efficiently calculate it on the fly, so you'd have to alter the backends to store this statistic.

I would suggest you look at the weighting schemes which don't need new stats first, and then look at ones which do once you're more familiar with implementing weighting schemes.

Cheers,
    Olly
aarsh shah
2013-Feb-20 05:09 UTC
[Xapian-devel] Implementing tf-idf weighting scheme in Xapian
TF-IDF also has many normalizations which will work with the statistics we currently provide. I'll send in a patch for a new TfIdfWeight class implementing all the normalizations I can with the current statistics. Once it is up and running, I'll work on extending the backend for the additional statistics, as you said. My final aim is to provide all the normalizations mentioned here:

1.) http://nlp.stanford.edu/IR-book/html/htmledition/document-and-query-weighting-schemes-1.html

-Regards
-Aarsh

On Tue, Feb 19, 2013 at 11:21 PM, aarsh shah <aarshkshah1992 at gmail.com> wrote:

> [original message quoted in full; snipped]