thr3ads.net - Xapian devel - Weighting Schemes: Evaluation results [Aug 2016]


Vivek Pal

2016-Aug-07 18:02 UTC

Weighting Schemes: Evaluation results

Hi,

Evaluation of the pivoted normalization ("PPP") of the tf-idf weighting
scheme is also complete now. I have also evaluated the default tf-idf
normalization ("ntn") and the other combinations that use pivoted
normalization in only the wdfn, idfn or wtn component ("Pxx", "xPx" and
"xxP" normalization strings respectively), to get a clear idea of which
one does a better job of retrieving relevant documents.
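
To make the three-letter strings concrete, here is a minimal sketch (not Xapian code) that splits a normalization string into its wdfn, idfn and wtn letters and reports which components are pivoted. The helper names `split_normalization` and `pivoted_components` are hypothetical and exist only for this illustration.

```python
from typing import List, NamedTuple


class Normalization(NamedTuple):
    """One letter per tf-idf component, in the order used in this thread."""
    wdfn: str  # within-document frequency normalization
    idfn: str  # inverse document frequency normalization
    wtn: str   # weight normalization


def split_normalization(spec: str) -> Normalization:
    """Split a string such as "ntn" or "Ptn" into its three letters."""
    if len(spec) != 3:
        raise ValueError(f"expected a 3-letter normalization string, got {spec!r}")
    return Normalization(*spec)


def pivoted_components(spec: str) -> List[str]:
    """Return the names of the components that use pivoted ("P") normalization."""
    parts = split_normalization(spec)
    return [name for name, letter in parts._asdict().items() if letter == "P"]


for spec in ("ntn", "Ptn", "nPn", "ntP", "PPP"):
    print(spec, "->", pivoted_components(spec) or "no pivoted component")
```

Read this way, "Ptn" is the default "ntn" with only the wdfn component switched to pivoted normalization.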

All results of evaluation runs can be easily accessed here:
https://gist.github.com/ivmarkp

Comparing the MAP (mean average precision) of "PPP" with that of "ntn"
normalization, we get the following results:

PPP : 0.0607107
ntn : 0.109525

Clearly, the default normalization does a better job here than pivoted
normalization. But since we intended to add support for pivoted
normalization in Xapian rather than replace the default normalization
with it, I think this comparison may not come as a big surprise.
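
For readers unfamiliar with the metric, the following is a minimal sketch of the textbook definition of MAP. It is not the evaluation tool used for the runs in this thread, which may differ in details such as how ties or unjudged documents are handled. The document ids and relevance sets in the usage lines are made up for illustration.

```python
from typing import Iterable, List, Sequence, Set, Tuple


def average_precision(ranked: Sequence[str], relevant: Set[str]) -> float:
    """Average of precision@k over the ranks k at which a relevant document appears."""
    if not relevant:
        return 0.0
    hits = 0
    precision_sum = 0.0
    for rank, doc_id in enumerate(ranked, start=1):
        if doc_id in relevant:
            hits += 1
            precision_sum += hits / rank
    # Relevant documents that were never retrieved contribute zero.
    return precision_sum / len(relevant)


def mean_average_precision(runs: Iterable[Tuple[Sequence[str], Set[str]]]) -> float:
    """Mean of average precision over all (ranking, relevant set) pairs, one per query."""
    scores: List[float] = [average_precision(r, rel) for r, rel in runs]
    if not scores:
        raise ValueError("need at least one query")
    return sum(scores) / len(scores)


# Hypothetical two-query example.
example_runs = [
    (["a", "b", "c"], {"a", "c"}),
    (["x", "y"], {"y"}),
]
print(mean_average_precision(example_runs))
```

A higher MAP means relevant documents tend to appear nearer the top of the ranking, averaged over all queries.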

Similarly, the MAP values of Ptn, nPn and ntP, which represent the "Pxx",
"xPx" and "xxP" normalization strings respectively, are as follows:

ntP: 0.0747668
nPn: 0.0676789
Ptn: 0.11379

Interestingly, Ptn normalization does a better job than all the other
normalizations, including the default ("ntn"). So I think Ptn can be
recommended to applications based on a news corpus that want to explore
options beyond the default tf-idf normalization.
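
As a quick way to read the five MAP values reported above side by side, this small sketch ranks them and expresses each as a relative change against the "ntn" baseline. The numbers are copied from this message; nothing else is assumed.

```python
# MAP values as reported in this message.
map_scores = {
    "PPP": 0.0607107,
    "ntn": 0.109525,
    "ntP": 0.0747668,
    "nPn": 0.0676789,
    "Ptn": 0.11379,
}

baseline = map_scores["ntn"]

# Best-performing normalization first.
ranking = sorted(map_scores, key=map_scores.get, reverse=True)

# Relative change with respect to the default "ntn" normalization.
relative_change = {
    name: (score - baseline) / baseline for name, score in map_scores.items()
}

for name in ranking:
    print(f"{name}: MAP={map_scores[name]:.4f} ({relative_change[name]:+.1%} vs ntn)")
```

On these figures Ptn is about 3.9% above the default, while the other three pivoted variants fall well below it.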

As a small side note: I'm now planning to take up the additional tasks we
were looking to work on at the end. Before that, I was wondering if this
is the right time to complete the documentation for the BM25+, PL2+, Dir+
and Piv+ weighting schemes, and also whether the PRs for these weighting
schemes can finally be merged upstream. Please let me know if there are
any loose ends that need work before the PRs can be merged.

Regards,
Vivek

James Aylett

2016-Aug-08 18:07 UTC


Weighting Schemes: Evaluation results

On Sun, Aug 07, 2016 at 11:32:27PM +0530, Vivek Pal wrote:
> All results of evaluation runs can be easily accessed here:
> https://gist.github.com/ivmarkp
Hey, that's great!
> Comparing the MAP of "PPP" with that of "ntn" normalization, we get
> results as follows:
> 
> PPP : 0.0607107
> ntn : 0.109525
> 
> Clearly, the default normalization does a better job here than pivoted
> normalization but since we intended to have support for pivoted
> normalization in Xapian rather making a replacement of default
> normalization with pivoted normalization, I think this comparison may not
> come as a big surprise.
Hmm. It'd be nice if we knew what sort of corpus PPP would be good
for; is there something suggestive in the literature?
> Similarly, the MAP of Ptn, nPn and ntP which represent "Pxx", "xPx" and
> "xxP" normalization strings respectively are as follows:
> 
> ntP: 0.0747668
> nPn: 0.0676789
> Ptn: 0.11379
> 
> Interestingly, Ptn normalization does fairly good job than all other
> normalizations and the default normalization ("ntn") as well. So, I think
> it can be recommended for applications based on news corpus to definitely
> use Ptn normalization if exploring options beyond default tf-idf
> normalization.
Sounds good!
> As a small side note -- now I'm planning to take up additional tasks
> we were looking to work on in the end but before that I was
> wondering if this is the right time to complete the documentation
> part of BM25+, PL2+, Dir+ and Piv+ weighting schemes
Trying to complete the documentation I think is the right priority.
> and also if PRs for these weighting schemes can be merged upstream
> finally?  Please let me know if there are any loose ends that might
> need some work before PRs can be merged.
Assuming you've addressed all the earlier comments (which I think you
have), I think it's down to us at this point :-)

I've been holding back on merging largely because I have a host of
other things going on. I don't see any significant hold ups other than
that, although I'm not sure (because I haven't had to deal with it
before) in what way we need to change the ABI number for these
changes. Not sure if Olly has been following this work closely enough
to be able to comment, or if we're going to have to find some time to
sit down and figure it out (along with whether we merge these changes
into 1.4.x).

J

-- 
  James Aylett, occasional trouble-maker
  xapian.org

Vivek Pal

2016-Aug-09 16:32 UTC


Weighting Schemes: Evaluation results

> It'd be nice if we knew what sort of corpus PPP would be good
> for; is there something suggestive in the literature?
Nothing specific is mentioned for Piv+, unlike what we had for BM25+
previously, but I'm positive that the corpora used were four TREC
collections: WT2G, WT10G, Terabyte and Robust04, which represent text
collections of different sizes and genres.
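
For context on what Piv+ computes, here is a minimal sketch of the per-term weight as I understand it from Lv and Zhai's lower-bounded pivoted normalization: a double-logarithmic tf divided by a pivoted length factor, plus a constant `delta`, multiplied by an idf term. This is an illustration of the published formula, not Xapian's implementation, and the default `slope` and `delta` values below are illustrative assumptions rather than values taken from this thread.

```python
import math


def piv_plus_term_weight(
    tf: int,
    doc_len: float,
    avg_doc_len: float,
    num_docs: int,
    doc_freq: int,
    query_tf: int = 1,
    slope: float = 0.2,   # assumed value, for illustration only
    delta: float = 1.0,   # assumed value, for illustration only
) -> float:
    """Weight contributed by one query term to one document under Piv+."""
    if tf <= 0:
        return 0.0  # terms absent from the document contribute nothing
    # Pivoted length normalization: documents longer than average are penalized.
    length_norm = 1 - slope + slope * (doc_len / avg_doc_len)
    tf_part = (1 + math.log(1 + math.log(tf))) / length_norm
    idf_part = math.log((num_docs + 1) / doc_freq)
    # delta lower-bounds the tf component so very long documents are not over-penalized.
    return query_tf * (tf_part + delta) * idf_part


# Same term statistics, average-length document versus one ten times longer.
short = piv_plus_term_weight(tf=3, doc_len=100, avg_doc_len=100, num_docs=1000, doc_freq=10)
long_ = piv_plus_term_weight(tf=3, doc_len=1000, avg_doc_len=100, num_docs=1000, doc_freq=10)
print(short, long_)
```

The `delta` term is what distinguishes Piv+ from plain pivoted normalization: however long the document, a matching term still contributes at least `delta` times the idf.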
> Trying to complete the documentation I think is the right priority.
Okay, I'm on it -- will soon open PRs for the same.
> Assuming you've addressed all the earlier comments (which I think you
> have), I think it's down to us at this point :-)
Thanks, that's great. Just to make sure everything is in place, I'll take
a quick glance over things again.
> I don't see any significant hold ups other than
> that, although I'm not sure (because I haven't had to deal with it
> before) in what way we need to change the ABI number for these
> changes.
I think I have little to add here. Although, as I recall, you mentioned
in the mid-term meeting that these changes should go into the 1.5 series
rather than appear as something brand new in a 1.4 release, or something
similar, if I remember correctly. :)

Actually, now that the submission week is nearing, I was wondering how
best to list the different pieces of project work: as work that has been
merged, or would it be fine to list them as work that hasn't been merged
yet?

Thanks,
Vivek


