thr3ads.net - Xapian devel - [Xapian-devel] Dealing with negative weights [Jun 2013]

Aarsh Shah

2013-Jun-20 11:40 UTC

[Xapian-devel] Dealing with negative weights

Hello guys. I am currently working on the DLH weighting scheme. The formula
for DLH is very complex, and it ends up giving negative weights to some
documents. Because of this, documents with negative weights don't show up in
the results at all, in spite of containing one or more occurrences of the
keyword. Can I please get some help on how to deal with this? Or should I
just leave it as it is and let the poor documents suffer by virtue of having
statistics not suitable for DLH?

-Regards
-Aarsh
-GSOC student for Debian.

Olly Betts

2013-Jun-21 11:11 UTC


On Thu, Jun 20, 2013 at 05:10:30PM +0530, Aarsh Shah wrote:
> Hello guys. I am currently working on the DLH weighting scheme. The formula
> for DLH is very complex, and it ends up giving negative weights to some
> documents. Because of this, documents with negative weights don't show up in
> the results at all, in spite of containing one or more occurrences of the
> keyword. Can I please get some help on how to deal with this? Or should I
> just leave it as it is and let the poor documents suffer by virtue of having
> statistics not suitable for DLH?

Xapian assumes each component of the weight sum is non-negative - if you
return a negative component, the matcher optimisations will go wrong.

If there's a lower bound, you might be able to address this by
subtracting that bound and adjusting the term-independent component to
compensate for the terms which don't match each document.

E.g. if the weight contributed by query term t in doc d is W(t,d) and
Wi(d) is the term independent component, then the weight for document d
is:

  W_sum(d) = Sum{t in d}(W(t,d)) + Wi(d)

If we have a lower bound for W(t,d):

  (a) W_low(t) <= W(t,d) for all d

And it's negative (or if the weight for a given term is always >= 0,
just make this lower bound zero):

  (b) W_low(t) <= 0

And similarly for Wi(d):

  (c) Wi_low <= Wi(d) for all d

And let's only adjust Wi if we have to:

  (d) Wi_low <= 0

Then you can transform your weighting scheme to this one:

  W'(t,d) = W(t,d) - W_low(t)
  then:  W'(t,d) >= 0  (from (a))

  Wi'(d) = Wi(d) - Wi_low - Sum{t not in d}(W_low(t))
  so:  Wi'(d) >= Wi(d) - Wi_low  (from (b))
  so:  Wi'(d) >= 0  (from (c))

And the total weight for document d is:

  W_sum'(d) = Sum{t in d}(W'(t,d)) + Wi'(d)
    = Sum{t in d}(W(t,d) - W_low(t)) + Wi(d) - Wi_low
      - Sum{t not in d}(W_low(t))
    = Sum{t in d}(W(t,d)) + Wi(d) - Wi_low - Sum{t}(W_low(t))
    = W_sum(d) - Wi_low - Sum{t}(W_low(t))

So that has simply added -Wi_low - Sum{t}(W_low(t)) to every weight, which
is constant for a given query on a given database (and non-negative, by (b)
and (d)) - the relative ordering of the weights is preserved.
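
As a quick numeric check of the transformation above, here is a minimal Python sketch (not Xapian code). The documents, terms, weights and bounds are made up for illustration, and the helper names `shift_weights`, `totals` and `ranking` are hypothetical. It applies W' and Wi' as defined above and compares the ordering before and after.

```python
def shift_weights(w, wi, w_low, wi_low):
    """Return (W', Wi') as defined in the derivation above."""
    w_shifted, wi_shifted = {}, {}
    for doc, per_term in w.items():
        # W'(t,d) = W(t,d) - W_low(t) for the query terms present in d
        w_shifted[doc] = {t: x - w_low[t] for t, x in per_term.items()}
        # Sum{t not in d}(W_low(t)): bounds of the query terms absent from d
        missing = sum(low for t, low in w_low.items() if t not in per_term)
        # Wi'(d) = Wi(d) - Wi_low - Sum{t not in d}(W_low(t))
        wi_shifted[doc] = wi[doc] - wi_low - missing
    return w_shifted, wi_shifted


def totals(w, wi):
    """W_sum(d) = Sum{t in d}(W(t,d)) + Wi(d) for every document."""
    return {doc: sum(w[doc].values()) + wi[doc] for doc in w}


def ranking(scores):
    """Document ids ordered from highest to lowest total weight."""
    return sorted(scores, key=scores.get, reverse=True)


# Made-up weights for a two-term query ("a", "b") over three documents.
w = {"d1": {"a": -1.5, "b": 2.0}, "d2": {"a": 0.5}, "d3": {"b": -0.25}}
wi = {"d1": 0.0, "d2": -0.5, "d3": 0.5}
w_low = {"a": -2.0, "b": -1.0}   # satisfies (a) and (b)
wi_low = -0.5                    # satisfies (c) and (d)

w_shifted, wi_shifted = shift_weights(w, wi, w_low, wi_low)
before = totals(w, wi)
after = totals(w_shifted, wi_shifted)
print(ranking(before), ranking(after))
```

With these numbers every shifted component is non-negative, every total moves up by the same 3.5 (that is, -Wi_low - Sum{t}(W_low(t)) = 0.5 + 3.0), and both rankings come out the same.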

Cheers,
    Olly

Aarsh Shah

2013-Jun-22 07:11 UTC


I was adding the lower-bound calculation to get_sumpart() (DLH has no
term-independent component) when I realized that the same lower bound would
be recalculated for every term-document pair that get_sumpart() is called
for, which reduces efficiency. How do I calculate the lower bound for a term
only once and then reuse it?

-Regards
-Aarsh
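
One general way to avoid the repeated work is to compute the bound in a per-term setup step and cache it on the weight object, so the per-document call only reads it. The Python sketch below shows just that pattern. It is not the Xapian C++ API: the class `CachedBoundWeight`, the method `init_for_term` and the toy formula are all hypothetical, and the formula is a stand-in rather than DLH. In Xapian itself the `init()` hook of a `Weight` subclass looks like the natural place for this, but that is an assumption here and is not confirmed in this thread.

```python
import math


class CachedBoundWeight:
    """Hypothetical weighting scheme; names do not come from the Xapian API."""

    def __init__(self, avg_len):
        self.avg_len = avg_len
        self.lower_bound = 0.0
        self.bound_computations = 0  # counts how often the bound is computed

    def raw_weight(self, wdf, doc_len):
        # Toy stand-in for the real formula (NOT DLH): it is negative
        # whenever wdf * avg_len < doc_len.
        return math.log2(wdf * self.avg_len / doc_len)

    def init_for_term(self, min_wdf, max_doc_len):
        # Run once per query term, before any document is scored.
        # The toy formula is smallest at the smallest wdf and largest length.
        self.bound_computations += 1
        self.lower_bound = min(0.0, self.raw_weight(min_wdf, max_doc_len))

    def get_sumpart(self, wdf, doc_len):
        # Per-document call: reuses the cached bound, never recomputes it.
        return self.raw_weight(wdf, doc_len) - self.lower_bound


scheme = CachedBoundWeight(avg_len=100)
scheme.init_for_term(min_wdf=1, max_doc_len=400)
scores = [scheme.get_sumpart(wdf, n) for wdf, n in [(1, 400), (2, 100), (1, 200)]]
print(scores, scheme.bound_computations)
```

The bound is computed once however many documents are scored, and every returned weight is non-negative because the cached bound is subtracted in each call.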


