thr3ads.net - Xapian discuss - Weighting the author of a doc when that term can also appear as a frequent term in other docs [Sep 2017]

Alex Aminoff

2017-Sep-28 17:27 UTC

Weighting the author of a doc when that term can also appear as a frequent term in other docs

We have a corpus of academic papers. Sometimes it happens that there is 
an academic controversy and one paper is a response or rebuttal to 
another paper. The name of the author of the first paper may appear many 
times in the second paper. So in light of this, how should we set our 
weight on the author field?

Here is an example:

http://www.nber.org/papers/w11215

  in which the term "Hoxby" appears 315 times, referring to several 
previous papers by Hoxby

http://www.nber.org/papers/w11216

  in which the term "Rothstein" is used 47 times

So if a user searches for "Hoxby", I would prefer that the comment on 
Hoxby not utterly dominate search results for which Hoxby is the author. 
But I don't want to set the weight on the author field to like 300, that 
would cause a search for "Moore's Law" to be dominated by results 
written by authors named Moore.

One suggestion someone had was what if the 300th mention of Hoxby was 
not as important as the first. I tried to read

  https://xapian.org/docs/bm25.html

and I think I conclude that as long as f is small relative to L or K, 
the value of the expression will increase linearly with f. To make it 
less than linear, we might invoke
> BM25 originally introduced another constant, as a power to which f and
> K are raised. However, Stephen remarks that powers other than 1 were
> 'not helpful', and other tests confirm this, so Xapian's
> implementation of BM25 ignores this.

If I could raise f to a power less than 1, that would do what I want.
But I am not at all sure this is the right approach.

Perhaps in real use this will turn out to be a minor issue.

  - Alex

Olly Betts

2017-Oct-05 05:31 UTC

Weighting the author of a doc when that term can also appear as a frequent term in other docs

On Thu, Sep 28, 2017 at 01:27:18PM -0400, Alex Aminoff wrote:
> We have a corpus of academic papers. Sometimes it happens that there is an
> academic controversy and one paper is a response or rebuttal to another
> paper. The name of the author of the first paper may appear many times in
> the second paper. So in light of this, how should we set our weight on the
> author field?
>
> Here is an example:
>
> http://www.nber.org/papers/w11215
>
>  in which the term "Hoxby" appears 315 times, referring to several previous
> papers by Hoxby
>
> http://www.nber.org/papers/w11216
>
>  in which the term "Rothstein" is used 47 times
>
> So if a user searches for "Hoxby", I would prefer that the comment on Hoxby
> not utterly dominate search results for which Hoxby is the author. But I
> don't want to set the weight on the author field to like 300, that would
> cause a search for "Moore's Law" to be dominated by results written by
> authors named Moore.
>
> One suggestion someone had was what if the 300th mention of Hoxby was not as
> important as the first.
That's a valid intuition, and if you look at the various weighting formulae,
it's usually true that each additional occurrence adds less importance than
the previous one.  For some formulae extra occurrences may eventually even
reduce the importance.
> I tried to read
> 
>  https://xapian.org/docs/bm25.html
> 
> and I think I conclude that as long as f is small relative to L or K, the
> value of the expression will increase linearly with f.
If b is 0, then that term varies with f as:

  (k1 + 1) f / (k1 + f)

In the limit this tends to (k1 + 1) from below (though f can't grow without
limit as f <= L), and it is approximately linear for small f, but f >= 1
and an integer so e.g. with default k1=1 then f = 1, 2, 3, 4 gives 1, ~1.33,
1.5, 1.6 and the limit is 2 so it becomes sublinear quite quickly.
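
As a quick check of those numbers, here is a minimal Python sketch of the b = 0 case. The function name bm25_tf_b0 is made up for illustration and is not part of Xapian's API; it evaluates only the f-dependent factor above, not a full BM25 score.

```python
def bm25_tf_b0(f, k1=1.0):
    """f-dependent factor of BM25 when b = 0: (k1 + 1) f / (k1 + f)."""
    return (k1 + 1.0) * f / (k1 + f)


# The first few values rise quickly, then flatten out below k1 + 1 = 2.
# 47 and 315 are the term counts from the two example papers.
for f in (1, 2, 3, 4, 47, 315):
    print(f, round(bm25_tf_b0(f), 3))
```

With b = 0 and k1 = 1, 315 occurrences are therefore worth less than twice as much as a single occurrence.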

If b is 1, K is k1.L and that term varies with f as:

  (k1 + 1) f / (k1.L + f)
= (k1 + 1) Q / (k1 + Q)    where Q = f/L

1 <= f <= L so 0 < Q <= 1 (though 1 means f = L which is a document consisting
of just one word repeated over and over - in real data 0 < Q << 1).

At the default k1=1, if you graph this it is approximately linear close to
0 and drops below linear more as Q increases.  Again, not every value is
seen.

Values of b between 0 and 1 effectively blend the two, so smaller b will
probably work better in this regard (the default is b=0.5), though smaller
b will tend to prefer longer documents which can be undesirable.
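
To see the blending, here is a rough Python sketch. It assumes K = k1 * ((1 - b) + b * L), which reduces to the two cases above (K = k1 when b is 0, K = k1.L when b is 1). The function name bm25_tf and the sample values of f and L are made up for illustration and are not taken from any real index.

```python
def bm25_tf(f, L, k1=1.0, b=0.5):
    """f-dependent factor (k1 + 1) f / (K + f).

    Assumes K = k1 * ((1 - b) + b * L), where L is the document length
    measure used in the formulae above.
    """
    K = k1 * ((1.0 - b) + b * L)
    return (k1 + 1.0) * f / (K + f)


# Hypothetical document: the term occurs 3 times and L = 2.0.
for b in (0.0, 0.5, 1.0):
    print(b, round(bm25_tf(3, 2.0, b=b), 3))
```

For a fixed f, lowering b makes the factor depend less on L, which is the trade-off against longer documents mentioned above.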
> To make it less than linear, we might invoke
>
> > BM25 originally introduced another constant, as a power to which f and K
> > are raised. However, Stephen remarks that powers other than 1 were
> > 'not helpful', and other tests confirm this, so Xapian's implementation
> > of BM25 ignores this.
>
> If I could raise f to a power less than 1, that would do what I want. But I
> am not at all sure this is the right approach.
I've never seen a BM25 implementation which includes this, and in fact most
descriptions of BM25 don't even mention it - I think it was really an idea
during BM25's development which was abandoned.

It could be implemented, but it's probably of limited use, and will be
slower (partly because of a lot of calls to pow(), but also because it'll
make it harder to give a good upper bound).
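
For illustration only, here is a Python sketch of what such a power might look like. It assumes f and K are both raised to the same power p, as the quoted documentation describes; the exact form of the abandoned variant is an assumption, and the function name bm25_tf_pow is made up (Xapian has nothing like it).

```python
def bm25_tf_pow(f, K, k1=1.0, p=1.0):
    """Hypothetical variant: (k1 + 1) f^p / (K^p + f^p).

    With p = 1 this is the usual (k1 + 1) f / (K + f).
    """
    return (k1 + 1.0) * f ** p / (K ** p + f ** p)


# Compare the usual factor (p = 1) with a power below 1, taking K = 1.
for f in (1, 4, 16, 315):
    print(f, round(bm25_tf_pow(f, 1.0), 3), round(bm25_tf_pow(f, 1.0, p=0.5), 3))
```

The limit is still k1 + 1 either way, so the power changes how fast the factor saturates rather than how high it can go.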

I'd first look at tuning the available parameters to BM25Weight, and if
that doesn't solve the problem, I'd try some of the other weighting schemes
that Xapian already supports.  For example DPHWeight is apparently quite
effective at penalising term spamming, which is a similar problem to the
one you raise.
> Perhaps in real use this will turn out to be a minor issue.
It's certainly possible to overthink such issues.

Cheers,
    Olly
