thr3ads.net - Xapian discuss - [Xapian-discuss] query time stemming and term weights [Nov 2005]

Jean-Francois Dockes

2005-Nov-16 19:15 UTC

[Xapian-discuss] query time stemming and term weights

I am developing a personal/desktop search tool for which I am
experimenting with doing no stemming during the indexing, but instead
having a stem database (or several for different languages), used for
expanding the query terms at search time.
 (ie: user query: flooring -> stem: floor
     -> final query for: [floored flooring floorings floors])
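
As a rough illustration of the expansion described above, here is a minimal
Python sketch. The names toy_stem, build_stem_db and expand_query_term are
hypothetical, and the suffix-stripping "stemmer" is only a stand-in for a real
one; nothing here is Xapian or Recoll code.

```python
from collections import defaultdict

def toy_stem(word):
    """Hypothetical stand-in for a real stemmer: strip a few English suffixes."""
    for suffix in ("ings", "ing", "ed", "s"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

def build_stem_db(vocabulary):
    """Map each stem to the set of unstemmed index terms that share it."""
    db = defaultdict(set)
    for term in vocabulary:
        db[toy_stem(term)].add(term)
    return db

def expand_query_term(term, stem_db):
    """Return the user's term plus every indexed term with the same stem."""
    variants = stem_db.get(toy_stem(term), set())
    return sorted(variants | {term})

# The index holds raw (unstemmed) terms only.
vocabulary = ["floored", "flooring", "floorings", "floors", "door", "doors"]
stem_db = build_stem_db(vocabulary)
expanded = expand_query_term("flooring", stem_db)
print(expanded)
```

The expanded list would then be combined with OR to form the final query.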

I have thought of a possible problem with weighting when using this
approach. I am not really confident in my knowledge of how things are
computed, so I am not sure that this is an actual issue.

The problem is with term frequencies. When doing the stemming at index
time, the term frequency will be for the stem, more or less the sum of derived
terms frequencies.

My concern is that, when doing the stemming at search time, each derived
term will have its own frequency, and the results are going to be biased
towards those that occur less often (which is not desired because the user
did not explicitly search for them).
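
To make the concern concrete, here is a small Python sketch using the
collection-frequency part of the traditional probabilistic weight with no
relevance information, log((N - n + 0.5) / (n + 0.5)), where n is the number
of documents containing the term. This is a simplification (it ignores
within-document frequency and length normalisation, so it is not what Xapian
computes in full), and all the counts below are invented for illustration.

```python
import math

def idf_weight(doc_freq, num_docs):
    """Collection-frequency weight with no relevance information."""
    return math.log((num_docs - doc_freq + 0.5) / (doc_freq + 0.5))

num_docs = 10000  # assumed collection size

# Assumed number of documents containing each unstemmed variant.
doc_freqs = {"floors": 400, "flooring": 250, "floored": 40, "floorings": 5}

# Query-time stemming: every variant is weighted by its own frequency.
per_term = {term: idf_weight(n, num_docs) for term, n in doc_freqs.items()}

# Index-time stemming: one weight for the stem. Its document count lies
# between the largest variant count and the sum of all counts (assumed 500).
stem_weight = idf_weight(500, num_docs)

for term, weight in sorted(per_term.items(), key=lambda kv: -kv[1]):
    print(f"{term:10s} {weight:.2f}")
print(f"{'floor':10s} {stem_weight:.2f}  (stem)")
```

With these made-up numbers the rarest variant gets the highest weight, even
though the user asked for "flooring".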

Maybe I don't understand the issue and this is not a problem? Else would
there be a way so that the aggregate term frequency is used for each of the
derived terms?

Or should I go back to performing stemming during indexing?

Cheers,
J.F. Dockes

-- 
Recoll: desktop search for Unix. http://www.recoll.org

Olly Betts

2005-Nov-16 21:51 UTC

On Wed, Nov 16, 2005 at 08:14:31PM +0100, Jean-Francois Dockes wrote:
> The problem is with term frequencies. When doing the stemming at index
> time, the term frequency will be for the stem, more or less the sum of
> derived terms frequencies.

Strictly speaking "same or less" rather than "more or less"...

> My concern is that, when doing the stemming at search time, each derived
> term will have its own frequency, and the results are going to be biased
> towards those that occur less often (which is not desired because the user
> did not explicitly search for them).
> 
> Maybe I don't understand the issue and this is not a problem?

The derivation of the probabilistic weights (which are the ones which
you get when you use TradWeight) assumes that terms are independent.
You really have to, or the maths just becomes intractable.

Now even in general, this assumption isn't exactly true, but it's
particularly shaky for terms which have the same stem.

So that perhaps suggests that it's better to use the frequency of the
stem as you suggest.  Or at least that you are right to consider the
issue!

The pragmatic view is that the derivation of the probabilistic weights
is just giving you a plausible candidate weighting formula to compare to
other approaches - the reason it is a good formula to use isn't really
because of the maths behind it, but because it performs well in
practical trials.  A theoretical justification is reassuring, but
doesn't make up for poor actual results!

BM25 weights (the default in Xapian) build upon the traditional
probabilistic weighting formula in ways which do have some theoretical
justification, but it seems to me that the guiding principle in the
progression from the original probabilistic formula through BM11 and
BM15 to BM25 is "gives better results".

For example, everybody now ignores the constant c (the power to which
f and K are raised), which featured in the original BM25 - Stephen
Robertson et al remarked in an early paper that powers other than 1 were
"not helpful" (section 3.2, page 3):

http://trec.nist.gov/pubs/trec3/papers/city.ps.gz

So my suggestion would be to do some tests and see if retrieval
effectiveness is actually made better/worse or left unchanged by
stemming at search vs index time.  I'd definitely be interested to hear
the results of any such tests.

> Else would there be a way so that the aggregate term frequency is used
> for each of the derived terms?

Not directly I think - if it's useful, a "SynonymPostList" isn't too
hard to write.  You can probably even use the "correct" combined term
frequency by generating terms from the stems and using those to get
the term frequency.  Or just use an estimated term frequency based on
probabilities:

http://www.xapian.org/cgi-bin/bugzilla/show_bug.cgi?id=50
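
One way to read "an estimated term frequency based on probabilities" is the
usual estimate for an OR of independent terms: treat n_i / N as the
probability that a document contains variant i, and estimate the combined
count as N * (1 - product of (1 - n_i / N)). The Python sketch below assumes
that reading; it is not taken from the bug report linked above, and the
function name and numbers are invented for illustration.

```python
def estimated_combined_freq(doc_freqs, num_docs):
    """Estimate how many documents contain at least one of the terms,
    assuming the terms occur independently of each other."""
    p_none = 1.0
    for n in doc_freqs:
        p_none *= 1.0 - n / num_docs  # probability a document lacks this term
    return num_docs * (1.0 - p_none)

num_docs = 10000                   # assumed collection size
variant_freqs = [400, 250, 40, 5]  # assumed per-variant document counts

estimate = estimated_combined_freq(variant_freqs, num_docs)
print(round(estimate, 1))
```

The estimate always falls between the largest single count and the plain sum.
Since variants of one stem tend to co-occur, the independence assumption
probably overestimates the true combined count.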

Longer term, I'm interested in the idea of stemming at search time
(at least as an option).  It has several benefits such as allowing an
exact word search without having to index "raw" terms too, and allowing
choice of stemming language at search time.

Cheers,
    Olly
