Jean-Francois Dockes
2005-Nov-16 19:15 UTC
[Xapian-discuss] query time stemming and term weights
I am developing a personal/desktop search tool for which I am experimenting with doing no stemming during indexing, but instead having a stem database (or several, for different languages) used for expanding the query terms at search time (i.e. user query: flooring -> stem: floor -> final query for: [floored flooring floorings floors]).

I have thought of a possible problem with weighting when using this approach. I am not really confident in my knowledge of how things are computed, so I am not sure that this is an actual issue.

The problem is with term frequencies. When doing the stemming at index time, the term frequency will be for the stem, more or less the sum of the derived terms' frequencies.

My concern is that, when doing the stemming at search time, each derived term will have its own frequency, and the results are going to be biased towards the terms that occur less often (which is not desired, because the user did not explicitly search for them).

Maybe I don't understand the issue and this is not a problem? Otherwise, would there be a way to have the aggregate term frequency used for each of the derived terms? Or should I go back to performing stemming during indexing?

Cheers,
J.F. Dockes
--
Recoll: desktop search for Unix.
http://www.recoll.org
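The concern can be made concrete with a small sketch. It is not Xapian code: the stem database, the posting lists and the collection size are invented for illustration, and the weight is the simplified probabilistic form log((N - n + 0.5) / (n + 0.5)), where n is the number of documents containing the term (which is what Xapian calls the term frequency). Under those assumptions, rare variants get a noticeably higher weight than the stem would.

```python
import math

# Hypothetical stem database: stem -> surface forms found in the index.
STEM_DB = {"floor": ["floored", "flooring", "floorings", "floors"]}

# Toy posting lists (assumed data): term -> ids of documents containing it.
POSTINGS = {
    "floored": {1, 2},
    "flooring": {2, 3, 4, 5, 6, 7},
    "floorings": {7},
    "floors": {1, 3, 4, 8, 9},
}
N_DOCS = 100  # assumed collection size


def expand(stem):
    """Expand a stem into its surface forms at search time."""
    return STEM_DB.get(stem, [stem])


def weight(n, total=N_DOCS):
    """Simplified probabilistic weight for a term found in n documents."""
    return math.log((total - n + 0.5) / (n + 0.5))


def per_term_weights(stem):
    """Search-time stemming: each variant is weighted by its own frequency."""
    return {t: weight(len(POSTINGS[t])) for t in expand(stem)}


def stem_weight(stem):
    """Index-time stemming: one weight from the documents matching any variant."""
    docs = set().union(*(POSTINGS[t] for t in expand(stem)))
    return weight(len(docs))


if __name__ == "__main__":
    for term, w in sorted(per_term_weights("floor").items()):
        print(f"{term:10s} {w:.3f}")
    print(f"{'floor':10s} {stem_weight('floor'):.3f}  (aggregate)")
```

With this toy data, a document containing only "floorings" is scored higher than one containing only "flooring", although the user asked for "flooring".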
On Wed, Nov 16, 2005 at 08:14:31PM +0100, Jean-Francois Dockes wrote:
> The problem is with term frequencies. When doing the stemming at index
> time, the term frequency will be for the stem, more or less the sum of derived
> terms frequencies.

Strictly speaking "same or less" rather than "more or less"...

> My concern is that, when doing the stemming at search time, each derived
> term will have its own frequency, and the results are going to be biased
> towards those that occur less often (which is not desired because the user
> did not explicitly search for them).
>
> Maybe I don't understand the issue and this is not a problem ?

The derivation of the probabilistic weights (which are the ones you get when you use TradWeight) assumes that terms are independent. You really have to, or the maths just becomes intractable. Now even in general, this assumption isn't exactly true, but it's particularly shaky for terms which have the same stem. So that perhaps suggests that it's better to use the frequency of the stem as you suggest. Or at least that you are right to consider the issue!

The pragmatic view is that the derivation of the probabilistic weights is just giving you a plausible candidate weighting formula to compare to other approaches - the reason it is a good formula to use isn't really because of the maths behind it, but because it performs well in practical trials. A theoretical justification is reassuring, but doesn't make up for poor actual results!

BM25 weights (the default in Xapian) build upon the traditional probabilistic weighting formula in ways which do have some theoretical justification, but it seems to me that the guiding principle in the progression from the original probabilistic formula through BM11 and BM15 to BM25 is "gives better results".
For example, everybody now ignores the constant c, giving a power to which f and K are raised, which featured in the original BM25 - Stephen Robertson et al remarked in an early paper that powers other than 1 were "not helpful" (section 3.2, page 3):

http://trec.nist.gov/pubs/trec3/papers/city.ps.gz

So my suggestion would be to do some tests and see if retrieval effectiveness is actually made better/worse or left unchanged by stemming at search vs index time. I'd definitely be interested to hear the results of any such tests.

> Else would there be a way so that the aggregate term frequency is used
> for each of the derived terms ?

Not directly, I think - if it's useful, a "SynonymPostList" isn't too hard to write. You can probably even use the "correct" combined term frequency by generating terms from the stems and using those to get the term frequency. Or just use an estimated term frequency based on probabilities:

http://www.xapian.org/cgi-bin/bugzilla/show_bug.cgi?id=50

Longer term, I'm interested in the idea of stemming at search time (at least as an option). It has several benefits, such as allowing an exact word search without having to index "raw" terms too, and allowing choice of stemming language at search time.

Cheers,
Olly
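The two options for a combined term frequency can be sketched as follows. This is only an illustration with invented posting lists, not Xapian code: the exact value is the size of the union of the variants' posting lists, and the estimate assumes the variants occur independently. That independence estimate is one plausible reading of "based on probabilities" and is not necessarily what the bug report describes.

```python
# Toy posting lists (assumed data): term -> ids of documents containing it.
POSTINGS = {
    "floored": {1, 2},
    "flooring": {2, 3, 4, 5, 6, 7},
    "floorings": {7},
    "floors": {1, 3, 4, 8, 9},
}
N_DOCS = 100  # assumed collection size


def exact_combined_freq(postings, terms):
    """Number of documents containing at least one of the terms."""
    return len(set().union(*(postings[t] for t in terms)))


def estimated_combined_freq(freqs, n_docs):
    """Estimate of the same number from per-term frequencies alone.

    Assumes the terms occur independently: the probability that a document
    contains none of them is the product of (1 - n_i / N).
    """
    p_none = 1.0
    for n in freqs:
        p_none *= 1.0 - n / n_docs
    return n_docs * (1.0 - p_none)


if __name__ == "__main__":
    terms = sorted(POSTINGS)
    freqs = [len(POSTINGS[t]) for t in terms]
    print("sum of frequencies:", sum(freqs))
    print("exact combined    :", exact_combined_freq(POSTINGS, terms))
    print("estimated combined:", round(estimated_combined_freq(freqs, N_DOCS), 2))
```

The exact count is never larger than the sum of the individual frequencies, which is the "same or less" point above. The estimate needs no pass over the posting lists, but it overshoots when the variants tend to occur in the same documents, as they do here.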