Jean-Francois Dockes
2006-Feb-09 13:22 UTC
[Xapian-discuss] Re: Xapian-discuss Digest, Vol 21, Issue 11
Olly Betts wrote:
> Alternatively, you could stem nothing at index time and then for search
> terms which you want to stem, stem them, and then run them through an
> "unstemming" algorithm to produce a list of terms they could have come
> from. Then OR this list together. Unfortunately nobody has written
> the "unstemmer" yet. Also this means more work at search time than
> the first approach, but that may not really matter. I've not tried
> the idea, so I can't say for sure.

What I am doing in Recoll is to use an auxiliary database linking each stem with the set of terms from the corpus that reduce to this stem. The unstemming algorithm is then just to look up that set by stem value. You don't need a general unstemming algorithm, you just want the expansion to the terms that actually exist in the original documents.

So, yes, the "unstemming" algorithm has been written and can be found on http://www.recoll.org :)

There is a pass over the "all terms" list at the end of indexing to compute the unstemming database (always recreated from scratch, maybe this could be improved).

The time to create the unstemming database is not high, but this might be different if the original db was huge, I guess. My test db is around 600 MB, and I think that the time is much less than 1 minute. The stemming database size is negligible compared to the main one (around 0.5 percent).

JF
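To make the idea concrete, here is a minimal sketch in Python of the stem-to-terms lookup described above. It is not Recoll's actual code: `toy_stem`, `build_unstem_db` and `expand` are hypothetical names, the stemmer is a crude suffix stripper standing in for a real one, and a plain dict stands in for the auxiliary database.

```python
from collections import defaultdict


def toy_stem(term):
    """Hypothetical stand-in for a real stemmer: strips a few English suffixes."""
    for suffix in ("ing", "ed", "es", "s"):
        if term.endswith(suffix) and len(term) - len(suffix) >= 3:
            return term[: -len(suffix)]
    return term


def build_unstem_db(all_terms, stem=toy_stem):
    """One pass over the index's term list: map each stem to the terms reducing to it."""
    db = defaultdict(set)
    for term in all_terms:
        db[stem(term)].add(term)
    return dict(db)


def expand(query_term, unstem_db, stem=toy_stem):
    """Stem the query term, then return the corpus terms that share that stem.

    Assumption of this sketch: if no corpus term shares the stem,
    fall back to the query term itself.
    """
    return sorted(unstem_db.get(stem(query_term), {query_term}))


# The index holds unstemmed terms; the auxiliary map is built after indexing.
corpus_terms = ["index", "indexed", "indexing", "indexes", "search", "searches"]
unstem_db = build_unstem_db(corpus_terms)

# The expansion is then ORed together at search time.
print(" OR ".join(expand("indexing", unstem_db)))
# prints: index OR indexed OR indexes OR indexing
```

Because the expansion only ever returns terms that were seen in the corpus, the OR list stays short, and the map can be rebuilt from scratch with a single pass over the term list.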