I was inspired by the following exchange, but because the topic drifted, I have started a new thread:

Olly Betts wrote:
> On Wed, Mar 29, 2006 at 04:14:50PM +0100, James Aylett wrote:
>> Stemming in general is actually harmful.
>
> That's a bit strong.
>
> TREC tests and the like provide a lot of evidence that stemming improves
> retrieval. It's true that it can be harmful in cases when words that are
> unrelated (or not closely related enough) get conflated, but then *NOT*
> conflating words is also harmful in many cases and on balance stemming is
> a win.

I am interested in one special kind of stemming. Say my user queries for "protein". A document might say "non-protein". Will Xapian match it? Is it possible to disable such matches?

Sorry, I still don't have Omega running (reasons explained in my next email question).

-- 
Peter Masiar, Yale Center for Medical Informatics

A: Because it messes up the flow of reading.
Q: Why is top-posting often frowned upon?

On Wed, Mar 29, 2006 at 12:18:45PM -0500, Peter Masiar wrote:

> Say, my user queries for "protein". Document might say "non-protein".
> Will xapian match it? Is it possible to disable such matches?

Currently (I believe - Olly may need to correct me) what will happen is that both "non" and "protein" will be generated as terms (well, they'll be stemmed too), but someone searching for "non-protein" will generate a PHRASE search "non" PHRASE(n) "protein" where n is something appropriate (probably 2?).

So searching for "protein" will find anything containing "non-protein", which isn't always what you want. (Probably isn't very often what you want.)

What you probably would need if you wanted to avoid this would be to generate "non-protein" as a term. ("protein" stemmed is still "protein" in our English stemmer.)

J

-- 
/--------------------------------------------------------------------------\
  James Aylett                                                  xapian.org
  james@tartarus.org                               uncertaintydivision.org
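
To make the difference concrete, here is a rough sketch in plain Python of the two term-generation choices described above. It is a toy inverted index, not Xapian's code or API: the names split_terms, hyphen_terms, build_index and search are made up for illustration, the sample documents are invented, and stemming is left out because "protein" stems to itself anyway.

```python
import re

def split_terms(text):
    # Assumed behaviour: break on anything that is not a letter or digit,
    # so "non-protein" becomes the two terms "non" and "protein".
    return re.findall(r"[a-z0-9]+", text.lower())

def hyphen_terms(text):
    # Alternative: keep hyphenated compounds whole,
    # so "non-protein" is indexed as a single term.
    return re.findall(r"[a-z0-9]+(?:-[a-z0-9]+)*", text.lower())

def build_index(docs, tokenize):
    # Map each term to the set of document ids that contain it.
    index = {}
    for doc_id, text in docs.items():
        for term in tokenize(text):
            index.setdefault(term, set()).add(doc_id)
    return index

def search(index, term):
    # Look up one term; return the matching document ids in order.
    return sorted(index.get(term, set()))

# Invented sample documents, for illustration only.
docs = {
    1: "High protein diet",
    2: "A non-protein nitrogen source",
}

split_index = build_index(docs, split_terms)
hyphen_index = build_index(docs, hyphen_terms)

print(search(split_index, "protein"))       # [1, 2]
print(search(hyphen_index, "protein"))      # [1]
print(search(hyphen_index, "non-protein"))  # [2]
```

The trade-off in this sketch is that once "non-protein" is indexed only as a single term, a search for "protein" no longer reaches document 2 at all, so the query side has to generate terms the same way as the indexer.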