Sean McCleary
2009-Jul-19 15:22 UTC
[Xapian-discuss] Stemming, stopping, and multiple languages
Hello all,

I'm just getting started with Xapian and have a question about stemming, stop words, and multiple languages.

I have two Xapian databases, one containing documents in English and another containing documents in German. When I index them, I don't use a stemmer or stop words, as I've read that it's considered best practice to apply a stemmer and stop words at the time of searching, not indexing.

So when I'm searching one database at a time, it's easy: load a stemmer for the appropriate language, load the stop words.

When I want to search through both at once, I can easily load both databases. But it seems that the stemmer and stop words are applied to the query, not the databases. So if I had, for example, the word "die" (which means "the") in my list of German stop words, it would also exclude the word "die" (as in, "cease to be alive") from any English documents as well, right? The same problem applies to the stemmer -- I can only load one for one of the languages.

Is there any way around this? Or does this mean I need to apply stemmers and stop words at the time of indexing to get this to work?

Thanks for any advice,
Sean
John Leach
2009-Jul-19 23:26 UTC
[Xapian-discuss] Stemming, stopping, and multiple languages
On Sun, 2009-07-19 at 17:22 +0200, Sean McCleary wrote:
> Hello all, I'm just getting started with Xapian and have a question
> about stemming, stop words, and multiple languages.

I've just been thinking myself about how to do this recently, so I'll try and help. I'm not that familiar with the internals of Xapian yet, so some details might not be totally accurate.

> So I have two Xapian databases, one containing documents in English,
> and another containing documents in German. When I index them, I
> don't use a stemmer or stop words, as I've read that it's considered
> best practice to apply a stemmer and stop words at the time of
> searching, not indexing.

Stemming usually only works if you do it both when indexing and when searching. If you stem just when searching, then you'll be searching for terms that do not exist in the database. The database itself knows nothing (well, very little) about stemming; all the magic is done at the tokenizing stage by the term generator.

For example, with no stemming at index time, the term "fishing" is stored in the database as it is. When you conduct a stemmed search, a query of "fishing" will be stemmed to "fish", which will not match the document.

Though actually, with Xapian, the term generator returns both the stemmed and unstemmed terms, so you might not have noticed the broken stemming in your case unless you were testing carefully.

> So when I'm searching one database at a time, it's easy. Load a
> stemmer for the appropriate language, load the stop words.
>
> When I want to search through both at once, I can easily load both
> databases. But it seems that the stemmer and stop words are applied
> to the query, not the databases.

Yes, you give the QueryParser the stemmer and stoppers, not the databases. The stemming and stopping are done on the query, and the resulting query is executed on the databases.

Another clue is that you set up multiple databases for search with the add_database function on a Database object. That function doesn't provide a way to give a stemmer/stopper at the same time.

> So if I had, for example, the
> word "die" (which means "the") in my list of German stop words, it
> would also exclude the word "die" (as in, "cease to be alive") from
> any English documents as well, right? The same problem applies to the
> stemmer -- I can only load one for one of the languages.

As I just learnt yesterday, the stop words are actually still tokenized, they're just not stemmed! So in this particular case, a search on "die" would be ok. But that is not really your point :)

> Is there any way around this? Or does this mean I need to apply
> stemmers and stop words at the time of indexing to get this to work?

I think you could do this with a custom stemmer class that stems a query for more than one language at a time. I'm assuming it's possible to return more than one stem to the TermGenerator; if not, I guess you'd need a custom TermGenerator instead, one that could be given multiple stemmers.

As for stoppers, with the current behaviour it's fine, as it only affects stemming (so your custom Stemmer/TermGenerator would be given appropriate stoppers too). If you wanted to fully remove stop words from the query, though, that would be more complicated. I think you'd have to know which stop words from one language are ordinary words in another, and not stop them when searching databases in both those languages, erk!

John.
http://johnleach.co.uk
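
A rough sketch of the multi-language idea discussed above, written in plain Python rather than against Xapian's API. Everything here is an assumption made for illustration: the suffix-stripping "stemmers" are toy stand-ins for real Snowball stemmers, the stop lists are tiny, and the names `LANGUAGES`, `index_terms`, `expand_query_word` and `search` are hypothetical. It mirrors the behaviour described in the thread: each document is indexed with both its unstemmed terms and prefixed stems, a stop word is kept but not stemmed, and at query time each word is expanded with one stem per language being searched.

```python
def strip_suffix(word, suffixes, min_stem=3):
    """Toy stemmer: drop the first matching suffix, keeping >= min_stem letters."""
    for suffix in suffixes:
        if word.endswith(suffix) and len(word) - len(suffix) >= min_stem:
            return word[: -len(suffix)]
    return word


# Hypothetical per-language configuration (illustrative, not real stemming rules).
LANGUAGES = {
    "english": {
        "stem": lambda w: strip_suffix(w, ("ing", "es", "s")),
        "stop": {"the", "a", "of"},
    },
    "german": {
        "stem": lambda w: strip_suffix(w, ("en", "er", "e")),
        "stop": {"die", "der", "das"},
    },
}


def index_terms(text, language):
    """Return the unstemmed terms plus 'Z'-prefixed stems for one document."""
    cfg = LANGUAGES[language]
    terms = set()
    for word in text.lower().split():
        terms.add(word)  # the unstemmed form is always indexed
        if word not in cfg["stop"]:  # stop words are kept, just not stemmed
            terms.add("Z" + cfg["stem"](word))
    return terms


def expand_query_word(word, languages):
    """Expand one query word into its unstemmed form plus one stem per language."""
    word = word.lower()
    terms = {word}
    for language in languages:
        cfg = LANGUAGES[language]
        if word not in cfg["stop"]:
            terms.add("Z" + cfg["stem"](word))
    return terms


def search(query, docs):
    """docs is a list of (doc_id, language, text); every query word must match."""
    languages = {language for _, language, _ in docs}
    hits = []
    for doc_id, language, text in docs:
        terms = index_terms(text, language)
        if all(expand_query_word(w, languages) & terms for w in query.split()):
            hits.append(doc_id)
    return hits


docs = [
    ("en1", "english", "they die young"),
    ("en2", "english", "gone fishing"),
    ("de1", "german", "die katzen schlafen"),
]
```

With this toy data, `search("die", docs)` still finds the English document because "die" is only exempted from German stemming, not removed from the query, and `search("katze", docs)` reaches the German document through the German stem even though the English stemmer leaves the word unchanged. The cost of this design is a larger query, since each word contributes one extra term per language.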