Emmanuel Engelhart
2022-Aug-13 15:13 UTC
Searching in multiple databases with different stemming
Hi,

These days we are discovering the challenges and joys of searching in multiple databases. Xapian is great because it allows us to do so in one run, which means it combines the results by itself.

In our case, our indexes can be in different languages and are stemmed accordingly. For example, we have one index in French and another one in English.

Unfortunately, the search request can be stemmed either in English or in French. This means that if an unstemmed pattern exists in both languages (French and English share a lot of words), the stemmed version will be for one language only, and therefore we will get results from only one database.

For the moment, it is not really clear how we should deal with this problem/limitation. Any ideas? Would it be possible to properly merge two MSets (resulting from two search requests)?

Regards
Kelson
-- 
Kiwix - Wikipedia Offline & more
* Web: https://kiwix.org/
* Twitter: https://twitter.com/KiwixOffline
* Wiki: https://wiki.kiwix.org/
On Sat, Aug 13, 2022 at 05:13:56PM +0200, Emmanuel Engelhart wrote:
> Unfortunately, the search request can either be stemmed in English or in
> French. Which means that if an unstemmed pattern might exist in both
> languages (French and English share a lot of words), the stemmed version
> will be for one language only and therefore we will get results for one
> database.
>
> For the moment, this is not really clear how we should deal with this
> problem/limitation. Any idea? Would that be possible to merge properly two
> Msets (resulting of two search requests)?

If you're using a stemming strategy which indexes unstemmed terms as well (which you probably are - it's the default, and if you don't then exact phrase searches aren't possible) then one option is to disable stemming when searching such combinations of databases. You lose the benefits of stemming, but also avoid issues where e.g. the English stemmer creates an undesirable false match against an unrelated word stemmed by the French stemmer to the same combination of characters.

Or you can include the stemmer language in the prefix added to stemmed terms, and then parse the query with each stemmer and combine with OP_OR - the query optimiser will see that none of the stemmed English terms are present in the French database and cull the useless part of the query early on, so effectively you end up just running one version of the query on each database.

Or you can add a `Lfr` term to every document in the French database and `Len` to every document in the English one, and search for:

    Query(OP_FILTER, query_parsed_with_en_stemmer, Query("Len")) |
    Query(OP_FILTER, query_parsed_with_fr_stemmer, Query("Lfr"))

Again the query optimiser should simplify that to just run each version of the parsed query by itself on the appropriate database.

The development version contains code to merge MSet objects which is used for combining results from remotes, but 1.4 uses a different approach for that and doesn't have such code. It's not a public API, but perhaps could be. The tricky part of using it properly is that the weights need to be scaled to be compatible for it to give correct results (internally that's done for you for remote searches).

Cheers,
    Olly
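
To illustrate the point about scaling weights before merging, here is a rough pure-Python sketch. It does not use Xapian and is not how Xapian merges MSet objects internally. The `Hit` tuple, the `merge_results` helper and the per-database scale factors are all hypothetical names made up for this illustration, and it simply assumes that a positive factor making the weights comparable is already known for each result list.

```python
import heapq
from itertools import islice
from typing import Iterable, Iterator, List, NamedTuple, Tuple


class Hit(NamedTuple):
    """One search result (hypothetical stand-in for an MSet entry)."""
    weight: float  # relevance weight reported by one search
    source: str    # label of the database the hit came from
    docid: int     # document id within that database


def scaled(hits: Iterable[Hit], factor: float) -> Iterator[Hit]:
    """Multiply each weight by a positive factor, preserving order."""
    for hit in hits:
        yield hit._replace(weight=hit.weight * factor)


def merge_results(result_sets: List[Tuple[List[Hit], float]],
                  top_k: int = 10) -> List[Hit]:
    """Merge (hits, factor) pairs into one ranking of at most top_k hits.

    Each `hits` list must already be sorted by descending weight, and
    each factor is assumed to make that list's weights comparable with
    the others. How to obtain such factors is outside this sketch.
    """
    streams = [scaled(hits, factor) for hits, factor in result_sets]
    # heapq.merge lazily interleaves the pre-sorted streams.
    merged = heapq.merge(*streams, key=lambda h: h.weight, reverse=True)
    return list(islice(merged, top_k))


if __name__ == "__main__":
    english = [Hit(8.0, "en", 3), Hit(5.0, "en", 7)]
    french = [Hit(2.0, "fr", 1), Hit(1.5, "fr", 4)]
    # Made-up factors: the French weights are assumed to be on a
    # scale three times smaller than the English ones.
    for hit in merge_results([(english, 1.0), (french, 3.0)], top_k=3):
        print(hit)
```

With the made-up factors above the best French hit is ranked between the two English hits, whereas merging the raw weights would place both English hits first. That difference is why merging without compatible weights gives misleading rankings.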