thr3ads.net - Xapian discuss - Searching in multiple databases with different stemming [Aug 2022]

Emmanuel Engelhart

2022-Aug-13 15:13 UTC

Searching in multiple databases with different stemming

Hi

These days we are discovering the challenges and joys of searching across
multiple databases. Xapian is great because it lets us do so in a single
run, which means it merges the results by itself.

In our case, our indexes can be in different languages and are stemmed
accordingly. This means we have, for example, one index in French and
another one in English.

Unfortunately, the search request can be stemmed either in English or in
French, not both. This means that if an unstemmed pattern exists in both
languages (French and English share a lot of words), the stemmed version
will be for one language only, and therefore we will get results from one
database only.

For the moment, it is not really clear how we should deal with this
problem/limitation. Any ideas? Would it be possible to properly merge
two MSets (resulting from two search requests)?

Regards
Kelson

-- 
Kiwix - Wikipedia Offline & more
* Web: https://kiwix.org/
* Twitter: https://twitter.com/KiwixOffline
* Wiki: https://wiki.kiwix.org/

Olly Betts

2022-Aug-14 23:27 UTC

On Sat, Aug 13, 2022 at 05:13:56PM +0200, Emmanuel Engelhart wrote:
> Unfortunately, the search request can either be stemmed in English or in
> French. Which means that if an unstemmed pattern might exist in both
> languages (French and English share a lot of words), the stemmed version
> will be for one language only and therefore we will get results for one
> database.
>
> For the moment, this is not really clear how we should deal with this
> problem/limitation. Any idea? Would that be possible to merge properly two
> Msets (resulting of two search requests)?

If you're using a stemming strategy which indexes unstemmed terms as
well (which you probably are - it's the default, and if you don't then
exact phrase searches aren't possible) then one option is to disable
stemming when searching such combinations of databases.

You lose the benefits of stemming, but also avoid issues where e.g. the
English stemmer creates an undesirable false match against an unrelated
word stemmed by the French stemmer to the same combination of
characters.

Or you can include the stemmer language in the prefix added to stemmed
terms, and then parse the query with each stemmer and combine with OP_OR
- the query optimiser will see that none of the stemmed English terms
are present in the French database and cull the useless part of the
query early on, so effectively you end up just running one version of
the query on each database.

Or you can add a `Lfr` term to every document in the French database
and `Len` to every document in the English one, and search for:

    Query(OP_FILTER, query_parsed_with_en_stemmer, Query("Len")) |
    Query(OP_FILTER, query_parsed_with_fr_stemmer, Query("Lfr"))

Again the query optimiser should simplify that to just run each version
of the parsed query by itself on the appropriate database.

The development version contains code to merge MSet objects which is
used for combining results from remotes, but 1.4 uses a different
approach for that and doesn't have such code.  It's not a public API,
but perhaps could be.  The tricky part of using it properly is that the
weights need to be scaled to be compatible for it to give correct
results (internally that's done for you for remote searches).
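
As a rough illustration of that last point, here is a minimal Python sketch of merging two already-ranked result lists once their weights have been put on a common scale. It does not use Xapian's internal merging code or any Xapian API: `merge_ranked`, the `(weight, doc_id)` tuples and the per-list `scales` factors are all hypothetical names, and how to derive a correct scale factor for each database is exactly the tricky part described above, which this sketch leaves open.

```python
import heapq
from itertools import islice
from typing import List, Sequence, Tuple

Hit = Tuple[float, str]  # (weight, document identifier); hypothetical shape


def merge_ranked(result_lists: Sequence[Sequence[Hit]],
                 scales: Sequence[float],
                 limit: int) -> List[Hit]:
    """Merge hit lists that are each sorted by descending weight.

    Each list's weights are first multiplied by its (assumed, positive)
    scale factor so that weights from different databases are comparable.
    """
    if len(result_lists) != len(scales):
        raise ValueError("need exactly one scale factor per result list")
    if any(s <= 0 for s in scales):
        raise ValueError("scale factors must be positive to preserve ranking")

    scaled = [
        [(weight * scale, doc_id) for weight, doc_id in hits]
        for hits, scale in zip(result_lists, scales)
    ]
    # heapq.merge lazily interleaves the pre-sorted lists; reverse=True
    # because the inputs are in descending order of weight.
    merged = heapq.merge(*scaled, key=lambda hit: hit[0], reverse=True)
    return list(islice(merged, limit))


if __name__ == "__main__":
    english_hits = [(2.0, "en:1"), (1.0, "en:2")]
    french_hits = [(3.0, "fr:1"), (0.5, "fr:2")]
    # Made-up scale factors, purely for illustration.
    for weight, doc_id in merge_ranked([english_hits, french_hits],
                                       [1.0, 0.5], 3):
        print(f"{weight:.2f} {doc_id}")
```

With all scale factors set to 1.0 this reduces to a plain merge by raw weight, which only gives a sensible ranking if the two sets of weights are already comparable.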

Cheers,
    Olly
