Hi,

I am using libxapian in a C++ project (hence I am using Xapian's C++ API) and some user has requested that search requests should ignore accents. E.g. when the user searches for "Herr Müller" he expects that "Herr Muller" is also a search hit.

Is this possible in Xapian? Do you have any links to the documentation of that feature?

Thanks for your help,
Kim
On Wed, Jul 25, 2018 at 04:33:58PM +0200, Kim Walisch wrote:
> I am using libxapian in a C++ project (hence I am using Xapian's C++ API)
> and some user has requested that search requests should ignore accents.
> E.g. when the user searches for "Herr Müller" he expects that "Herr Muller"
> is also a search hit.

Simply stripping accents can be a reasonable normalisation in some languages, but it's not always appropriate. The most obvious problem is for languages where there are words with different meanings which differ in spelling only by their accents.

There are also languages where you don't just drop the accent if you aren't able to write it. German is actually an example of that - you'd write "Mueller" if you weren't able to write the accent, not "Muller".

> Is this possible in Xapian?

Some of the stemmers normalise accents in some or all cases. That only helps when the stemmed form is being matched though, and is less useful for real names.

> Do you have any links to the documentation of that feature?

People often use https://www.nongnu.org/unac/ for this, though it doesn't look like it's very actively maintained (or else there's a new home page for it I couldn't trivially find). That seems to just drop the umlaut though.

There's also g_str_to_ascii() in glib:

https://developer.gnome.org/glib/stable/glib-String-Utility-Functions.html#g-str-to-ascii

That seems to be a bit too heavy a hammer for normalising text for search though, as you really don't want pure ASCII out in every case (particularly for languages which don't use the Latin alphabet).

ICU is another option - there's an example of removing accents by decomposing, removing non-spacing marks, then recomposing here:

http://userguide.icu-project.org/transforms/general

But again, that would just drop the umlaut.

Cheers,
Olly
Thanks for your detailed answer!

Kim
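
For illustration, the difference between dropping the umlaut ("Muller") and the German convention ("Mueller") discussed in this thread can be sketched in plain C++ without any of the libraries mentioned. This is only a minimal sketch under stated assumptions: `fold_accents` and `kFolds` are hypothetical names, not part of Xapian, unac, glib or ICU; the input is assumed to be UTF-8 with precomposed characters; and the table covers just the six German umlauted vowels.

```cpp
#include <cstddef>
#include <cstring>
#include <string>

// Hypothetical folding table (not from any library): UTF-8 bytes of a
// precomposed character, its accent-stripped form, and its German
// transliteration.
struct Fold {
    const char* from;
    const char* stripped;
    const char* german;
};

static const Fold kFolds[] = {
    {"\xC3\xA4", "a", "ae"},  // ä
    {"\xC3\xB6", "o", "oe"},  // ö
    {"\xC3\xBC", "u", "ue"},  // ü
    {"\xC3\x84", "A", "Ae"},  // Ä
    {"\xC3\x96", "O", "Oe"},  // Ö
    {"\xC3\x9C", "U", "Ue"},  // Ü
};

// Hypothetical helper: replaces every character found in kFolds and copies
// all other bytes through unchanged. With german_style == false the umlaut
// is simply dropped; with german_style == true an "e" is appended instead.
std::string fold_accents(const std::string& in, bool german_style) {
    std::string out;
    out.reserve(in.size());
    std::size_t i = 0;
    while (i < in.size()) {
        const Fold* hit = nullptr;
        for (const Fold& f : kFolds) {
            // compare() clips the range at the end of the string, so a
            // truncated sequence near the end simply fails to match.
            if (in.compare(i, std::strlen(f.from), f.from) == 0) {
                hit = &f;
                break;
            }
        }
        if (hit) {
            out += german_style ? hit->german : hit->stripped;
            i += std::strlen(hit->from);
        } else {
            out += in[i++];
        }
    }
    return out;
}
```

One possible use, again as an assumption rather than a recommendation from the thread, is to apply the same folding to the text at index time and to the query string at search time, or to index both folded forms next to the original so that "Muller", "Mueller" and "Müller" can all match. Real-world text would also need decomposed input (a base letter followed by a combining mark) and other languages handled, which is where ICU or a similar library comes in.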