thr3ads.net - Xapian discuss - Search requests should ignore accents (C++ API)? [Jul 2018]


Kim Walisch

2018-Jul-25 14:33 UTC

Search requests should ignore accents (C++ API)?

Hi,

I am using libxapian in a C++ project (hence I am using Xapian's C++ API)
and some user has requested that search requests should ignore accents.
E.g. when the user searches for "Herr Müller" he expects that "Herr Muller"
is also a search hit.

Is this possible in Xapian?
Do you have any links to the documentation of that feature?

Thanks for your help,
Kim

Olly Betts

2018-Jul-25 22:42 UTC

On Wed, Jul 25, 2018 at 04:33:58PM +0200, Kim Walisch wrote:
> I am using libxapian in a C++ project (hence I am using Xapian's C++ API)
> and some user has requested that search requests should ignore accents.
> E.g. when the user searches for "Herr Müller" he expects that "Herr Muller"
> is also a search hit.

Simply stripping accents can be a reasonable normalisation in some
languages, but it's not always appropriate.  The most obvious problem
is for languages where there are words with different meanings which
differ in spelling only by their accents.

There are also languages where you don't just drop the accent if
you aren't able to write it.  German is actually an example of that -
you'd write "Mueller" if you weren't able to write the accent, not
"Muller".
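
To make the German convention concrete, here is a minimal sketch of a language-specific folding step that could be applied to text before it is indexed and to query strings before they are parsed. The function name `transliterate_german` is hypothetical and not part of Xapian, and the table only covers the German umlauts and ß; it assumes UTF-8 input.

```cpp
#include <cstddef>
#include <string>
#include <utility>

// Hypothetical helper (not a Xapian API): replace German umlauts and
// sharp s in a UTF-8 string with their conventional ASCII spellings.
std::string transliterate_german(const std::string& utf8) {
    // UTF-8 byte sequences are written as hex escapes so that the
    // encoding of the source file doesn't matter.
    static const std::pair<const char*, const char*> table[] = {
        {"\xC3\xA4", "ae"},  // U+00E4
        {"\xC3\xB6", "oe"},  // U+00F6
        {"\xC3\xBC", "ue"},  // U+00FC
        {"\xC3\x84", "Ae"},  // U+00C4
        {"\xC3\x96", "Oe"},  // U+00D6
        {"\xC3\x9C", "Ue"},  // U+00DC
        {"\xC3\x9F", "ss"},  // U+00DF
    };
    std::string out;
    out.reserve(utf8.size());
    for (std::size_t i = 0; i < utf8.size();) {
        bool replaced = false;
        for (const auto& entry : table) {
            // Every table key is exactly two bytes long.
            if (utf8.compare(i, 2, entry.first) == 0) {
                out += entry.second;
                i += 2;
                replaced = true;
                break;
            }
        }
        if (!replaced) out += utf8[i++];
    }
    return out;
}
```

With this, "Müller" and "Mueller" fold to the same string, but "Muller" still does not, so matching all three would need an additional accent-dropping variant.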

> Is this possible in Xapian?

Some of the stemmers normalise accents in some or all cases.  That only
helps when the stemmed form is being matched though, and is less useful
for real names.

> Do you have any links to the documentation of that feature?

People often use https://www.nongnu.org/unac/ for this, though it
doesn't look like it's very actively maintained (or else there's a
new home page for it I couldn't trivially find).  That seems to
just drop the umlaut though.

There's also g_str_to_ascii () in glib:

https://developer.gnome.org/glib/stable/glib-String-Utility-Functions.html#g-str-to-ascii

That seems to be a bit too heavy a hammer for normalising text for
search though as you really don't want pure ASCII out in every case
(particularly for languages which don't use the Latin alphabet).

ICU is another option - there's an example of removing accents by
decomposing, removing non-spacing marks, then recomposing here:

http://userguide.icu-project.org/transforms/general

But again, that would just drop the umlaut.
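
As a rough illustration of the middle step (removing non-spacing marks), the sketch below drops code points in the Combining Diacritical Marks block (U+0300 to U+036F) from a UTF-8 string. It is not the ICU example: the name `strip_combining_marks` is hypothetical, the input is assumed to be valid UTF-8 that has already been decomposed (for instance by ICU), recomposition is left out, and real non-spacing marks cover more than this one block.

```cpp
#include <cstddef>
#include <string>

// Hypothetical helper: remove U+0300..U+036F from an already-decomposed
// (NFD) UTF-8 string. Precomposed characters are left untouched.
std::string strip_combining_marks(const std::string& nfd_utf8) {
    std::string out;
    out.reserve(nfd_utf8.size());
    for (std::size_t i = 0; i < nfd_utf8.size(); ++i) {
        const unsigned char lead = static_cast<unsigned char>(nfd_utf8[i]);
        if ((lead == 0xCC || lead == 0xCD) && i + 1 < nfd_utf8.size()) {
            const unsigned char trail =
                static_cast<unsigned char>(nfd_utf8[i + 1]);
            // U+0300..U+033F encode as CC 80..CC BF,
            // U+0340..U+036F encode as CD 80..CD AF.
            const unsigned char last = (lead == 0xCC) ? 0xBF : 0xAF;
            if (trail >= 0x80 && trail <= last) {
                ++i;  // skip the trail byte as well as the lead byte
                continue;
            }
        }
        out += nfd_utf8[i];
    }
    return out;
}
```

Given "u" followed by U+0308 (combining diaeresis), this yields a plain "u", which is the "Muller" result described above rather than "Mueller".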

Cheers,
    Olly

Kim Walisch

2018-Jul-26 06:41 UTC

Thanks for your detailed answer!

Kim

