thr3ads.net - dovecot - FTS Tokenization filters normalizer-icu vs lowercase [Jan 2022]

Alessio Cecchi

2022-Jan-20 16:20 UTC

FTS Tokenization filters normalizer-icu vs lowercase

Hi,

I'm trying to set up fts-flatcurve with tokenization.

What are the differences/benefits of "fts_filters = normalizer-icu" vs
"fts_filters = lowercase"?

Reading the docs, I found that normalizer-icu is described as "This is
potentially very resource intensive." and lowercase as "Supports UTF8, when
compiled with libicu".

So, is using lowercase almost the same as normalizer-icu, but faster?

FYI

To use fts-flatcurve with the Dovecot RPM packages from repo.dovecot.org,
you have to rebuild them with --with-icu --with-stemmer --with-textcat and
the related libraries.

Thanks

-- 
Alessio Cecchi
Postmaster @ http://www.qboxmail.it
https://www.linkedin.com/in/alessice

Michael Slusarz

2022-Jan-20 17:36 UTC

FTS Tokenization filters normalizer-icu vs lowercase

> On 01/20/2022 9:20 AM Alessio Cecchi <alessio at skye.it> wrote:
> 
> I'm trying to setup fts-flatcurve with tokenization.
> 
> What are the differences/benefits with "fts_filters = normalizer-icu" vs "fts_filters = lowercase"?
> 
> Reading the Doc I found about normalizer-icu "This is potentially very resource intensive." and about lowercase "Supports UTF8, when compiled with libicu".
> 
> So, using lowercase is almost the same that normalizer-icu but faster?

No, these are 2 different actions.

Lowercase tries to use language rules to map characters to a
"lowercase" equivalent, which is character/language dependent.

Normalization takes a string and reduces it to a unique, normalized form that
can be compared directly to other normalized strings.  Unicode text, for
example, can contain strings that display identically to the user but consist
of very different byte data.  A more complicated glyph can be encoded either as
a single precomposed code point or as a sequence of code points (a base
character plus combining marks) that, when combined, display as the identical
character.  For example, "é" can be the single code point U+00E9 or the letter
"e" followed by the combining acute accent U+0301.
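
To make the "same display, different bytes" point concrete, here is a minimal
sketch in Python. It uses Python's standard unicodedata module rather than ICU
or Dovecot's own filter code, so it only illustrates the concept; the helper
name nfc and the sample strings are made up for this example.

```python
import unicodedata

# Two ways to write "é" that render identically on screen.
precomposed = "\u00e9"   # single code point: LATIN SMALL LETTER E WITH ACUTE
combining = "e\u0301"    # "e" followed by COMBINING ACUTE ACCENT


def nfc(text):
    """Return text in Unicode Normalization Form C (composed form)."""
    return unicodedata.normalize("NFC", text)


# The raw strings differ, both as code points and as UTF-8 bytes.
print(precomposed == combining)
print(precomposed.encode("utf-8"), combining.encode("utf-8"))

# After normalization, the two strings compare equal.
print(nfc(precomposed) == nfc(combining))
```

Without normalization, a search index would treat these two spellings of the
same visible word as unrelated tokens.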

Normalization is a very complicated topic. 
https://en.wikipedia.org/wiki/Unicode_equivalence might help with further
understanding.

The ICU library deals with general internationalization support, and these two
filters use different parts of that library to do different things.  They
are not replacements for each other; they are complementary.  You could
normalize a string and then lowercase it, for example.
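
As a rough illustration of why the two steps complement each other, the Python
sketch below applies them separately and together. It is an assumption-laden
stand-in: it uses Python's unicodedata module and str.lower(), not ICU or the
actual Dovecot filters, and the function names and sample strings are
hypothetical.

```python
import unicodedata


def normalize(text):
    """Stand-in for a normalization step: compose to NFC."""
    return unicodedata.normalize("NFC", text)


def lowercase(text):
    """Stand-in for a lowercase step."""
    return text.lower()


def normalize_then_lowercase(text):
    """Chain both steps: normalize first, then lowercase."""
    return lowercase(normalize(text))


a = "CAF\u00c9"    # upper case, precomposed "É"
b = "cafe\u0301"   # lower case, "e" + combining acute accent

# Lowercasing alone leaves the two encodings of the accent different.
print(lowercase(a) == lowercase(b))

# Normalizing alone leaves the case difference.
print(normalize(a) == normalize(b))

# Only the combination maps both inputs to the same token.
print(normalize_then_lowercase(a) == normalize_then_lowercase(b))
```

Neither step alone makes the two inputs match; only applying both does.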

michael

> 
> 
> FYI
> 
> for using fts-flatcurve with dovecot RPM packages from repo.dovecot.org you have to rebuild with --with-icu --with-stemmer --with-textcat and related library.
> 
> Thanks
> 
> --
> Alessio Cecchi
> Postmaster @ http://www.qboxmail.it
> https://www.linkedin.com/in/alessice