thr3ads.net - Xapian devel - Possible bug using FLAG_WORD_BREAKS with fullwidth Unicode codepoints [Jan 2024]

Olly Betts

2024-Jan-09 02:28 UTC

Possible bug using FLAG_WORD_BREAKS with fullwidth Unicode codepoints

On Mon, Jan 08, 2024 at 02:01:46PM +0100, Robert Stepanek wrote:
> Removing the whole block will cause word-breaker to not correctly
> handle halfwidth Katakana, such as "??????????" which it would treat
> as a single term, whereas it should be two: ?????? and ????).
>
> My pull request causes word-breaker to only handle halfwidth Katakana
> and Hangul codepoints as unbroken script and treats Latin characters,
> numbers, symbols and punctuation as broken script. There's a couple of
> unit tests that check for this.

Thanks, that looks good - now merged.
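
To illustrate the distinction described in the quoted text, here is a minimal Python sketch. It is not Xapian's word-breaking code: the helper name `is_unbroken_halfwidth` is hypothetical, and it assumes that Unicode character names are an adequate stand-in for the halfwidth Katakana and Hangul codepoints the pull request singles out.

```python
import unicodedata


def is_unbroken_halfwidth(ch: str) -> bool:
    """Return True for halfwidth Katakana or halfwidth Hangul characters.

    Hypothetical helper for illustration only: it checks the Unicode
    character name, not the codepoint ranges Xapian itself tests.
    """
    name = unicodedata.name(ch, "")
    return name.startswith(("HALFWIDTH KATAKANA", "HALFWIDTH HANGUL"))


# Halfwidth Katakana KA and halfwidth Hangul KIYEOK count as unbroken script.
print(is_unbroken_halfwidth("\uFF76"), is_unbroken_halfwidth("\uFFA1"))
# Fullwidth Latin A and fullwidth digit one do not.
print(is_unbroken_halfwidth("\uFF21"), is_unbroken_halfwidth("\uFF11"))
```

Under this sketch, fullwidth Latin letters, digits and punctuation fall into the "broken script" side, which matches the behaviour described for the merged change.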

I think we probably should backport this to 1.4 - it's a behaviour
change, but limited to text containing these fullwidth Latin characters
and the change fixes a bug.  The awkward wrinkle is that you need to
reindex to get the full benefits of the fix.

Did you already check the other ranges for cased letters?  I can but if
you have already there's not much point.

> The fullwidth "????? ??????" tests suggests to me that
> either Xapian should allow for Unicode normalization, or application
> developers must take care of this before indexing.

We currently leave it to the API user to normalise Unicode
representation, though maybe we should provide support for doing so.
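
As a rough idea of what normalising on the application side could look like, here is a minimal Python sketch using only the standard library. The function name `normalise_for_indexing` and the choice of NFKC are assumptions for illustration; the thread does not say which normalisation form an application should use.

```python
import unicodedata


def normalise_for_indexing(text: str) -> str:
    """Apply NFKC before handing text to the indexer (assumed choice of form).

    NFKC applies compatibility mappings, so fullwidth Latin letters become
    their ASCII equivalents and halfwidth Katakana become standard Katakana.
    """
    return unicodedata.normalize("NFKC", text)


# Fullwidth "Hello" (U+FF28 U+FF45 U+FF4C U+FF4C U+FF4F) becomes ASCII.
print(normalise_for_indexing("\uFF28\uFF45\uFF4C\uFF4C\uFF4F"))  # prints "Hello"
```

Whatever form is chosen, the same function would need to be applied to query strings too, otherwise indexed terms and query terms will not match.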

Cheers,
    Olly

Robert Stepanek

2024-Jan-10 08:02 UTC

Possible bug using FLAG_WORD_BREAKS with fullwidth Unicode codepoints

On Tue, Jan 9, 2024, at 3:28 AM, Olly Betts wrote:
> Thanks, that looks good - now merged.

Thanks!

> Did you already check the other ranges for cased letters?  I can but if
> you have already there's not much point.

I did not. If you find time, that'd be great. Otherwise I can make room for
it in the next few days.

> > The fullwidth "????? ??????" tests suggests to me that
> > either Xapian should allow for Unicode normalization, or application
> > developers must take care of this before indexing.
> 
> We currently leave it to the API user to normalise Unicode
> representation, though maybe we should provide support for doing so.

Thinking some more about this, I think it's sane to leave this out of
Xapian, unless some bookkeeping is also added within Xapian to record which
normalisation was applied to terms. That bookkeeping can get complex for
sub-databases or mixed normalisations within one database.
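
To make the bookkeeping concern concrete, here is a minimal Python sketch of tracking the normalisation form on the application side. Everything in it is an assumption for illustration: a plain `dict` stands in for whatever per-database metadata store the application uses, and the key `unicode_normalisation` and both function names are hypothetical.

```python
import unicodedata

NORMALISATION_KEY = "unicode_normalisation"  # hypothetical metadata key


def index_text(metadata: dict, text: str, form: str = "NFKC") -> str:
    """Normalise text for indexing and record which form was used.

    Refuses to mix forms within one index, which is the "mixed
    normalisations within one database" problem in miniature.
    """
    recorded = metadata.setdefault(NORMALISATION_KEY, form)
    if recorded != form:
        raise ValueError(f"index already uses {recorded}, not {form}")
    return unicodedata.normalize(form, text)


def prepare_query(metadata: dict, query: str) -> str:
    """Normalise a query with the form recorded for the index, if any."""
    form = metadata.get(NORMALISATION_KEY)
    return unicodedata.normalize(form, query) if form else query


meta = {}
indexed = index_text(meta, "\uFF21")    # fullwidth Latin A, stored as "A"
queried = prepare_query(meta, "\uFF21")  # the query gets the same treatment
print(indexed == queried)
```

Even this small sketch has to decide what to do when forms disagree. Searching several sub-databases at once would need one such record per database, which hints at why this is simpler to handle in the application than inside Xapian.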
