thr3ads.net - Xapian devel - Possible bug using FLAG_WORD_BREAKS with fullwidth Unicode codepoints [Jan 2024]

Robert Stepanek

2024-Jan-08 13:01 UTC

Possible bug using FLAG_WORD_BREAKS with fullwidth Unicode codepoints

On Sun, Jan 7, 2024, at 7:45 PM, Olly Betts wrote:
> I've restarted trac.

I've now created a pull request: https://github.com/xapian/xapian/pull/329
Should I create a trac issue, too?

> Assuming the latter is valid, just removing this block (or removing the
> parts of it which are Lu or Ll) should fix the problem as then
> tokenisation will switch mode - I tried this and it fixes your case at
> least:

Removing the whole block will cause word-breaker to not correctly handle
halfwidth Katakana, such as "??????????" (which it would treat as a
single term, whereas it should be two: ?????? and ????).

My pull request causes word-breaker to treat only halfwidth Katakana and Hangul
codepoints as unbroken script, and to treat Latin characters, numbers, symbols
and punctuation as broken script. There are a couple of unit tests that check
for this.

diff --git a/xapian-core/queryparser/word-breaker.cc b/xapian-core/queryparser/word-breaker.cc
index 8108523ccd53..6122dcdccc97 100644
--- a/xapian-core/queryparser/word-breaker.cc
+++ b/xapian-core/queryparser/word-breaker.cc
@@ -102,8 +102,10 @@ is_unbroken_script(unsigned p)
        0xF900 - 1, 0xFAFF,
        // FE30..FE4F; CJK Compatibility Forms
        0xFE30 - 1, 0xFE4F,
-       // FF00..FFEF; Halfwidth and Fullwidth Forms
-       0xFF00 - 1, 0xFFEF,
+       // FF00..FF60: Fullwidth Numbers, Latin Characters, Punctuation
+       // FF61..FF64: Halfwidth Punctuation
+       0xFF65 - 1, 0xFFDC, // Halfwidth Katakana and Hangul
+       // FFE0..FFEF; Fullwidth and Halfwidth Symbols
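
To illustrate how a "start - 1, end" table like the one in this hunk can be
consulted, here is a minimal sketch. It assumes the lookup is a binary search
for the first entry >= the codepoint, with an odd offset meaning "inside an
unbroken range". The table holds only the ranges visible above, and the name
in_unbroken_range is hypothetical, not the function from word-breaker.cc.

```cpp
#include <algorithm>
#include <iterator>

// Sketch only: just the ranges visible in the patch, not the full table.
// Each range is stored as "start - 1, end", so an entry at an even offset
// ends a run of ordinary codepoints and an entry at an odd offset ends a
// run of unbroken-script codepoints.
static const unsigned splits[] = {
    0xF900 - 1, 0xFAFF,  // CJK Compatibility Ideographs
    0xFE30 - 1, 0xFE4F,  // CJK Compatibility Forms
    0xFF65 - 1, 0xFFDC,  // Halfwidth Katakana and Hangul
};

// Hypothetical helper: true if p lies inside one of the ranges above.
bool in_unbroken_range(unsigned p)
{
    // Find the first entry >= p.  If it sits at an odd offset, p is
    // inside a range; at an even offset (or past the end), it is not.
    const unsigned* it =
        std::lower_bound(std::begin(splits), std::end(splits), p);
    return ((it - std::begin(splits)) & 1) != 0;
}
```

With this table, fullwidth Latin "A" (U+FF21) falls outside every range while
halfwidth Katakana "A" (U+FF71) falls inside the last one, which is the split
the patch is after.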

The fullwidth "????? ??????" test suggests to me that either Xapian
should allow for Unicode normalization, or application developers must take care
of this before indexing.

Olly Betts

2024-Jan-09 02:28 UTC

Possible bug using FLAG_WORD_BREAKS with fullwidth Unicode codepoints

On Mon, Jan 08, 2024 at 02:01:46PM +0100, Robert Stepanek wrote:
> Removing the whole block will cause word-breaker to not correctly
> handle halfwidth Katakana, such as "??????????" which it would treat
> as a single term, whereas it should be two: ??????and  ????).
> 
> My pull request causes word-breaker to only handle halfwidth Katakana
> and Hangul codepoints as unbroken script and treats Latin characters,
> numbers, symbols and punctuation as broken script. There's a couple of
> unit tests that check for this.

Thanks, that looks good - now merged.

I think we should probably backport this to 1.4 - it's a behaviour
change, but limited to text containing these fullwidth Latin characters,
and the change fixes a bug.  The awkward wrinkle is that you need to
reindex to get the full benefits of the fix.

Did you already check the other ranges for cased letters?  I can, but if
you already have, there's not much point.

> The fullwidth "????? ??????" tests suggests to me that either Xapian
> should allow for Unicode normalization, or application developers must
> take care of this before indexing.

We currently leave it to the API user to normalise Unicode
representation, though maybe we should provide support for doing so.
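
As a rough illustration of what normalising before indexing could look
like, here is a minimal sketch that folds fullwidth ASCII variants
(U+FF01..U+FF5E) and the ideographic space (U+3000) to their ASCII
counterparts. This covers only a small subset of what NFKC does, the
function names are hypothetical, and nothing here is part of Xapian's API.
A real application would more likely use a full normalisation library.

```cpp
#include <string>

// Hypothetical helper: fold one codepoint.  Fullwidth ASCII variants sit
// at a fixed offset of 0xFEE0 above their ASCII counterparts.
char32_t fold_fullwidth_char(char32_t c)
{
    if (c >= 0xFF01 && c <= 0xFF5E) return c - 0xFEE0;  // U+FF21 -> 'A'
    if (c == 0x3000) return U' ';  // ideographic space -> ASCII space
    return c;  // everything else is left untouched
}

// Hypothetical helper: fold every codepoint of a UTF-32 string.
std::u32string fold_fullwidth_text(const std::u32string& text)
{
    std::u32string out;
    out.reserve(text.size());
    for (char32_t c : text) out.push_back(fold_fullwidth_char(c));
    return out;
}
```

Text would be passed through fold_fullwidth_text() before being handed to
the indexer, and queries would need the same treatment so that both sides
agree. Halfwidth Katakana is deliberately left alone in this sketch.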

Cheers,
    Olly