thr3ads.net - Xapian devel - Possible bug using FLAG_WORD_BREAKS with fullwidth Unicode codepoints [Jan 2024]

Olly Betts

2024-Jan-07 18:45 UTC

Possible bug using FLAG_WORD_BREAKS with fullwidth Unicode codepoints

On Thu, Jan 04, 2024 at 05:50:22PM +0100, Robert Stepanek wrote:
> Since I am undecided yet if and how to fix this in Xapian I haven't
> come up with a pull request. Because trac currently is offline, I
> could not file a bug. I hope it's OK to post my analysis here first,
> I'll be happy to follow up reporting that bug proper later (should we
> conclude that it actually is a bug).

I've restarted trac.

> Still, I do think it is a bug for Xapian not to return a result when
> querying for a term that's verbatim in the original input and the
> database. Should you agree I will be happy to discuss how to fix this
> and might come up with a pull request once we agreed on a solution.

I think the cause is that the list of character ranges which need
word-break finding includes this block of fullwidth and halfwidth Latin
forms, coupled with an assumption that there are no lowercase vs uppercase
forms in these alphabets.

Assuming the latter is valid, just removing this block (or removing the
parts of it which are Lu or Ll) should fix the problem as then
tokenisation will switch mode - I tried this and it fixes your case at
least:

diff --git a/xapian-core/queryparser/word-breaker.cc b/xapian-core/queryparser/word-breaker.cc
index 8108523ccd53..4fabc23f4b56 100644
--- a/xapian-core/queryparser/word-breaker.cc
+++ b/xapian-core/queryparser/word-breaker.cc
@@ -103,7 +103,7 @@ is_unbroken_script(unsigned p)
 	// FE30..FE4F; CJK Compatibility Forms
 	0xFE30 - 1, 0xFE4F,
 	// FF00..FFEF; Halfwidth and Fullwidth Forms
-	0xFF00 - 1, 0xFFEF,
+	//0xFF00 - 1, 0xFFEF,
 	// 1AFF0..1AFFF; Kana Extended-B
 	// 1B000..1B0FF; Kana Supplement
 	// 1B100..1B12F; Kana Extended-A

If we're fixing it this way we should check this list for other
instances of this (and doing this would probably reveal if that
assumption is valid).
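
To illustrate why dropping the FF00..FFEF pair changes tokenisation, here is a minimal sketch of a sorted range-table lookup. It assumes the table is consulted with a binary search and that an odd result index means "inside a range", which is one natural reading of the `start - 1, end` layout in the diff. The names `in_ranges`, `kWithBlock`, `kWithoutBlock` and the two wrapper functions are hypothetical, and the tables are cut down to three entries, so this is not the actual word-breaker.cc code.

```cpp
#include <algorithm>
#include <cassert>   // only needed for the example checks at the end
#include <cstddef>

// Hypothetical cut-down tables. Each pair is (start - 1, end), sorted ascending.
static const unsigned kWithBlock[] = {
    0xF900 - 1, 0xFAFF,  // CJK Compatibility Ideographs
    0xFE30 - 1, 0xFE4F,  // CJK Compatibility Forms
    0xFF00 - 1, 0xFFEF,  // Halfwidth and Fullwidth Forms
};

static const unsigned kWithoutBlock[] = {
    0xF900 - 1, 0xFAFF,  // CJK Compatibility Ideographs
    0xFE30 - 1, 0xFE4F,  // CJK Compatibility Forms
};

// lower_bound returns the first entry >= p. Because range starts are stored
// as start - 1, that entry is a range end (odd index) exactly when p lies
// inside a range, and a range start or the table end (even index) otherwise.
template <std::size_t N>
bool in_ranges(const unsigned (&table)[N], unsigned p) {
    const unsigned* it = std::lower_bound(table, table + N, p);
    return ((it - table) & 1) != 0;
}

bool unbroken_with_block(unsigned p) { return in_ranges(kWithBlock, p); }
bool unbroken_without_block(unsigned p) { return in_ranges(kWithoutBlock, p); }

// Example checks (U+FF21 is FULLWIDTH LATIN CAPITAL LETTER A):
//   assert(unbroken_with_block(0xFF21));
//   assert(!unbroken_without_block(0xFF21));
```

With the block present, fullwidth Latin letters are classified as unbroken script; with it removed they fall through to ordinary word-based tokenisation, which is the mode switch described above.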

Cheers,
    Olly

Robert Stepanek

2024-Jan-08 13:01 UTC

Possible bug using FLAG_WORD_BREAKS with fullwidth Unicode codepoints

On Sun, Jan 7, 2024, at 7:45 PM, Olly Betts wrote:
> I've restarted trac.

I have now created a pull request: https://github.com/xapian/xapian/pull/329
Should I create a trac issue, too?

> Assuming the latter is valid, just removing this block (or removing the
> parts of it which are Lu or Ll) should fix the problem as then
> tokenisation will switch mode - I tried this and it fixes your case at
> least:
Removing the whole block will cause word-breaker to not correctly handle
halfwidth Katakana, such as "??????????" which it would treat as a
single term, whereas it should be two (?????? and ????).

My pull request makes word-breaker treat only halfwidth Katakana and Hangul
codepoints as unbroken script, and treat Latin characters, numbers, symbols and
punctuation as broken script. There are a couple of unit tests that check for
this.

diff --git a/xapian-core/queryparser/word-breaker.cc b/xapian-core/queryparser/word-breaker.cc
index 8108523ccd53..6122dcdccc97 100644
--- a/xapian-core/queryparser/word-breaker.cc
+++ b/xapian-core/queryparser/word-breaker.cc
@@ -102,8 +102,10 @@ is_unbroken_script(unsigned p)
        0xF900 - 1, 0xFAFF,
        // FE30..FE4F; CJK Compatibility Forms
        0xFE30 - 1, 0xFE4F,
-       // FF00..FFEF; Halfwidth and Fullwidth Forms
-       0xFF00 - 1, 0xFFEF,
+       // FF00..FF60: Fullwidth Numbers, Latin Characters, Punctuation
+       // FF61..FF64: Halfwidth Punctuation
+       0xFF65 - 1, 0xFFDC, // Halfwidth Katakana and Hangul
+       // FFE0..FFEF; Fullwidth and Halfwidth Symbols
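
As a rough illustration of the split this diff makes inside the Halfwidth and Fullwidth Forms block, the sketch below classifies a code point using only the boundaries named in the diff comments. The enum `FormKind` and the function `classify_halfwidth_fullwidth` are hypothetical names chosen for this example and are not part of Xapian.

```cpp
#include <cassert>   // only needed for the example checks at the end

// Hypothetical classification of code points in U+FF00..U+FFEF.
enum class FormKind {
    NotInBlock,  // outside the Halfwidth and Fullwidth Forms block
    WordBased,   // tokenised like ordinary (broken-script) text
    Unbroken     // needs word-break finding
};

FormKind classify_halfwidth_fullwidth(unsigned p) {
    if (p < 0xFF00 || p > 0xFFEF) return FormKind::NotInBlock;
    // FF65..FFDC: halfwidth Katakana and Hangul, the only range kept.
    if (p >= 0xFF65 && p <= 0xFFDC) return FormKind::Unbroken;
    // Everything else in the block (fullwidth digits, Latin letters and
    // punctuation, halfwidth punctuation, symbols) is no longer listed.
    return FormKind::WordBased;
}

// Example checks:
//   U+FF21 FULLWIDTH LATIN CAPITAL LETTER A -> WordBased
//   U+FF76 HALFWIDTH KATAKANA LETTER KA     -> Unbroken
//   assert(classify_halfwidth_fullwidth(0xFF21) == FormKind::WordBased);
//   assert(classify_halfwidth_fullwidth(0xFF76) == FormKind::Unbroken);
```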

The fullwidth "????? ??????" test suggests to me that either Xapian
should allow for Unicode normalization, or application developers must take care
of this before indexing.
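
For the application-side option, a minimal sketch of folding fullwidth ASCII variants to their ordinary forms before handing text to the indexer might look like the following. The function name `fold_fullwidth_ascii` is hypothetical, and this covers only U+FF01..U+FF5E and the ideographic space U+3000; it is a small subset of what full NFKC normalization does and is not something Xapian is known to provide.

```cpp
#include <cassert>   // only needed for the example check at the end
#include <string>

// Map fullwidth ASCII variants (U+FF01..U+FF5E) to U+0021..U+007E and the
// ideographic space (U+3000) to U+0020. All other code points, including
// halfwidth Katakana, are passed through unchanged.
std::u32string fold_fullwidth_ascii(const std::u32string& in) {
    std::u32string out;
    out.reserve(in.size());
    for (char32_t c : in) {
        if (c >= 0xFF01 && c <= 0xFF5E) {
            c -= 0xFEE0;  // fixed offset between the two ranges
        } else if (c == 0x3000) {
            c = 0x0020;
        }
        out.push_back(c);
    }
    return out;
}

// Example check: fullwidth "Hello" folds to plain ASCII.
//   assert(fold_fullwidth_ascii(U"\uFF28\uFF45\uFF4C\uFF4C\uFF4F") == U"Hello");
```

The same folding would need to be applied to query strings as well as to indexed text, otherwise the two sides would still disagree on the term form.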