Olly Betts
2024-Jan-09 02:28 UTC
Possible bug using FLAG_WORD_BREAKS with fullwidth Unicode codepoints
On Mon, Jan 08, 2024 at 02:01:46PM +0100, Robert Stepanek wrote:
> Removing the whole block will cause word-breaker to not correctly
> handle halfwidth Katakana, such as "??????????" which it would treat
> as a single term, whereas it should be two: ??????and ????).
>
> My pull request causes word-breaker to only handle halfwidth Katakana
> and Hangul codepoints as unbroken script and treats Latin characters,
> numbers, symbols and punctuation as broken script. There's a couple of
> unit tests that check for this.

Thanks, that looks good - now merged.

I think we probably should backport this to 1.4 - it's a behaviour
change, but limited to text containing these fullwidth latin characters
and the change fixes a bug. The awkward wrinkle is that you need to
reindex to get the full benefits of the fix.

Did you already check the other ranges for cased letters? I can but if
you have already there's not much point.

> The fullwidth "????? ??????" tests suggests to me that
> either Xapian should allow for Unicode normalization, or application
> developers must take care of this before indexing.

We currently leave it to the API user to normalise Unicode
representation, though maybe we should provide support for doing so.

Cheers,
    Olly
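
For illustration only, the split described in the quoted pull request (halfwidth Katakana and Hangul treated as unbroken script, everything else in the block treated as ordinary broken script) can be sketched in Python. This is not Xapian's code: the helper name `is_unbroken_halfwidth` is hypothetical, and using Unicode character names and general categories to make the decision is an assumption chosen for readability, not how the merged change is implemented.

```python
import unicodedata


def is_unbroken_halfwidth(ch: str) -> bool:
    """Hypothetical helper, not part of Xapian.

    Returns True for letters in the Halfwidth and Fullwidth Forms block
    (U+FF00..U+FFEF) that are halfwidth Katakana or halfwidth Hangul.
    Fullwidth Latin letters, digits, symbols and punctuation return False.
    """
    if not ("\uff00" <= ch <= "\uffef"):
        return False
    # Unassigned code points have no name; treat them as "not unbroken".
    name = unicodedata.name(ch, "")
    # General categories starting with "L" are letters (Lo, Lm, Lu, Ll, ...).
    is_letter = unicodedata.category(ch).startswith("L")
    return is_letter and name.startswith(
        ("HALFWIDTH KATAKANA", "HALFWIDTH HANGUL")
    )


if __name__ == "__main__":
    for ch in ("\uff76", "\uffa1", "\uff21", "\uff11", "\uff65"):
        print(f"U+{ord(ch):04X} {unicodedata.name(ch)}: {is_unbroken_halfwidth(ch)}")
```

The letter check is what keeps U+FF65 (HALFWIDTH KATAKANA MIDDLE DOT) out of the unbroken set, since it is punctuation despite its name.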
Robert Stepanek
2024-Jan-10 08:02 UTC
Possible bug using FLAG_WORD_BREAKS with fullwidth Unicode codepoints
On Tue, Jan 9, 2024, at 3:28 AM, Olly Betts wrote:
> Thanks, that looks good - now merged.

Thanks!

> Did you already check the other ranges for cased letters? I can but if
> you have already there's not much point.

I did not. If you find time, that'd be great. Otherwise I can make room
for it in the next days.

> > The fullwidth "????? ??????" tests suggests to me that
> > either Xapian should allow for Unicode normalization, or application
> > developers must take care of this before indexing.
>
> We currently leave it to the API user to normalise Unicode
> representation, though maybe we should provide support for doing so.

Thinking some more about this, I think it's sane to leave this out of
Xapian. Unless there is also some bookkeeping added within Xapian to
tell which normalisation was applied to terms, which can get complex for
sub-databases or mixed normalisations within one database.
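
Since the thread settles on leaving normalisation to the application, here is a minimal sketch of what that might look like on the caller's side, using Python's standard `unicodedata` module. The function name `normalise_for_indexing` is hypothetical, and the choice of NFKC is an assumption: it is the form that folds fullwidth Latin letters to their ASCII equivalents and halfwidth Katakana to the standard Katakana code points, but an application may have reasons to prefer another form.

```python
import unicodedata


def normalise_for_indexing(text: str) -> str:
    """Hypothetical application-side helper, not a Xapian API.

    Applies NFKC (compatibility decomposition followed by canonical
    composition) so that fullwidth and halfwidth variants collapse to
    their standard forms before the text is handed to the indexer.
    """
    return unicodedata.normalize("NFKC", text)


if __name__ == "__main__":
    fullwidth_hello = "\uff28\uff45\uff4c\uff4c\uff4f"  # fullwidth "Hello"
    halfwidth_ka = "\uff76"                             # halfwidth Katakana KA
    print(normalise_for_indexing(fullwidth_hello))
    print(normalise_for_indexing(halfwidth_ka))
```

The same function would need to be applied to query strings as well as to indexed text, otherwise normalised terms in the database will not match unnormalised terms in the query. That is also the bookkeeping concern raised above: the application has to remember which normalisation a given database was built with.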