On Tue, Feb 13, 2018, at 02:32, Peter Zhao wrote:
> At 2018-02-12 20:00:02, xapian-discuss-request at lists.xapian.org wrote:
> >There's also a patch to add support for using libicu to find word
> >boundaries:
> >
> >https://github.com/xapian/xapian/pull/114
> >
> >That'll get merged soon hopefully (mostly we need to sort out how to
> >manage the libicu dependency - do we make it a hard dependency, or an
> >option for how to build xapian-core, etc) but if you're happy to build
> >xapian-core from source please try it and give feedback on how well
> >it works.

We have been running the CJK word boundary segmentation patch in production at
FastMail for over a year and are happy with it. That being said, I just realised
that the PR does not merge cleanly with the latest Xapian upstream branch. I'll
fix the merge conflicts and push an update to the pull request tomorrow.

BTW: For a quick look at how ICU segments arbitrary CJK text, I wrote a small
wrapper around libicu and exposed it as a web tool: https://cjkwords.com/

Cheers,
Robert