thr3ads.net - Xapian discuss - Indexing Chinese? [Oct 2018]

Eric Abrahamsen

2018-Oct-04 03:18 UTC

Indexing Chinese?

My second (and hopefully last) question: is there any more news on
indexing Chinese characters and words? Searching online mostly returns
results from a decade ago or more, with nothing very conclusive. How
close is this to possible?

For the time being I'm doing some pre-processing on long strings of
Chinese, breaking on punctuation in order to avoid errors. But I have
some large corpora of Chinese texts that in the future I'd like to index
properly.
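
For illustration, the kind of pre-processing described above could look roughly like the sketch below. It is only a guess at the approach, not the actual script: the helper name `split_on_punctuation` and the particular set of punctuation marks are assumptions made for this example.

```python
import re

# Assumed set of break characters: common full-width CJK punctuation,
# their ASCII counterparts, and whitespace. Adjust for the corpus at hand.
_BREAKS = re.compile(r"[，。！？；：、,.!?;:\s]+")


def split_on_punctuation(text):
    """Split a long run of text into shorter chunks at punctuation.

    Hypothetical helper: each returned chunk contains no punctuation or
    whitespace, and empty chunks are dropped.
    """
    return [chunk for chunk in _BREAKS.split(text) if chunk]


if __name__ == "__main__":
    for chunk in split_on_punctuation("我喜欢搜索，也喜欢索引。"):
        print(chunk)
```

Each chunk can then be handed to the indexer separately. This only shortens the runs of characters; it does not find word boundaries inside them.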

Thanks,
Eric

Robert Stepanek

2018-Oct-04 07:27 UTC

Indexing Chinese?

We are using a fork of Xapian for this at the Cyrus IMAP project [1], using
the ICU library's Unicode word segmentation for Chinese, Japanese and Korean
[2]. We have been using it in production at FastMail for about 2 years and are
generally happy with it; the search results improved over using ngrams.
There's a pull request open to merge the patch upstream [3], but it's still
to be decided how best to add this to Xapian. Currently, the upstream patch
doesn't build cleanly on the master branch, but I'll look into making it
compile cleanly next week.
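
To make the ngram comparison concrete, here is a minimal sketch of overlapping bigram tokenization for runs of Chinese characters. It is an illustration only, not Xapian's actual implementation (whose details may differ); the names `is_cjk` and `cjk_bigrams` are made up for this example, and the character test assumes only the basic CJK Unified Ideographs block.

```python
def is_cjk(ch):
    """Rough test, assuming only the basic CJK Unified Ideographs block."""
    return "\u4e00" <= ch <= "\u9fff"


def cjk_bigrams(text):
    """Return overlapping two-character terms for each run of CJK characters.

    A run of a single character is emitted as-is. Non-CJK characters only
    act as run separators and produce no terms in this sketch.
    """
    terms = []
    run = []
    # The trailing sentinel flushes the final run.
    for ch in text + "\0":
        if is_cjk(ch):
            run.append(ch)
            continue
        if len(run) == 1:
            terms.append(run[0])
        else:
            terms.extend(a + b for a, b in zip(run, run[1:]))
        run = []
    return terms


if __name__ == "__main__":
    print(cjk_bigrams("中文搜索"))
```

For the input 中文搜索 ("Chinese search") this yields 中文, 文搜 and 搜索. The middle term straddles two words and is not a word itself, which is the kind of noise that dictionary-based word segmentation avoids.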

Cheers,
Robert

[1] https://github.com/cyrusimap/xapian
[2] http://site.icu-project.org/
[3] https://github.com/xapian/xapian/pull/114

Eric Abrahamsen

2018-Oct-04 15:31 UTC

Indexing Chinese?

That's a coincidence! And very good news. I've subscribed to the PR, and
will look forward to seeing it land!

Thanks a lot,
Eric
