My second (and hopefully last) question: is there any more news on indexing Chinese characters and words? Searching online mostly returns results from a decade ago or more, with nothing very conclusive. How close is this to possible?

For the time being I'm doing some pre-processing on long strings of Chinese, breaking on punctuation in order to avoid errors. But I have some large corpora of Chinese texts that in the future I'd like to index properly.

Thanks,
Eric
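For readers following along, the pre-processing described above (breaking long runs of Chinese on punctuation) might look roughly like the sketch below. The punctuation set and the function name `split_on_punctuation` are assumptions made for illustration; the thread does not show the actual code.

```python
import re

# Illustrative set of CJK and ASCII punctuation plus whitespace, treated as
# break points. This list is an assumption and is not exhaustive.
_BREAKS = re.compile(r"[，。！？；：、,.!?;:\s]+")


def split_on_punctuation(text):
    """Split text at punctuation or whitespace, dropping empty pieces."""
    return [piece for piece in _BREAKS.split(text) if piece]


if __name__ == "__main__":
    for chunk in split_on_punctuation("我喜欢搜索引擎。它们很有用！你呢？"):
        print(chunk)
```

Each resulting chunk can then be passed to the indexer separately. This only shortens the strings; it does not find word boundaries inside them.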
We are using a fork of Xapian for this at the Cyrus IMAP project [1], which relies on the word segmentation in the ICU (International Components for Unicode) library for Chinese, Japanese and Korean [2]. We have been using it in production at FastMail for about two years and are generally happy with it: search results improved compared with using n-grams.

There's a pull request open to merge the patch upstream [3], but how best to add this to Xapian is still to be decided. Currently, the upstream patch doesn't build cleanly on the master branch, but I'll look into making it compile cleanly next week.

Cheers,
Robert

[1] https://github.com/cyrusimap/xapian
[2] http://site.icu-project.org/
[3] https://github.com/xapian/xapian/pull/114

On Thu, Oct 4, 2018, at 05:20, Eric Abrahamsen wrote:
> My second (and hopefully last) question: is there any more news on
> indexing Chinese characters and words?
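To make the comparison with n-grams concrete, here is a minimal sketch of character-bigram term generation, the kind of baseline that word segmentation is being compared against. It is a generic illustration, not the code used by Xapian or the Cyrus fork, and the function name `cjk_bigrams` is hypothetical.

```python
def cjk_bigrams(text):
    """Return overlapping two-character terms for a run of CJK text.

    A single character is returned as a one-element list, and an empty
    string gives an empty list.
    """
    if len(text) < 2:
        return [text] if text else []
    return [text[i:i + 2] for i in range(len(text) - 1)]


if __name__ == "__main__":
    # "搜索引擎" means "search engine": 搜索 (search) + 引擎 (engine).
    print(cjk_bigrams("搜索引擎"))
```

For 搜索引擎 the bigrams are 搜索, 索引 and 引擎. The middle one, 索引 ("index"), straddles the word boundary and is not a word in this phrase, so a query for 索引 would match the text. A dictionary-based segmenter aims to emit only the real words, which is one plausible reason for the improved results reported above.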
That's a coincidence! And very good news. I've subscribed to the PR, and will look forward to seeing it land!

Thanks a lot,
Eric

On 10/04/18 03:27 AM, Robert Stepanek wrote:
> There's a pull request open to merge the patch upstream [3]
>
> [3] https://github.com/xapian/xapian/pull/114