Hi,

The Cyrus IMAP mail server uses Xapian as its search engine. Recently,
FastMail sponsored the implementation of two Xapian features: CJK word
splitting and a generator for search snippets. I've been working on both
features and we would be happy to get them merged into Xapian master.

The CJK word tokenizer uses the word segmentation algorithms of the
International Components for Unicode library (ICU), which brings support
for Japanese, Korean and Thai, among others. The feature co-exists with
n-grams (which remain the default for CJK text) and the code is
unit-tested [1]. In the feature branch, libicu is required to build, but
it would be easy to make that optional.

The search snippet generator was developed independently of Xapian's
MSet::snippet generator. It orders snippets within a document by their
relevance to the search terms, supports CJK and handles punctuation. The
unit tests in the commit [2] outline its main capabilities.

Would you be interested in these features? Just let us know what would
be required to get them merged. As a minimum I'd rebase the current
forks against latest master. I'll be happy to answer any questions or
change requests.

Cheers,
Robert

[1] CJK word splitter:
https://github.com/rsto/xapian/commit/16dd9b232eb9b6e7346184db0790b6655180492c

[2] Snippet generator:
https://github.com/rsto/xapian/commit/979757c161ec912c98f2fe87595d7529740e3247#diff-832f4feb83e5ba60ebb64b4d8b93d93fR1

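For readers unfamiliar with the n-gram approach mentioned in the message above, here is a minimal Python sketch of overlapping character bigrams over runs of CJK text. It is a simplified illustration only and is not Xapian's or ICU's actual code: the function names `is_cjk` and `cjk_bigrams` and the rough code point ranges are assumptions made for this example.

```python
def is_cjk(ch):
    """Rough CJK test (assumed ranges): ideographs, kana, Hangul syllables."""
    cp = ord(ch)
    return (
        0x4E00 <= cp <= 0x9FFF      # CJK Unified Ideographs
        or 0x3040 <= cp <= 0x30FF   # Hiragana and Katakana
        or 0xAC00 <= cp <= 0xD7AF   # Hangul syllables
    )


def cjk_bigrams(text):
    """Return overlapping character bigrams for each run of CJK characters.

    A run consisting of a single character is emitted as a unigram.
    Non-CJK characters only act as separators in this sketch.
    """
    terms = []
    run = []
    # The trailing space is a sentinel that flushes the final run.
    for ch in text + " ":
        if is_cjk(ch):
            run.append(ch)
            continue
        if len(run) == 1:
            terms.append(run[0])
        else:
            # zip over an empty run yields nothing, so this is safe.
            terms.extend(a + b for a, b in zip(run, run[1:]))
        run = []
    return terms


print(cjk_bigrams("東京都"))
```

The bigrams for "東京都" include "京都", so a query for that string would match text that is about something else. A dictionary-based word segmenter would be expected to avoid this kind of overlap match, which is one motivation for offering word splitting alongside n-grams.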
On Tue, Jul 26, 2016 at 03:06:07PM +0200, rsto at paranoia.at wrote:
> The Cyrus IMAP mail server uses Xapian as search engine. Recently,
> FastMail has sponsored implementation of two Xapian features: CJK word
> splitting and a generator for search snippets. I've been working on both
> features and we would be happy to get them merged into Xapian master.
>
> Would you be interested in these features? Just let us know what would
> be required to get them merged. As a minimum I'd rebase the current
> forks against latest master. I'll be happy to answer any questions or
> change requests.

This sounds great! I know sufficiently little about CJK that I won't
try to comment on that at all :)

I think I'm right in saying that your snippet generator:

 a. needs driving separately (so it's not integrated in the way
    Xapian::MSet::snippet() is). Is the intention that it replaces the
    current snippet system with something more sophisticated? I wonder
    if we can arrange suitable defaults to use your implementation with
    the older API, and come up with a newer API that allows a
    SnippetGenerator class to be used from the MSet. (That might allow
    us to refactor the existing implementation and provide both, if
    they have different strengths. I can't remember much detail of the
    current one, offhand.)

 b. only works with UTF-8 (I assume that the pre_match & post_match
    strings, and inter_snippet, should also be in UTF-8?). This
    probably just needs noting in the docstrings.

A good start would certainly be rebasing against master and opening a
pull request for each on GitHub (this will trigger Travis CI builds,
which are a helpful first pass in making sure everything is good; they
run against both G++ and Clang, which can expose some weirdnesses).

J

-- 
  James Aylett, occasional trouble-maker
  xapian.org

Hi James,

thanks for the feedback.

On Thu, Jul 28, 2016, at 00:22, James Aylett wrote:
> This sounds great! I know sufficiently little about CJK that I won't
> try to comment on that at all :)

I've just opened a pull request for the CJK tokenizer:
https://github.com/xapian/xapian/pull/114

> I wonder if we can arrange suitable defaults to use your
> implementation with the older API, and come up with a newer API that
> allows a SnippetGenerator class to be used from the MSet.

The FastMail snippet generator was written at a time when MSet did not
yet create snippets. I'll first compare both implementations to see
whether there is a good reason for them to coexist, or whether I might
just as well merge any additional features into MSet.

> A good start would certainly be rebasing against master and opening a
> pull request for each on github (this will trigger travis CI builds,
> which is a helpful first pass in making sure everything good; it runs
> against both G++ and Clang, which can expose some weirdnesses).

Unfortunately, the Travis build fails because pkg-config can't find
libicu on the machine [1]. I could make the libicu dependency optional,
which might be useful for Xapian installations that don't need CJK
support, but for the Travis tests it would make sense to enable ICU.

Cheers,
Robert

[1] https://travis-ci.org/xapian/xapian/jobs/148268282#L1522