thr3ads.net - Xapian devel - Pull requests: CJK words and Snippet generator [Dec 2016]


rsto at paranoia.at

2016-Sep-19 09:27 UTC

Pull requests: CJK words and Snippet generator

Olly, sorry for my delayed reply.

On Mon, 12 Sep 2016, at 05:32, Olly Betts wrote:
> On Wed, Sep 07, 2016 at 02:30:16PM +0200, rsto at paranoia.at wrote:
> > On Tue, Sep 6, 2016, at 09:16, Olly Betts wrote:
> > > I think my main concerns are about efficiency [...]
> > For the proposed term coverage, the implementation looks up and inserts
> > terms into a map. That makes it slightly less efficient with an overall
> > complexity of O(n*log n).
> By "efficiency", I'm meaning in terms of wall-clock time, not the
> computational complexity of the algorithms.
> I'm not quite clear what your "n" above is -
n is the number of terms in a document. I haven't done systematic
testing of wall-clock time for the new feature. If benchmarks are crucial
for going ahead with the patch, I could create a couple.
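
To make the two notions of cost concrete, here is a minimal sketch, assuming the term coverage boils down to counting terms in a std::map. The names `TermCounts`, `count_terms` and `time_count_terms` are hypothetical and are not taken from the patch or from Xapian's API. Each map lookup or insert is logarithmic in the number of distinct terms, which gives the O(n*log n) bound, while the timing helper measures the wall-clock time Olly is asking about.

```cpp
#include <cassert>
#include <chrono>
#include <cstddef>
#include <map>
#include <string>
#include <vector>

// Hypothetical type, for illustration only: term -> number of occurrences.
using TermCounts = std::map<std::string, std::size_t>;

// Count how often each term occurs. Every operator[] call is a lookup that
// inserts a zero-initialised entry the first time a term is seen, so n terms
// cost O(n log n) comparisons in the worst case.
TermCounts count_terms(const std::vector<std::string>& terms)
{
    TermCounts counts;
    for (const std::string& term : terms) {
        ++counts[term];
    }
    return counts;
}

// Measure the wall-clock time of one counting pass with a monotonic clock.
// The counts are handed back through `out` so the work cannot be skipped.
std::chrono::steady_clock::duration
time_count_terms(const std::vector<std::string>& terms, TermCounts& out)
{
    const auto start = std::chrono::steady_clock::now();
    out = count_terms(terms);
    const auto elapsed = std::chrono::steady_clock::now() - start;
    // steady_clock never goes backwards, so the elapsed time is non-negative.
    assert(elapsed >= std::chrono::steady_clock::duration::zero());
    return elapsed;
}
```

A real benchmark would run this over the terms of many representative documents and compare against the unpatched code path; a single pass over one document mostly measures noise.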
> The tokenisation of the snippet uses the same code as indexing does, so
> CJK should just work automatically, though it looks like there aren't
> currently any testcases for this, so it would be worth checking (and
> worth adding some)
> 
> Normalisation could perhaps be done with a custom stemming algorithm.
> The indexing pipeline doesn't currently have a separate stage for
> normalisation and for stemming.
I'll investigate both options with tests and will merge them into
Xapian's unit tests where it makes sense. I won't be able to come up
with it until next week, though.
> The main issue is that new codepoints get added (and the odd one changes
> category) in each new Unicode version, so if you're using different
> Unicode versions at index time and at search time, the terms you get
> won't match each other.  [...] If Xapian's CJK::codepoint_is_cjk() and
> ICU have different ideas of what's in CJK, the results might be odd, and
> will likely vary depending on the exact combination of Unicode versions
ICU currently only word-breaks text that `codepoint_is_cjk` has already
identified as CJK text, so there shouldn't be a gap between search and
indexing. Still, I understand your concerns about having two Unicode
implementations. Our specific experience aside, migrating Xapian's
Unicode handling to ICU might be a good choice, and one I could support.
Admittedly, ICU's modules go far beyond what Xapian's UTF8Iterator
currently provides.

Cheers,
Robert

Bron Gondwana

2016-Oct-03 23:37 UTC


Pull requests: CJK words and Snippet generator

On Mon, 19 Sep 2016, at 20:27, rsto at paranoia.at wrote:
> Olly, sorry for my delayed reply.
> 
> On Mon, 12 Sep 2016, at 05:32, Olly Betts wrote:
> > On Wed, Sep 07, 2016 at 02:30:16PM +0200, rsto at paranoia.at wrote:
> > > On Tue, Sep 6, 2016, at 09:16, Olly Betts wrote:
> > > > I think my main concerns are about efficiency [...]
> > > For the proposed term coverage, the implementation looks up and inserts
> > > terms into a map. That makes it slightly less efficient with an overall
> > > complexity of O(n*log n).
> > By "efficiency", I'm meaning in terms of wall-clock time, not the
> > computational complexity of the algorithms.
> > I'm not quite clear what your "n" above is -
> 
> n is the number of terms in a document. I haven't done systematic
> testing of wall-clock time for the new feature. If it is crucial to go
> ahead with the patch, I could create a couple of benchmarks.
Is there a good dataset to run benchmarks against?  We'll be testing this
shortly on FastMail, but there will be enough confounding factors that it
won't be a realistic benchmark of just the individual changes to Xapian.
> > The tokenisation of the snippet uses the same code as indexing does, so
> > CJK should just work automatically, though it looks like there aren't
> > currently any testcases for this, so it would be worth checking (and
> > worth adding some)
> > 
> > Normalisation could perhaps be done with a custom stemming algorithm.
> > The indexing pipeline doesn't currently have a separate stage for
> > normalisation and for stemming.
> 
> I'll investigate both options with tests and will merge them into
> Xapian's unit tests where it makes sense. I won't be able to come up
> with it until next week, though.
> 
> > The main issue is that new codepoints get added (and the odd one changes
> > category) in each new Unicode version, so if you're using different
> > Unicode versions at index time and at search time, the terms you get
> > won't match each other.  [...] If Xapian's CJK::codepoint_is_cjk() and
> > ICU have different ideas of what's in CJK, the results might be odd, and
> > will likely vary depending on the exact combination of Unicode versions
I guess my question here is - how much churn is there here in reality?  Assuming
that existing codepoints never change CJKness and you're always using a
newer version of Unicode at search time than at index time, I think this risk
goes away, because you never index those codepoints.

Making sure Xapian and ICU agree on what is CJK is necessary of course, but
hopefully that could be done in a few hours of machine time just by throwing
every possible codepoint at both libraries and asking them :)
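
A minimal sketch of that brute-force comparison, assuming each library's CJK test can be wrapped as a plain predicate on a codepoint. `find_cjk_disagreements`, `older_is_cjk` and `newer_is_cjk` are hypothetical names: the two predicates are toy stand-ins that differ by one Unicode block, not Xapian's CJK::codepoint_is_cjk() or any ICU call, which would have to be plugged in to do the real check.

```cpp
#include <cassert>
#include <cstdint>
#include <functional>
#include <vector>

using CjkPredicate = std::function<bool(std::uint32_t)>;

// Walk every Unicode scalar value (U+0000..U+10FFFF, skipping the surrogate
// range) and collect the codepoints on which the two predicates disagree.
std::vector<std::uint32_t>
find_cjk_disagreements(const CjkPredicate& a, const CjkPredicate& b)
{
    std::vector<std::uint32_t> diffs;
    for (std::uint32_t cp = 0; cp <= 0x10FFFF; ++cp) {
        if (cp >= 0xD800 && cp <= 0xDFFF) continue;  // surrogates are not characters
        if (a(cp) != b(cp)) diffs.push_back(cp);
    }
    // Codepoints are visited in ascending order, so the result is sorted.
    assert(diffs.empty() || diffs.front() <= diffs.back());
    return diffs;
}

// Toy stand-in (assumption): treats only the CJK Unified Ideographs block,
// U+4E00..U+9FFF, as CJK.
bool older_is_cjk(std::uint32_t cp)
{
    return cp >= 0x4E00 && cp <= 0x9FFF;
}

// Toy stand-in (assumption): additionally treats the Extension A block,
// U+3400..U+4DBF, as CJK.
bool newer_is_cjk(std::uint32_t cp)
{
    return older_is_cjk(cp) || (cp >= 0x3400 && cp <= 0x4DBF);
}
```

Calling `find_cjk_disagreements(older_is_cjk, newer_is_cjk)` returns exactly the Extension A block here. With the real predicates in place, an empty result would mean the two libraries agree for that particular pair of Unicode versions, so the check would need repeating whenever either library moves to a newer Unicode release.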

Robert is in Australia visiting the FastMail office to co-work with us for a
couple of months, and I'd love to get this Xapian integration work done
during this time.  We're also looking to release Cyrus IMAPd version 3.0
some time in the next few months, and it would be great to not depend on too
many custom patches!  Ideally I'd like to be running vanilla upstream Xapian
libraries on FastMail's production rather than keeping a separate branch as
well.

Cheers,

Bron.

-- 
  Bron Gondwana
  brong at fastmail.fm

Olly Betts

2016-Dec-13 04:04 UTC


Pull requests: CJK words and Snippet generator

On Tue, Oct 04, 2016 at 10:37:49AM +1100, Bron Gondwana wrote:
> Robert is in Australia visiting the FastMail office to co-work with us for a
> couple of months, and I'd love to get this Xapian integration work done
> during this time.  We're also looking to release Cyrus IMAPd version 3.0 some
> time in the next few months, and it would be great to not depend on too many
> custom patches!  Ideally I'd like to be running vanilla upstream Xapian
> libraries on FastMail's production rather than keeping a separate branch as
> well.
Did you get a chance to look at the patch I linked to from the snippet PR?

https://github.com/xapian/xapian/pull/117

Cheers,
    Olly
