On Mon, 19 Sep 2016, at 20:27, rsto at paranoia.at wrote:
> Olly, sorry for my delayed reply.
>
> Am Mo, 12. Sep 2016, um 05:32, schrieb Olly Betts:
> > On Wed, Sep 07, 2016 at 02:30:16PM +0200, rsto at paranoia.at wrote:
> > > On Tue, Sep 6, 2016, at 09:16, Olly Betts wrote:
> > > > I think my main concerns are about efficiency [...]
> > > For the proposed term coverage, the implementation looks up and inserts
> > > terms into a map. That makes it slightly less efficient with an overall
> > > complexity of O(n*log n).
> > By "efficiency", I'm meaning in terms of wall-clock time, not the
> > computational complexity of the algorithms.
> > I'm not quite clear what your "n" above is -
>
> n is the number of terms in a document. I haven't done systematic
> testing of wall-clock time for the new feature. If it is crucial to go
> ahead with the patch, I could create a couple of benchmarks.

Is there a good dataset to run benchmarks against? We'll be testing this
shortly on FastMail, but there will be enough confounding factors that it
won't be a realistic benchmark of just the individual changes to Xapian.

> > The tokenisation of the snippet uses the same code as indexing does, so
> > CJK should just work automatically, though it looks like there aren't
> > currently any testcases for this, so it would be worth checking (and
> > worth adding some)
> >
> > Normalisation could perhaps be done with a custom stemming algorithm.
> > The indexing pipeline doesn't currently have a separate stage for
> > normalisation and for stemming.
>
> I'll investigate both options with tests and will merge them into
> Xapian's unit tests where it makes sense. I won't be able to come up
> with it until next week, though.
>
> > The main issue is that new codepoints get added (and the odd one changes
> > category) in each new Unicode version, so if you're using different
> > Unicode versions at index time and at search time, the terms you get
> > won't match each other. [...] If Xapian's CJK::codepoint_is_cjk() and
> > ICU have different ideas of what's in CJK, the results might be odd,
> > and will likely vary depending on the exact combination of Unicode
> > versions

I guess my question here is - how much churn is there in reality? Assuming
that existing codepoints never change CJKness and you're always using a
newer version of Unicode at search time than at index time, I think this risk
goes away, because you never index those codepoints.

Making sure Xapian and ICU agree on what is CJK is necessary of course, but
hopefully that could be done in a few hours of machine time just by throwing
every possible codepoint at both libraries and asking them :)
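A minimal sketch of that brute-force comparison, in C++ since that is what
Xapian is written in. The two predicates below are toy stand-ins invented for
illustration (they are not what Xapian's CJK::codepoint_is_cjk() or ICU
actually do); for a real run you would pass wrappers around the two library
calls instead. The names find_disagreements, toy_predicate_a and
toy_predicate_b are hypothetical.

```cpp
#include <cstdint>
#include <functional>
#include <vector>

// Any callable that answers "is this codepoint CJK?".
using CodepointPredicate = std::function<bool(std::uint32_t)>;

// Inclusive range of codepoints on which two predicates disagree.
struct Range {
    std::uint32_t first;
    std::uint32_t last;
};

// Walk every Unicode scalar value and collect the ranges where the two
// predicates give different answers. Adjacent codepoints are merged so the
// report stays short.
std::vector<Range> find_disagreements(const CodepointPredicate& a,
                                      const CodepointPredicate& b) {
    constexpr std::uint32_t kMaxCodepoint = 0x10FFFF;
    std::vector<Range> out;
    for (std::uint32_t cp = 0; cp <= kMaxCodepoint; ++cp) {
        // Surrogates are not scalar values, so skip them.
        if (cp >= 0xD800 && cp <= 0xDFFF) continue;
        if (a(cp) == b(cp)) continue;
        if (!out.empty() && out.back().last + 1 == cp) {
            out.back().last = cp;      // extend the current range
        } else {
            out.push_back({cp, cp});   // start a new range
        }
    }
    return out;
}

// Toy stand-in (assumption): treats only the CJK Unified Ideographs block
// as CJK.
bool toy_predicate_a(std::uint32_t cp) {
    return cp >= 0x4E00 && cp <= 0x9FFF;
}

// Toy stand-in (assumption): additionally treats the Extension A block
// as CJK.
bool toy_predicate_b(std::uint32_t cp) {
    return (cp >= 0x3400 && cp <= 0x4DBF) || (cp >= 0x4E00 && cp <= 0x9FFF);
}
```

With the toy predicates, find_disagreements(toy_predicate_a, toy_predicate_b)
reports a single range, U+3400 to U+4DBF. An empty result from the real
predicates would mean the two libraries agree on every codepoint.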

Robert is in Australia visiting the FastMail office to co-work with us for a
couple of months, and I'd love to get this Xapian integration work done
during this time. We're also looking to release Cyrus IMAPd version 3.0
some time in the next few months, and it would be great to not depend on too
many custom patches! Ideally I'd like to be running vanilla upstream Xapian
libraries on FastMail's production rather than keeping a separate branch as
well.

Cheers,
Bron.
--
Bron Gondwana
brong at fastmail.fm