On Tue, Sep 6, 2016, at 09:16, Olly Betts wrote:
> I think my main concerns are about efficiency (since that's a major
> motivation for the current implementation, so slowing it down would be
> annoying), and whether we can just make this the standard behaviour
> rather than adding an option.

The current implementation is O(n) and I took care to keep it at that.
For the proposed term coverage, the implementation looks up and inserts
terms into a map, which makes it slightly less efficient, with an overall
complexity of O(n*log n). I could change this to use an unordered_map
(which is constant per operation on average), but that could degenerate
to O(n^2) in the worst case. If n*log n is acceptable, one could keep
the current linear heuristic as the default and let users choose the
slightly less performant snippet generator with a flag?

> What are the other features the fastmail snippet generator has which
> the current one lacks? I did study the fastmail one, but that was some
> time ago and I don't remember clearly.

Off the top of my head: normalization of terms and CJK support. By
normalization I mean that the API allows injecting a custom preprocessor
for document and search terms before they are matched (that's mainly
useful due to a quirk in Cyrus search). To be honest, I am not sure
these features even need to be migrated. I'll run a couple of tests to
check whether the current Xapian snippet generator already covers them.

> For the CJK segmentation, the ICU dependency makes things more complex,
> so I suspect that'll take longer to sort out. For example, Xapian
> currently has its own Unicode support, but that presumably means we
> could end up using two different versions of Unicode, so perhaps we
> ought to use ICU for everything if we're using it at all.

At the moment, the pull request only builds on ICU's word segmentation
and keeps using Xapian for character set conversions, so I don't see
much risk of conflicting implementations.
In a similar scenario, I recently replaced Cyrus' custom charset support
with ICU, but we noticed performance degradation for our specific use
cases. We ended up porting back the fast, custom-built codepaths for
UTF-8 and falling back to ICU for other charsets. That's not to say
that ICU isn't a viable choice, but it'd require a thorough assessment.

Cheers,
Robert
On Wed, Sep 07, 2016 at 02:30:16PM +0200, rsto at paranoia.at wrote:
> On Tue, Sep 6, 2016, at 09:16, Olly Betts wrote:
> > I think my main concerns are about efficiency (since that's a major
> > motivation for the current implementation, so slowing it down would be
> > annoying), and whether we can just make this the standard behaviour
> > rather than adding an option.
>
> The current implementation is O(n) and I took care to keep it at that.
> For the proposed term coverage, the implementation looks up and inserts
> terms into a map. That makes it slightly less efficient with an overall
> complexity of O(n*log n). I could change this to use an unordered_map
> (which is on average constant), but this could degenerate to O(n^2) in
> the worst case. If n*log n is acceptable, one could keep the current
> linear heuristic as default and let users choose the slightly less
> performant snippet generator with a flag?

By "efficiency" I mean wall-clock time, not the computational
complexity of the algorithms. I'm not quite clear what your "n" above
is - complexity can be a concern, but some things aren't really able to
grow without bound (e.g. the number of terms in the query), so
performance for smaller n can actually be what matters, and O()
deliberately only considers the limit as n grows ever larger.

> > What are the other features the fastmail snippet generator has which
> > the current one lacks? I did study the fastmail one, but that was some
> > time ago and I don't remember clearly.
>
> Off the top of my head: normalization of terms and CJK support. With
> normalization I mean that the API allows to inject a custom preprocessor
> for document and search terms before they are matched (that's mainly
> useful due to a quirk in Cyrus search). To be honest, I am not sure if
> these features even need to be migrated.
> I'll run a couple of tests to check whether the current Xapian snippet
> generator covers them already.

The tokenisation of the snippet uses the same code as indexing does, so
CJK should just work automatically, though it looks like there aren't
currently any testcases for this, so it would be worth checking (and
worth adding some).

Normalisation could perhaps be done with a custom stemming algorithm.
The indexing pipeline doesn't currently have a normalisation stage
separate from stemming.

> > For the CJK segmentation, the ICU dependency makes things more complex,
> > so I suspect that'll take longer to sort out. For example, Xapian
> > currently has its own Unicode support, but that presumably means we
> > could end up using two different versions of Unicode, so perhaps we
> > ought to use ICU for everything if we're using it at all.
>
> At the moment, the pull request only builds on ICU's word segmentation
> and keeps using Xapian for character set conversions so I don't see much
> risk of conflicting implementations.

Xapian-core doesn't actually convert character sets - text is assumed
to be UTF-8 (when it's not handled opaquely). The main issue is that
new codepoints get added (and the odd one changes category) in each new
Unicode version, so if you're using different Unicode versions at index
time and at search time, the terms you get won't match each other.
We've so far addressed that by sticking to the same Unicode version
within each release branch, but once you add ICU to the mix there's
potentially a second version in play, with a different update schedule.
If Xapian's CJK::codepoint_is_cjk() and ICU have different ideas of
what's in CJK, the results might be odd, and will likely vary depending
on the exact combination of Unicode versions.

> In a similar scenario, I recently replaced Cyrus' custom charset
> support with ICU but we noticed performance degradation for our
> specific use cases.
> We ended up porting back the fast, custom-built codepaths for UTF-8
> and falling back to ICU for other charsets. That's not to say that ICU
> isn't a viable choice, but it'd require a thorough assessment.

Hmm, not the most promising data point. Anyway, will try to get the
snippet patch merged first.

Cheers,
Olly
Olly, sorry for my delayed reply.

On Mon, 12 Sep 2016, at 05:32, Olly Betts wrote:
> On Wed, Sep 07, 2016 at 02:30:16PM +0200, rsto at paranoia.at wrote:
> > On Tue, Sep 6, 2016, at 09:16, Olly Betts wrote:
> > > I think my main concerns are about efficiency [...]
> > For the proposed term coverage, the implementation looks up and inserts
> > terms into a map. That makes it slightly less efficient with an overall
> > complexity of O(n*log n).
> By "efficiency", I'm meaning in terms of wall-clock time, not the
> computational complexity of the algorithms.
> I'm not quite clear what your "n" above is -

n is the number of terms in a document. I haven't done systematic
testing of wall-clock time for the new feature. If it is crucial to go
ahead with the patch, I could create a couple of benchmarks.

> The tokenisation of the snippet uses the same code as indexing does, so
> CJK should just work automatically, though it looks like there aren't
> currently any testcases for this, so it would be worth checking (and
> worth adding some).
>
> Normalisation could perhaps be done with a custom stemming algorithm.
> The indexing pipeline doesn't currently have a separate stage for
> normalisation and for stemming.

I'll investigate both options with tests and will merge them into
Xapian's unit tests where it makes sense. I won't be able to get to it
until next week, though.

> The main issue is that new codepoints get added (and the odd one changes
> category) in each new Unicode version, so if you're using different
> Unicode versions at index time and at search time, the terms you get
> won't match each other. [...]
> If Xapian's CJK::codepoint_is_cjk() and ICU have different ideas of
> what's in CJK, the results might be odd, and will likely vary depending
> on the exact combination of Unicode versions.

ICU currently only word-breaks text that `codepoint_is_cjk` has already
identified as CJK, so there shouldn't be a gap between search and
indexing.
Yet I understand your concerns about having two Unicode
implementations. Despite our specific experience, migrating Xapian's
Unicode handling to ICU might be a good choice, and I could support
that effort. Certainly its modules go far beyond what Xapian's
UTF8Iterator currently provides.

Cheers,
Robert