thr3ads.net - Xapian devel - Pull requests: CJK words and Snippet generator [Sep 2016]

If this information is useful, please help other people find it:
Share via:

rsto at paranoia.at

2016-Aug-18 13:31 UTC

Pull requests: CJK words and Snippet generator

Hi,

On Thu, Aug 11, 2016, at 13:19, rsto at paranoia.at
wrote:> The CJK word segmentation and snippet pull requests both pass Travis
> since middle/end of last week. Did you find time to look at them?
just checking in if you found time to look at the PRs? It'd be nice to
know a tentative timeline, so I can plan if to build next features on
top of our local fork or the upstream PRs.

Thanks,
Robert

Olly Betts

2016-Sep-06 07:16 UTC

head link

Pull requests: CJK words and Snippet generator

On Thu, Aug 18, 2016 at 03:31:46PM +0200, rsto at paranoia.at
wrote:> On Thu, Aug 11, 2016, at 13:19, rsto at paranoia.at wrote:
> > The CJK word segmentation and snippet pull requests both pass Travis
> > since middle/end of last week. Did you find time to look at them?
> 
> just checking in if you found time to look at the PRs?
I've not managed to yet find time to do more than a quick skim through
them - I'm still a bit back-logged from my post-1.4.0-release break.

But GSoC has now wrapped up, so hopefully I can finish catching up soon.
> It'd be nice to know a tentative timeline, so I can plan if to build
> next features on top of our local fork or the upstream PRs.
The snippet patch seems like something we should be able to get merged
and backported for 1.4.x fairly efficiently.  We're overdue 1.4.1 so
that's not a realistic target, but .2 or .3 might be.

I think my main concerns are about efficiency (since that a major
motivation for the current implementation, so slowing it down would be
annoying), and whether we can just make this the standard behaviour
rather than adding an option.  It's not a deliberate choice that it
doesn't do this already - when I was implementing it I actually wanted
to favour snippets containing more different terms over ones with
repetitions of the same term, but I failed to come up with an efficient
way to do that.  Your approach looks very promising from the quick look
I took.

What are the other features the fastmail snippet generator has which
the current one lacks?  I did study the fastmail one, but that was some
time ago and I don't remember clearly.

For the CJK segmentation, the ICU dependency makes things more complex,
so I suspect that'll take longer to sort out.  For example, Xapian
currently has its own Unicode support, but that presumably means we
could end up using two different versions of Unicode, so perhaps we
ought to use ICU for everything if we're using it at all.

Cheers,
    Olly

rsto at paranoia.at

2016-Sep-07 12:30 UTC

head link

Pull requests: CJK words and Snippet generator

On Tue, Sep 6, 2016, at 09:16, Olly Betts wrote:> I think my main concerns are about efficiency (since that a major
> motivation for the current implementation, so slowing it down would be
> annoying), and whether we can just make this the standard behaviour
> rather than adding an option.
The current implementation is O(n) and I took care to keep it at that.
For the proposed term coverage, the implementation looks up and inserts
terms into a map. That makes it slightly less efficient with an overall
complexity of O(n*log n). I could change this to use an unordered_map
(which is on average constant), but this this could degenerate to O(n^2)
in worst case. If n*log n is acceptable, one could keep the current
linear heuristic as default and let users choose the slightly less
performant snippet generator with a flag?
> What are the other features the fastmail snippet generator has which
> the current one lacks?  I did study the fastmail one, but that was some
> time ago and I don't remember clearly.
Off the top of my head: normalization of terms and CJK support. With
normalization I mean that the API allows to inject a custom preprocessor
for document and search terms before they are matched (that's mainly
useful due to a quirk in Cyrus search). To be honest, I am not sure if
these features even need to be migrated. I'll run a couple of tests if
the current Xapian snippet generator covers them already.
> For the CJK segmentation, the ICU dependency makes things more complex,
> so I suspect that'll take longer to sort out.  For example, Xapian
> currently has its own Unicode support, but that presumably means we
> could end up using two different versions of Unicode, so perhaps we
> ought to use ICU for everything if we're using it at all.
At the moment, the pull request only builds on ICU's word segmentation
and keeps using Xapian for character set conversions so I don't see much
risk of conflicting implementations. In a similar scenario, I recently
replaced Cyrus' custom charset support with ICU but we noticed
performance degradation for our specific use cases. We ended up porting
back the fast, custom-built codepaths for UTF-8 and fall back to ICU for
other charsets. That's not to say that ICU isn't a viable choice, but
it'd require a thorough assessment.

Cheers,
Robert

Seemingly Similar Threads

Search for more reasonably related threads

Xapian devel - Sep 2016 - Pull requests: CJK words and Snippet generator

Pull requests: CJK words and Snippet generator

Pull requests: CJK words and Snippet generator

Pull requests: CJK words and Snippet generator

Seemingly Similar Threads