Hi, On Thu, Aug 11, 2016, at 13:19, rsto at paranoia.at wrote:> The CJK word segmentation and snippet pull requests both pass Travis > since middle/end of last week. Did you find time to look at them?just checking in if you found time to look at the PRs? It'd be nice to know a tentative timeline, so I can plan if to build next features on top of our local fork or the upstream PRs. Thanks, Robert
On Thu, Aug 18, 2016 at 03:31:46PM +0200, rsto at paranoia.at wrote:> On Thu, Aug 11, 2016, at 13:19, rsto at paranoia.at wrote: > > The CJK word segmentation and snippet pull requests both pass Travis > > since middle/end of last week. Did you find time to look at them? > > just checking in if you found time to look at the PRs?I've not managed to yet find time to do more than a quick skim through them - I'm still a bit back-logged from my post-1.4.0-release break. But GSoC has now wrapped up, so hopefully I can finish catching up soon.> It'd be nice to know a tentative timeline, so I can plan if to build > next features on top of our local fork or the upstream PRs.The snippet patch seems like something we should be able to get merged and backported for 1.4.x fairly efficiently. We're overdue 1.4.1 so that's not a realistic target, but .2 or .3 might be. I think my main concerns are about efficiency (since that a major motivation for the current implementation, so slowing it down would be annoying), and whether we can just make this the standard behaviour rather than adding an option. It's not a deliberate choice that it doesn't do this already - when I was implementing it I actually wanted to favour snippets containing more different terms over ones with repetitions of the same term, but I failed to come up with an efficient way to do that. Your approach looks very promising from the quick look I took. What are the other features the fastmail snippet generator has which the current one lacks? I did study the fastmail one, but that was some time ago and I don't remember clearly. For the CJK segmentation, the ICU dependency makes things more complex, so I suspect that'll take longer to sort out. For example, Xapian currently has its own Unicode support, but that presumably means we could end up using two different versions of Unicode, so perhaps we ought to use ICU for everything if we're using it at all. Cheers, Olly
On Tue, Sep 6, 2016, at 09:16, Olly Betts wrote:> I think my main concerns are about efficiency (since that a major > motivation for the current implementation, so slowing it down would be > annoying), and whether we can just make this the standard behaviour > rather than adding an option.The current implementation is O(n) and I took care to keep it at that. For the proposed term coverage, the implementation looks up and inserts terms into a map. That makes it slightly less efficient with an overall complexity of O(n*log n). I could change this to use an unordered_map (which is on average constant), but this this could degenerate to O(n^2) in worst case. If n*log n is acceptable, one could keep the current linear heuristic as default and let users choose the slightly less performant snippet generator with a flag?> What are the other features the fastmail snippet generator has which > the current one lacks? I did study the fastmail one, but that was some > time ago and I don't remember clearly.Off the top of my head: normalization of terms and CJK support. With normalization I mean that the API allows to inject a custom preprocessor for document and search terms before they are matched (that's mainly useful due to a quirk in Cyrus search). To be honest, I am not sure if these features even need to be migrated. I'll run a couple of tests if the current Xapian snippet generator covers them already.> For the CJK segmentation, the ICU dependency makes things more complex, > so I suspect that'll take longer to sort out. For example, Xapian > currently has its own Unicode support, but that presumably means we > could end up using two different versions of Unicode, so perhaps we > ought to use ICU for everything if we're using it at all.At the moment, the pull request only builds on ICU's word segmentation and keeps using Xapian for character set conversions so I don't see much risk of conflicting implementations. In a similar scenario, I recently replaced Cyrus' custom charset support with ICU but we noticed performance degradation for our specific use cases. We ended up porting back the fast, custom-built codepaths for UTF-8 and fall back to ICU for other charsets. That's not to say that ICU isn't a viable choice, but it'd require a thorough assessment. Cheers, Robert