On Tue, Sep 6, 2016, at 09:16, Olly Betts wrote:
> I think my main concerns are about efficiency (since that's a major
> motivation for the current implementation, so slowing it down would be
> annoying), and whether we can just make this the standard behaviour
> rather than adding an option.

The current implementation is O(n) and I took care to keep it at that.
For the proposed term coverage, the implementation looks up and inserts
terms into a map, which makes it slightly less efficient, with an overall
complexity of O(n*log n). I could change this to use an unordered_map
(which is constant per operation on average), but that could degenerate
to O(n^2) in the worst case. If n*log n is acceptable, one could keep
the current linear heuristic as the default and let users choose the
slightly less performant snippet generator with a flag?

> What are the other features the fastmail snippet generator has which
> the current one lacks? I did study the fastmail one, but that was some
> time ago and I don't remember clearly.

Off the top of my head: normalization of terms and CJK support. By
normalization I mean that the API allows injecting a custom preprocessor
for document and search terms before they are matched (that's mainly
useful due to a quirk in Cyrus search). To be honest, I am not sure
these features even need to be migrated. I'll run a couple of tests to
check whether the current Xapian snippet generator already covers them.

> For the CJK segmentation, the ICU dependency makes things more complex,
> so I suspect that'll take longer to sort out. For example, Xapian
> currently has its own Unicode support, but that presumably means we
> could end up using two different versions of Unicode, so perhaps we
> ought to use ICU for everything if we're using it at all.

At the moment, the pull request only builds on ICU's word segmentation
and keeps using Xapian for character set conversions, so I don't see
much risk of conflicting implementations.
In a similar scenario, I recently replaced Cyrus' custom charset support
with ICU, but we noticed performance degradation for our specific use
cases. We ended up porting back the fast, custom-built codepaths for
UTF-8 and falling back to ICU for other charsets. That's not to say
that ICU isn't a viable choice, but it'd require a thorough assessment.

Cheers,
Robert
On Wed, Sep 07, 2016 at 02:30:16PM +0200, rsto at paranoia.at wrote:
> On Tue, Sep 6, 2016, at 09:16, Olly Betts wrote:
> > I think my main concerns are about efficiency (since that's a major
> > motivation for the current implementation, so slowing it down would be
> > annoying), and whether we can just make this the standard behaviour
> > rather than adding an option.
>
> The current implementation is O(n) and I took care to keep it at that.
> For the proposed term coverage, the implementation looks up and inserts
> terms into a map. That makes it slightly less efficient with an overall
> complexity of O(n*log n). I could change this to use an unordered_map
> (which is on average constant), but this could degenerate to O(n^2) in
> the worst case. If n*log n is acceptable, one could keep the current
> linear heuristic as default and let users choose the slightly less
> performant snippet generator with a flag?

By "efficiency" I mean wall-clock time, not the computational
complexity of the algorithms. I'm not quite clear what your "n" above
is - complexity can be a concern, but some things aren't really able to
grow without bound (e.g. the number of terms in the query), so
performance for smaller n can actually be what matters, and O()
deliberately only considers the limit as n grows ever larger.

> > What are the other features the fastmail snippet generator has which
> > the current one lacks? I did study the fastmail one, but that was some
> > time ago and I don't remember clearly.
>
> Off the top of my head: normalization of terms and CJK support. With
> normalization I mean that the API allows to inject a custom preprocessor
> for document and search terms before they are matched (that's mainly
> useful due to a quirk in Cyrus search). To be honest, I am not sure if
> these features even need to be migrated.
> I'll run a couple of tests to check whether the current Xapian snippet
> generator covers them already.

The tokenisation of the snippet uses the same code as indexing does, so
CJK should just work automatically, though it looks like there aren't
currently any testcases for this, so it would be worth checking (and
worth adding some).

Normalisation could perhaps be done with a custom stemming algorithm.
The indexing pipeline doesn't currently have a normalisation stage
separate from stemming.

> > For the CJK segmentation, the ICU dependency makes things more complex,
> > so I suspect that'll take longer to sort out. For example, Xapian
> > currently has its own Unicode support, but that presumably means we
> > could end up using two different versions of Unicode, so perhaps we
> > ought to use ICU for everything if we're using it at all.
>
> At the moment, the pull request only builds on ICU's word segmentation
> and keeps using Xapian for character set conversions so I don't see much
> risk of conflicting implementations.

Xapian-core doesn't actually convert character sets - text is assumed
to be UTF-8 (when it's not handled opaquely). The main issue is that
new codepoints get added (and the odd one changes category) in each new
Unicode version, so if you're using different Unicode versions at index
time and at search time, the terms you get won't match each other.
We've so far addressed that by sticking to the same Unicode version
within each release branch, but once you add ICU to the mix there's
potentially a second version in play, with a different update schedule.
If Xapian's CJK::codepoint_is_cjk() and ICU have different ideas of
what's in CJK, the results might be odd, and will likely vary depending
on the exact combination of Unicode versions.

> In a similar scenario, I recently replaced Cyrus' custom charset
> support with ICU but we noticed performance degradation for our
> specific use cases.
> We ended up porting back the fast, custom-built codepaths for UTF-8
> and falling back to ICU for other charsets. That's not to say that ICU
> isn't a viable choice, but it'd require a thorough assessment.

Hmm, not the most promising data point. Anyway, will try to get the
snippet patch merged first.

Cheers,
Olly
Olly, sorry for my delayed reply.

On Mon, 12 Sep 2016, at 05:32, Olly Betts wrote:
> On Wed, Sep 07, 2016 at 02:30:16PM +0200, rsto at paranoia.at wrote:
> > On Tue, Sep 6, 2016, at 09:16, Olly Betts wrote:
> > > I think my main concerns are about efficiency [...]
> > For the proposed term coverage, the implementation looks up and inserts
> > terms into a map. That makes it slightly less efficient with an overall
> > complexity of O(n*log n).
> By "efficiency", I'm meaning in terms of wall-clock time, not the
> computational complexity of the algorithms.
> I'm not quite clear what your "n" above is -

n is the number of terms in a document. I haven't done systematic
testing of wall-clock time for the new feature. If it is crucial to go
ahead with the patch, I could create a couple of benchmarks.

> The tokenisation of the snippet uses the same code as indexing does, so
> CJK should just work automatically, though it looks like there aren't
> currently any testcases for this, so it would be worth checking (and
> worth adding some).
>
> Normalisation could perhaps be done with a custom stemming algorithm.
> The indexing pipeline doesn't currently have a separate stage for
> normalisation and for stemming.

I'll investigate both options with tests and will merge them into
Xapian's unit tests where it makes sense. I won't be able to get to it
until next week, though.

> The main issue is that new codepoints get added (and the odd one changes
> category) in each new Unicode version, so if you're using different
> Unicode versions at index time and at search time, the terms you get
> won't match each other. [...]
> If Xapian's CJK::codepoint_is_cjk() and ICU have different ideas of
> what's in CJK, the results might be odd, and will likely vary depending
> on the exact combination of Unicode versions.

ICU currently only word-breaks text that `codepoint_is_cjk` has already
identified as CJK, so there shouldn't be a gap between search and
indexing.
Yet I understand your concerns about having two Unicode
implementations. Despite our specific experience, migrating Xapian's
Unicode handling to ICU might be a good choice, and I could support
that effort. Certainly its modules go far beyond what Xapian's
UTF8Iterator currently provides.

Cheers,
Robert