I've just been looking at ICU with an eye to reworking the unicode queryparser patch to use it. A few things have jumped out so far which make me wonder if it's the best option. I don't really know what the alternatives are though (currently QueryParser uses glib's unicode routines).

The first is that there seems to be bad version skew. Ubuntu breezy (the latest release) has ICU 2.1 and 2.8 packaged, as does Debian sarge (the latest stable release). The latest ICU version is 3.4.1 (and Debian unstable only has this version). I can't seem to find what's changed between ICU versions (except for release notes for the 3.2 and 3.4 versions), so I worry this is going to be a hassle.

The second is that all the multi-statement macro definitions in their headers are just enclosed in a block "{...}" instead of using the familiar "do {...} while (0)" trick to avoid surprise when used in places where an extra ";" matters. This doesn't seem to rate a mention in the user guide, but e.g. /usr/include/unicode/utf8.h says:

 * <em>Usage:</em>
 * ICU coding guidelines for if() statements should be followed when using these macros.
 * Compound statements (curly braces {}) must be used for if-else-while...
 * bodies and all macro statements should be terminated with semicolon.

I don't really like the attitude that *I* have to follow *their* coding guidelines in my own code! If I'm contributing code to their project then I agree it's reasonable to expect adherence to their coding standards, but not just to use their library. By eschewing the standard idiom for wrapping multiline macro calls, they're forcing the risk of silent miscompilation on their users.

Finally, they use UTF-16 as their internal representation whereas we want to use UTF-8. For the queryparser, this isn't an issue as there are macros for decoding UTF-8 characters and for saying if a unicode code point is upper case, etc. But in omindex we want to be able to convert between encodings, and it looks like we have to go via UTF-16.

I suspect we'd end up writing our own ISO-8859-1 to UTF-8 convertor (that's probably the most common conversion we'd need).

Cheers,
Olly
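To give a feel for how little code a hand-written ISO-8859-1 to UTF-8 conversion needs, here is a minimal sketch in C. The function name latin1_to_utf8 and its buffer contract are hypothetical, made up for illustration; nothing here comes from ICU, glib, or the Xapian sources. It relies only on the fact that ISO-8859-1 byte values equal Unicode code points U+0000 to U+00FF, so every byte at or above 0x80 becomes a two-byte UTF-8 sequence.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper (not part of ICU, glib, or Xapian): convert `len`
 * bytes of ISO-8859-1 to UTF-8.  The caller must supply an output buffer
 * with room for 2 * len + 1 bytes.  Returns the number of bytes written,
 * not counting the terminating NUL. */
static size_t latin1_to_utf8(const unsigned char *in, size_t len,
                             unsigned char *out)
{
    size_t n = 0;
    assert(in != NULL && out != NULL);
    for (size_t i = 0; i < len; ++i) {
        unsigned char ch = in[i];
        if (ch < 0x80) {
            /* ASCII is encoded as itself. */
            out[n++] = ch;
        } else {
            /* U+0080..U+00FF: lead byte 110xxxxx, trail byte 10xxxxxx. */
            out[n++] = (unsigned char)(0xC0 | (ch >> 6));
            out[n++] = (unsigned char)(0x80 | (ch & 0x3F));
        }
    }
    out[n] = '\0';
    return n;
}
```

For example, the four ISO-8859-1 bytes 'c', 'a', 'f', 0xE9 would come out as five UTF-8 bytes, with 0xE9 becoming 0xC3 0xA9. Conversions from other encodings need mapping tables, which is where the bulk of a general-purpose library goes.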
On Mon, Apr 10, 2006 at 04:44:21AM +0100, Olly Betts wrote:
> By eschewing the standard idiom for wrapping multiline macro calls,
> they're forcing the risk of silent miscompilation on their users.

I've been thinking about this, and I think there's actually no silent miscompilation risk here (I was thinking of the "dangling else" issue but this is a rather different situation). I think the extra semicolon can only stop code which looks like it should work from compiling (and on the flip side code which looks like it shouldn't compile can - e.g. if you accidentally omit the semicolon).

But I still think it's a bloody-minded attitude. A macro which looks like a function should work as one.

Michael Schlenker pointed me at the unicode handling code in Tcl. It looks very well done - the source file which provides utf8 handling and unicode codepoint identification for the BMP (i.e. codepoints below 0x10000) compiles to a 28K object file (x86-64) which I think is all the unicode support which the QueryParser class needs. I think here the evils of cut-and-paste code reuse are less than the annoyance of adding a large library dependency to the core library.

For Omega we also need encoding conversion, which I think inevitably needs a large bit of code or data. Tcl's code for this is compact, but has 1.3MB of data files. I don't see so much of an issue with adding a large library dependency to omega, be it ICU, glib, using Tcl's code, or using an installed version of Tcl. Or something else.

What's a good option partly comes down to "what are people likely to have installed anyway". Looking at the Debian "popcon" results, the answer seems to be glib, then Tcl, then ICU. But the spread isn't great and ICU is pretty common (OpenOffice uses it I believe). Not sure how representative the numbers are though, and they may be rather different for non-Linux platforms.

Cheers,
Olly
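The macro point can be illustrated with a small C sketch. SWAP_BLOCK and SWAP_DOWHILE are hypothetical macros written for this example, not ICU's actual definitions; they only mimic the two wrapping styles being compared.

```c
#include <assert.h>

/* Hypothetical macros for illustration only (not taken from ICU). */
#define SWAP_BLOCK(a, b)   { int tmp_ = (a); (a) = (b); (b) = tmp_; }
#define SWAP_DOWHILE(a, b) do { int tmp_ = (a); (a) = (b); (b) = tmp_; } while (0)

/* Bare-block style: the caller has to brace the if body.  Writing
 *     if (*x > *y) SWAP_BLOCK(*x, *y); else ...
 * expands to "{ ... }; else ...", and the stray ';' ends the if
 * statement, so the else fails to compile. */
static void order_block(int *x, int *y)
{
    assert(x && y);
    if (*x > *y) {
        SWAP_BLOCK(*x, *y);
    } else {
        /* already in order */
    }
}

/* do/while(0) style: the trailing ';' completes the do statement, so the
 * macro can be used like a function call in an unbraced if/else. */
static void order_dowhile(int *x, int *y)
{
    assert(x && y);
    if (*x > *y)
        SWAP_DOWHILE(*x, *y);
    else
        (void)0; /* already in order */
}
```

Note that with the bare-block form, SWAP_BLOCK(a, b) written without a trailing semicolon still compiles, which is the "looks like it shouldn't compile but does" case. The do/while(0) form rejects that.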