thr3ads.net - Xapian devel - [Xapian-devel] ICU [Apr 2006]

Olly Betts

2006-Apr-10 03:44 UTC

[Xapian-devel] ICU

I've just been looking at ICU with an eye to reworking the unicode
queryparser patch to use it.  A few things have jumped out so far which
make me wonder if it's the best option.  I don't really know what the
alternatives are though (currently QueryParser uses glib's unicode
routines).

The first is that there seems to be bad version skew.  Ubuntu breezy
(the latest release) has ICU 2.1 and 2.8 packaged, as does Debian sarge
(the latest stable release).  The latest ICU version is 3.4.1 (and
debian unstable only has this version).  I can't seem to find what's
changed between ICU versions (except for release notes for 3.2 and 3.4
versions), so I worry this is going to be a hassle.

The second is that all the multi-statement macro definitions in their
headers are just enclosed in a block "{...}" instead of using the
familiar "do {...} while (0)" trick to avoid surprise when used in
places where an extra ";" matters.
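
To illustrate the idiom mentioned above, here is a minimal sketch in C.  The
macro APPEND_BYTE and the function ascii_only are hypothetical names made up
for this example; they are not taken from the ICU headers.  The point is only
that a do {...} while (0) wrapper lets a multi-statement macro sit in an
unbraced if/else like an ordinary function call.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical macro (not from ICU): store one byte and advance the index.
 * The do { ... } while (0) wrapper turns the two statements into a single
 * statement that still expects the caller's trailing semicolon. */
#define APPEND_BYTE(buf, i, b) \
    do { \
        (buf)[(i)] = (unsigned char)(b); \
        ++(i); \
    } while (0)

/* Copy len bytes from s to out, replacing every byte >= 0x80 with '?'.
 * Returns the number of bytes written (always len). */
static size_t ascii_only(const unsigned char *s, size_t len,
                         unsigned char *out)
{
    size_t n = 0;
    size_t k;
    assert(s != NULL && out != NULL);
    for (k = 0; k < len; ++k) {
        /* Unbraced if/else: this compiles because each macro call expands
         * to exactly one statement.  If the macro were a bare { ... }
         * block, the ';' before 'else' would be an extra empty statement
         * ending the if, and the 'else' would be a syntax error. */
        if (s[k] < 0x80)
            APPEND_BYTE(out, n, s[k]);
        else
            APPEND_BYTE(out, n, '?');
    }
    return n;
}
```

With a brace-only macro the same function would need explicit braces around
both branches, which is what the header comment quoted below asks callers
to do.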

This doesn't seem to rate a mention in the user guide, but e.g.
/usr/include/unicode/utf8.h says:

 * <em>Usage:</em>
 * ICU coding guidelines for if() statements should be followed when using these macros.
 * Compound statements (curly braces {}) must be used for if-else-while...
 * bodies and all macro statements should be terminated with semicolon.

I don't really like the attitude that *I* have to follow *their* coding
guidelines in my own code!  If I'm contributing code to their project
then I agree it's reasonable to expect adherence to their coding
standards, but not just to use their library.

By eschewing the standard idiom for wrapping multiline macro calls,
they're forcing the risk of silent miscompilation on their users.

Finally, they use UTF-16 as their internal representation whereas we
want to use UTF-8.  For the queryparser, this isn't an issue as
there are macros for decoding UTF-8 characters and for saying if a
unicode code point is upper case, etc.  But in omindex we want to
be able to convert between encodings, and it looks like we have to
go via UTF-16.  I suspect we'd end up writing our own ISO-8859-1
to UTF-8 convertor (that's probably the most common conversion we'd
need).
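
For a sense of scale, a hand-written ISO-8859-1 to UTF-8 converter can be very
small, because every ISO-8859-1 byte maps directly to the Unicode code point
with the same value.  The following is a minimal sketch in C; the function
name latin1_to_utf8 and its signature are assumptions made for this example,
not existing Xapian or ICU code.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper: convert len bytes of ISO-8859-1 to UTF-8.
 * The caller must provide an output buffer of at least 2 * len bytes,
 * since each input byte becomes at most two output bytes.
 * Returns the number of bytes written. */
static size_t latin1_to_utf8(const unsigned char *in, size_t len,
                             unsigned char *out)
{
    size_t n = 0;
    size_t i;
    assert(in != NULL && out != NULL);
    for (i = 0; i < len; ++i) {
        unsigned char c = in[i];
        if (c < 0x80) {
            /* ASCII is encoded identically in both. */
            out[n++] = c;
        } else {
            /* U+0080..U+00FF: two-byte form 110000xx 10xxxxxx. */
            out[n++] = (unsigned char)(0xC0 | (c >> 6));
            out[n++] = (unsigned char)(0x80 | (c & 0x3F));
        }
    }
    return n;
}
```

For example, the ISO-8859-1 byte 0xE9 (e with acute accent) becomes the two
bytes 0xC3 0xA9.  Other source encodings need lookup tables, which is where a
library starts to earn its keep.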

Cheers,
    Olly

Olly Betts

2006-Apr-11 01:55 UTC

[Xapian-devel] ICU

On Mon, Apr 10, 2006 at 04:44:21AM +0100, Olly Betts wrote:
> By eschewing the standard idiom for wrapping multiline macro calls,
> they're forcing the risk of silent miscompilation on their users.

I've been thinking about this, and I think there's actually no silent
miscompilation risk here (I was thinking of the "dangling else" issue
but this is a rather different situation).

I think the extra semicolon can only stop code which looks like it
should work from compiling (and on the flip side code which looks like
it shouldn't compile can - e.g. if you accidentally omit the semicolon).

But I still think it's a bloody-minded attitude.  A macro which looks
like a function should work as one.

Michael Schlenker pointed me at the unicode handling code in Tcl.  It
looks very well done - the source file which provides utf8 handling and
unicode codepoint identification for the BMP (i.e. codepoints below
0x10000) compiles to a 28K object file (x86-64) which I think is all
the unicode support which the QueryParser class needs.  I think here
the evils of cut-and-paste code reuse are less than the annoyance of
adding a large library dependency to the core library.
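
To give an idea of how little code UTF-8 decoding restricted to the BMP needs,
here is a minimal sketch in C.  It is not Tcl's implementation; the function
name utf8_decode_bmp and its interface are assumptions made for this example.
Codepoint classification (upper case, etc.) would additionally need lookup
tables, which are not shown.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper: decode one UTF-8 sequence of 1 to 3 bytes, i.e. a
 * codepoint below 0x10000.  On success stores the codepoint in *cp and
 * returns the number of bytes consumed.  Returns 0 for truncated, invalid,
 * overlong, or surrogate input, and for 4-byte (non-BMP) sequences. */
static size_t utf8_decode_bmp(const unsigned char *s, size_t len,
                              unsigned *cp)
{
    assert(s != NULL && cp != NULL);
    if (len == 0)
        return 0;
    if (s[0] < 0x80) {                      /* 0xxxxxxx */
        *cp = s[0];
        return 1;
    }
    if ((s[0] & 0xE0) == 0xC0) {            /* 110xxxxx 10xxxxxx */
        if (len < 2 || (s[1] & 0xC0) != 0x80)
            return 0;
        *cp = ((unsigned)(s[0] & 0x1F) << 6) | (unsigned)(s[1] & 0x3F);
        return *cp >= 0x80 ? 2 : 0;         /* reject overlong forms */
    }
    if ((s[0] & 0xF0) == 0xE0) {            /* 1110xxxx 10xxxxxx 10xxxxxx */
        if (len < 3 || (s[1] & 0xC0) != 0x80 || (s[2] & 0xC0) != 0x80)
            return 0;
        *cp = ((unsigned)(s[0] & 0x0F) << 12)
            | ((unsigned)(s[1] & 0x3F) << 6)
            | (unsigned)(s[2] & 0x3F);
        if (*cp < 0x800)                    /* overlong form */
            return 0;
        if (*cp >= 0xD800 && *cp <= 0xDFFF) /* surrogates are not valid */
            return 0;
        return 3;
    }
    return 0;   /* stray continuation byte or a sequence beyond the BMP */
}
```

A caller would loop over the input, advancing by the returned length and
treating a return value of 0 as a decoding error.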

For Omega we also need encoding conversion, which I think inevitably
needs a large bit of code or data.  Tcl's code for this is compact, but
has 1.3MB of data files.  I don't see as much of an issue with adding a large
library dependency to omega, be it ICU, glib, using Tcl's code, or using
an installed version of Tcl.  Or something else.

What's a good option partly comes down to "what are people likely to
have installed anyway".  Looking at the debian "popcon" results, the
answer seems to be glib, then Tcl, then ICU.  But the spread isn't
great and ICU is pretty common (openoffice uses it I believe).  Not sure
how representative the numbers are though, and they may be rather
different for non-Linux platforms.

Cheers,
    Olly
