The following caused a hard-to-diagnose problem for a user of the survey package. Presumably this is a strange Unicode thing, but is there a convenient reference for how the collation order is determined? I am surprised that adding the same character to the end of two strings of the same length can change the sorting order.

in en_US.utf8 locale
> "1//"<"10/"
[1] TRUE
> "1//2"<"10/2"
[1] FALSE

in C locale on the same system.
> "1//"<"10/"
[1] TRUE
> "1//2"<"10/2"
[1] TRUE

[This is in r-devel of March 6, but the problem that was reported to me involved Windows vs Linux on released versions]

    -thomas

Thomas Lumley                   Assoc. Professor, Biostatistics
tlumley at u.washington.edu     University of Washington, Seattle
On Mar 17, 2006, at 4:32 PM, Thomas Lumley wrote:

> The following caused a hard-to-diagnose problem for a user of the
> survey package. Presumably this is a strange Unicode thing,

It is independent of the encoding:

urbanek at corrino:~$ LC_COLLATE=en_US R --vanilla -q<tr
> "1//"<"10/"
[1] TRUE
> "1//2"<"10/2"
[1] FALSE
> Sys.getlocale("LC_COLLATE")
[1] "en_US"

(en_US is ISO-8859-1 on that machine)

And systems don't seem to agree on anything but the C locale:

Mac OS X:
caladan:urbanek$ LC_COLLATE=en_US R --vanilla -q<tr
> "1//"<"10/"
[1] TRUE
> "1//2"<"10/2"
[1] TRUE
> Sys.getlocale("LC_COLLATE")
[1] "en_US"

IRIX:
fry:urbanek$ LC_COLLATE=en_US R --vanilla -q<tr
> "1//"<"10/"
[1] FALSE
> "1//2"<"10/2"
[1] FALSE
> Sys.getlocale("LC_COLLATE")
[1] "en_US"

But at least most systems are consistent when the same character is appended to both strings, except for GNU/Linux. Looking at the locale definitions, GNU/Linux uses the "iso14651_t1" template for many languages. Maybe the problem is that "/" is defined in the "SPECIAL" section of the ISO-14651 template, which possibly causes / to be completely ignored in the "LATIN" part. That would explain the behavior (("1"<"10")==TRUE, ("12"<"102")==FALSE).

I couldn't find anything on what the "official" en_** collating should be, so I have no idea whether this is a bug in the GNU/Linux locales or not...

Cheers,
Simon
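
[Editorial sketch] The hypothesis above, that "/" is ignored when the remaining characters are compared, can be tried out directly in R. The helper name less_ignoring_slash is made up for this illustration and is not part of R or of any locale implementation. It only imitates the suspected first pass: a real collation would also apply further tie-breaking levels when the stripped strings are equal, and this sketch does not attempt that.

```r
# Hypothetical helper: compare two strings the way a collation that ignores
# "/" on its first pass might, i.e. compare what is left once "/" is dropped.
less_ignoring_slash <- function(a, b) {
  old <- Sys.getlocale("LC_COLLATE")
  on.exit(Sys.setlocale("LC_COLLATE", old), add = TRUE)  # restore on exit
  Sys.setlocale("LC_COLLATE", "C")  # plain byte order for what remains
  gsub("/", "", a, fixed = TRUE) < gsub("/", "", b, fixed = TRUE)
}

less_ignoring_slash("1//", "10/")    # compares "1" with "10":   TRUE
less_ignoring_slash("1//2", "10/2")  # compares "12" with "102": FALSE
```

Both results match the GNU/Linux en_US output quoted above, which is consistent with the hypothesis but does not prove it.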
Thomas Lumley <tlumley at u.washington.edu> writes:

> The following caused a hard-to-diagnose problem for a user of the survey
> package. Presumably this is a strange Unicode thing, but is there a
> convenient reference for how the collation order is determined? I am
> surprised that adding the same character to the end of two strings of the
> same length can change the sorting order.
>
> in en_US.utf8 locale
> > "1//"<"10/"
> [1] TRUE
> > "1//2"<"10/2"
> [1] FALSE
>
> in C locale on the same system.
> > "1//"<"10/"
> [1] TRUE
> > "1//2"<"10/2"
> [1] TRUE
>
> [This is in r-devel of March 6, but the problem that was reported to me
> involved Windows vs Linux on released versions]

Unicode has nothing to do with it (same thing in ISO-8859-1). It is (I think) about characters being skipped during collating, i.e. the same effect as this:

> Sys.setlocale(locale="C")
[1] "C"
> "Thomas O'Malley" < "Thomas Lumley"
[1] TRUE
> Sys.setlocale(locale="en_US.UTF8")
[1] "LC_CTYPE=en_US.UTF8;LC_NUMERIC=C;LC_TIME=en_US.UTF8;LC_COLLATE=en_US.UTF8;LC_MONETARY=en_US.UTF8;LC_MESSAGES=C;LC_PAPER=C;LC_NAME=C;LC_ADDRESS=C;LC_TELEPHONE=C;LC_MEASUREMENT=C;LC_IDENTIFICATION=C"
> "Thomas O'Malley" <" Thomas Lumley"
[1] FALSE

-- 
   O__  ---- Peter Dalgaard             Øster Farimagsgade 5, Entr.B
  c/ /'_ --- Dept. of Biostatistics     PO Box 2099, 1014 Cph. K
 (*) \(*) -- University of Copenhagen   Denmark          Ph:  (+45) 35327918
~~~~~~~~~~ - (p.dalgaard at biostat.ku.dk)                  FAX: (+45) 35327907
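
[Editorial sketch] Calling Sys.setlocale() as in the transcript above changes the collation for the rest of the session. To check how a comparison behaves under the C locale without leaving the session changed, one option is a small wrapper. The name with_c_collation is hypothetical and not part of base R. The sketch relies on R's lazy evaluation of function arguments, so the comparison is only evaluated after the locale has been switched.

```r
# Hypothetical helper (not part of base R): evaluate an expression with
# LC_COLLATE temporarily set to "C", then restore the previous setting.
with_c_collation <- function(expr) {
  old <- Sys.getlocale("LC_COLLATE")
  on.exit(Sys.setlocale("LC_COLLATE", old), add = TRUE)
  Sys.setlocale("LC_COLLATE", "C")
  force(expr)  # the argument is evaluated here, under "C" collation
}

with_c_collation("1//" < "10/")    # TRUE, as in the C-locale results quoted above
with_c_collation("1//2" < "10/2")  # TRUE, as in the C-locale results quoted above
```

Code that has to give the same ordering on every platform could wrap its comparisons or sorts this way, at the cost of giving up language-aware ordering.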