thr3ads.net - R devel - [Rd] collation order [Mar 2006]

Thomas Lumley

2006-Mar-17 21:32 UTC

[Rd] collation order

The following caused a hard-to-diagnose problem for a user of the survey 
package.  Presumably this is a strange Unicode thing, but is there a 
convenient reference for how the collation order is determined? I am 
surprised that adding the same character to the end of two strings of the 
same length can change the sorting order.

in en_US.utf8 locale
> "1//"<"10/"
[1] TRUE
> "1//2"<"10/2"
[1] FALSE

in C locale on the same system.
> "1//"<"10/"
[1] TRUE
> "1//2"<"10/2"
[1] TRUE

[This is in r-devel of March 6, but the problem that was reported to me 
involved Windows vs Linux on released versions]

 	-thomas

Thomas Lumley			Assoc. Professor, Biostatistics
tlumley at u.washington.edu	University of Washington, Seattle

Simon Urbanek

2006-Mar-17 22:32 UTC

On Mar 17, 2006, at 4:32 PM, Thomas Lumley wrote:
> The following caused a hard-to-diagnose problem for a user of the  
> survey package.  Presumably this is a strange Unicode thing,
It is independent of the encoding:

urbanek at corrino:~$ LC_COLLATE=en_US R --vanilla -q<tr
 > "1//"<"10/"
[1] TRUE
 > "1//2"<"10/2"
[1] FALSE
 > Sys.getlocale("LC_COLLATE")
[1] "en_US"

(en_US is ISO-8859-1 on that machine)

And systems don't seem to agree on anything but C locale:

Mac OS X:
caladan:urbanek$ LC_COLLATE=en_US R --vanilla -q<tr
 > "1//"<"10/"
[1] TRUE
 > "1//2"<"10/2"
[1] TRUE
 > Sys.getlocale("LC_COLLATE")
[1] "en_US"

IRIX:
fry:urbanek$ LC_COLLATE=en_US R --vanilla -q<tr
 > "1//"<"10/"
[1] FALSE
 > "1//2"<"10/2"
[1] FALSE
 > Sys.getlocale("LC_COLLATE")
[1] "en_US"

But at least most systems are consistent in terms of adding a  
character, except for GNU/Linux.
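
Since the C locale is the only one the systems above agree on, one way to get the same ordering everywhere is to pin the comparison to C collation. What follows is a minimal sketch and not something proposed in this thread: `c_less` is a hypothetical helper name, and `sort(method = "radix")` exists only in newer versions of R, where its documentation says character vectors are ordered as in the C locale.

```r
# Hypothetical helper: compare two strings under C collation and
# restore the previous LC_COLLATE setting when the function exits.
c_less <- function(a, b) {
  old <- Sys.getlocale("LC_COLLATE")
  on.exit(Sys.setlocale("LC_COLLATE", old), add = TRUE)
  Sys.setlocale("LC_COLLATE", "C")
  a < b
}

c_less("1//", "10/")    # byte-wise comparison: "/" sorts before "0"
c_less("1//2", "10/2")

# Sorting without touching the session locale: radix sort orders
# character vectors as in the C locale.
sort(c("10/2", "1//2"), method = "radix")
```

Both comparisons come out the same way here, which matches the C-locale results reported at the top of the thread.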

Looking at the locale definitions, GNU/Linux uses the "iso14651_t1"
template for many languages. Maybe the problem is that "/" is defined
in the "SPECIAL" section of the ISO-14651 template, which possibly
causes "/" to be completely ignored in the "LATIN" part. That would
explain the behavior: ("1"<"10")==TRUE, but ("12"<"102")==FALSE.
I couldn't find anything on what the "official" en_** collation should
be, so I have no idea whether this is a bug in the GNU/Linux locales
or not...
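
The hypothesis above can be checked with a small simulation. This is a sketch under stated assumptions, not a description of what glibc actually does: `first_level_less` is a hypothetical helper that simply deletes the characters assumed to be ignored and compares what is left under C collation. Real locale definitions use several comparison levels, which this model leaves out.

```r
# Hypothetical model of the suspected behaviour: drop the characters the
# locale is assumed to ignore, then compare the remainder in C collation.
first_level_less <- function(a, b, ignore = "/") {
  old <- Sys.getlocale("LC_COLLATE")
  on.exit(Sys.setlocale("LC_COLLATE", old), add = TRUE)
  Sys.setlocale("LC_COLLATE", "C")
  strip <- function(s) gsub(ignore, "", s, fixed = TRUE)
  strip(a) < strip(b)
}

first_level_less("1//", "10/")    # compares "1" with "10"
first_level_less("1//2", "10/2")  # compares "12" with "102"
```

Under this model the two calls give TRUE and FALSE, the same pattern as the en_US results on GNU/Linux, so the "ignored character" explanation is at least consistent with what was observed.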

Cheers,
Simon

Peter Dalgaard

2006-Mar-17 22:56 UTC

Thomas Lumley <tlumley at u.washington.edu> writes:
> The following caused a hard-to-diagnose problem for a user of the survey 
> package.  Presumably this is a strange Unicode thing, but is there a 
> convenient reference for how the collation order is determined? I am 
> surprised that adding the same character to the end of two strings of the 
> same length can change the sorting order.
> 
> in en_US.utf8 locale
> > "1//"<"10/"
> [1] TRUE
> > "1//2"<"10/2"
> [1] FALSE
> 
> in C locale on the same system.
> > "1//"<"10/"
> [1] TRUE
> > "1//2"<"10/2"
> [1] TRUE
> 
> [This is in r-devel of March 6, but the problem that was reported to me 
> involved Windows vs Linux on released versions]
Unicode has nothing to do with it (same thing in ISO-8859-1). It is
(I think) about characters being skipped during collating, i.e. the
same effect as this:
> Sys.setlocale(locale="C")
[1] "C"
> "Thomas  O'Malley" < "Thomas Lumley"
[1] TRUE
> Sys.setlocale(locale="en_US.UTF8")
[1] "LC_CTYPE=en_US.UTF8;LC_NUMERIC=C;LC_TIME=en_US.UTF8;LC_COLLATE=en_US.UTF8;LC_MONETARY=en_US.UTF8;LC_MESSAGES=C;LC_PAPER=C;LC_NAME=C;LC_ADDRESS=C;LC_TELEPHONE=C;LC_MEASUREMENT=C;LC_IDENTIFICATION=C"
> "Thomas  O'Malley" < "Thomas Lumley"
[1] FALSE

> 
>  	-thomas
> 
> Thomas Lumley			Assoc. Professor, Biostatistics
> tlumley at u.washington.edu	University of Washington, Seattle
> 
> ______________________________________________
> R-devel at r-project.org mailing list
> https://stat.ethz.ch/mailman/listinfo/r-devel
> 
-- 
   O__  ---- Peter Dalgaard             Øster Farimagsgade 5, Entr.B
  c/ /'_ --- Dept. of Biostatistics     PO Box 2099, 1014 Cph. K
 (*) \(*) -- University of Copenhagen   Denmark          Ph:  (+45) 35327918
~~~~~~~~~~ - (p.dalgaard at biostat.ku.dk)                  FAX: (+45) 35327907
