thr3ads.net - R devel - [Rd] bug in rank(), order(), is.unsorted() on character vector [Dec 2011]


Hervé Pagès

2011-Dec-07 09:41 UTC

[Rd] bug in rank(), order(), is.unsorted() on character vector

Hi,

This looks OK:

 > x <- c("_1_", "1_9", "2_9")
 > rank(x)
[1] 1 2 3

But this does not:

 > xa <- paste(x, "a", sep="")
 > xa
[1] "_1_a" "1_9a" "2_9a"
 > rank(xa)
[1] 2 1 3

Cheers,
H.

 > sessionInfo()
R version 2.14.0 (2011-10-31)
Platform: x86_64-unknown-linux-gnu (64-bit)

locale:
  [1] LC_CTYPE=en_CA.UTF-8       LC_NUMERIC=C
  [3] LC_TIME=en_CA.UTF-8        LC_COLLATE=en_CA.UTF-8
  [5] LC_MONETARY=en_CA.UTF-8    LC_MESSAGES=en_CA.UTF-8
  [7] LC_PAPER=C                 LC_NAME=C
  [9] LC_ADDRESS=C               LC_TELEPHONE=C
[11] LC_MEASUREMENT=en_CA.UTF-8 LC_IDENTIFICATION=C

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base

loaded via a namespace (and not attached):
[1] tools_2.14.0


-- 
Hervé Pagès

Program in Computational Biology
Division of Public Health Sciences
Fred Hutchinson Cancer Research Center
1100 Fairview Ave. N, M1-B514
P.O. Box 19024
Seattle, WA 98109-1024

E-mail: hpages at fhcrc.org
Phone:  (206) 667-5791
Fax:    (206) 667-1319

Barry Rowlingson

2011-Dec-07 10:17 UTC

2011/12/7 Hervé Pagès <hpages at fhcrc.org>:
> rank(xa)

See help(Comparison), specifically:

"Beware of making _any_ assumptions about the
     collation order" followed by "Collation of
     non-letters (spaces, punctuation signs, hyphens, fractions and so
     on) is even more problematic."

Barry

Joris Meys

2011-Dec-07 14:48 UTC

@Barry: regardless of whether '_' comes before or after '1', it
should be consistent. Adding an 'a' shouldn't shift '_' from before
'1' to between '1' and '2'; that's clearly an error. The help files
say nothing about that. The only thing I can imagine is that '_'
gets ignored (in that case 19a would rank before 1a).
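
The "'_' gets ignored" guess can be checked by hand. What follows is only a sketch, assuming a later R (3.3.0 or newer) in which order() accepts method = "radix", which compares character vectors as in the C locale. The helper names strip_underscores and rank_bytes are made up for illustration and are not part of R.

```r
x  <- c("_1_", "1_9", "2_9")
xa <- paste(x, "a", sep = "")

# Hypothetical helper: drop every underscore, mimicking a collation
# that ignores '_' altogether.
strip_underscores <- function(s) gsub("_", "", s, fixed = TRUE)

# Hypothetical helper: ranks under plain byte-wise comparison.
# order(..., method = "radix") sorts character vectors as in the C locale;
# order(order(.)) turns that permutation into ranks (valid here: no ties).
rank_bytes <- function(s) order(order(s, method = "radix"))

strip_underscores(x)               # "1"  "19"  "29"
rank_bytes(strip_underscores(x))   # 1 2 3
strip_underscores(xa)              # "1a" "19a" "29a"
rank_bytes(strip_underscores(xa))  # 2 1 3
```

Both results match the ranks Hervé reported, so the guess is at least consistent with a locale that gives '_' little or no weight. It does not prove that this is what en_CA.UTF-8 actually does.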

This said, I can't reproduce.
> x <- c("_1_", "1_9", "2_9")
> xa <- paste(x,'a',sep='')
> rank(x)
[1] 1 2 3
> rank(xa)
[1] 1 2 3
> sessionInfo()
R version 2.14.0 Patched (2006-00-00 r00000)
Platform: i386-pc-mingw32/i386 (32-bit)

locale:
[1] LC_COLLATE=English_United States.1252
[2] LC_CTYPE=English_United States.1252
[3] LC_MONETARY=English_United States.1252
[4] LC_NUMERIC=C
[5] LC_TIME=English_United States.1252

attached base packages:
[1] grDevices datasets  splines   graphics  stats     tcltk     utils
[8] methods   base

other attached packages:
[1] svSocket_0.9-51 TinnR_1.0.3     R2HTML_2.2      Hmisc_3.8-3
[5] survival_2.36-9

loaded via a namespace (and not attached):
[1] cluster_1.14.1  grid_2.14.0     lattice_0.19-33 svMisc_0.9-63
[5] tools_2.14.0



-- 
Joris Meys
Statistical consultant

Ghent University
Faculty of Bioscience Engineering
Department of Mathematical Modelling, Statistics and Bio-Informatics

tel : +32 9 264 59 87
Joris.Meys at Ugent.be
-------------------------------
Disclaimer : http://helpdesk.ugent.be/e-maildisclaimer.php

Roebuck,Paul L

2011-Dec-07 18:29 UTC

Do this first and try again.

R> Sys.setlocale("LC_COLLATE", "C")
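
For reference, a minimal sketch of that suggestion applied to the vectors from the original report, assuming nothing beyond base R. It saves the current collation setting and restores it afterwards, so the rest of the session is unaffected. The variable names old, r_x and r_xa are illustrative only.

```r
x  <- c("_1_", "1_9", "2_9")
xa <- paste(x, "a", sep = "")

# Remember the current collation locale so it can be restored later.
old <- Sys.getlocale("LC_COLLATE")

# In the C locale strings are compared byte by byte, and in ASCII
# '_' (95) sorts after the digits '1' (49) and '2' (50).
Sys.setlocale("LC_COLLATE", "C")
r_x  <- rank(x)
r_xa <- rank(xa)

# Put the original collation locale back.
Sys.setlocale("LC_COLLATE", old)

print(r_x)   # expected under C collation: 3 1 2
print(r_xa)  # expected under C collation: 3 1 2
```

With byte-wise comparison the two vectors should rank the same way, because appending "a" to every element cannot change the result of comparing the leading characters.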



Hervé Pagès

2011-Dec-08 18:26 UTC

Hi Barry,

Hope you don't mind if I put this back on the list.

On 11-12-08 05:50 AM, Barry Rowlingson wrote:
> 2011/12/8 Hervé Pagès <hpages at fhcrc.org>:
>
>> A naive question: wouldn't everything be simpler if LC_COLLATE=C
>> was the default for everybody?
>
>   Yet when we Brits suggest everything would be simpler if the whole
> world spoke the Queen's English it causes all sorts of trouble...

:-) Sure, I see your point.

But this is a programming language, used by a lot of researchers.
And having the result of an analysis depend on a crazy collation
order causes all sorts of trouble too.

Note that trying to strike back the Empire is a lost battle anyway.
When you use R (as a user or a developer), any function name you
type (sort, rank, print, summary, etc...) is in Queen's English.
And their man pages too.

Also note that I was just talking about the *default*. AFAIK other
very serious projects like Python or SQLite *by default* use a
collating sequence that behaves like LC_COLLATE=C on strings
that contain ASCII chars only. And they let you change that if you
want. Are they being imperialist? Most R users/developers are in
research or academia, where I suspect consistency and reproducibility
are an even bigger deal than in the Python or SQLite communities.

Cheers,
H.

>
> Barry

