Hervé Pagès
2011-Dec-07 09:41 UTC
[Rd] bug in rank(), order(), is.unsorted() on character vector
Hi,
This looks OK:
> x <- c("_1_", "1_9", "2_9")
> rank(x)
[1] 1 2 3
But this does not:
> xa <- paste(x, "a", sep="")
> xa
[1] "_1_a" "1_9a" "2_9a"
> rank(xa)
[1] 2 1 3
Cheers,
H.
> sessionInfo()
R version 2.14.0 (2011-10-31)
Platform: x86_64-unknown-linux-gnu (64-bit)
locale:
[1] LC_CTYPE=en_CA.UTF-8 LC_NUMERIC=C
[3] LC_TIME=en_CA.UTF-8 LC_COLLATE=en_CA.UTF-8
[5] LC_MONETARY=en_CA.UTF-8 LC_MESSAGES=en_CA.UTF-8
[7] LC_PAPER=C LC_NAME=C
[9] LC_ADDRESS=C LC_TELEPHONE=C
[11] LC_MEASUREMENT=en_CA.UTF-8 LC_IDENTIFICATION=C
attached base packages:
[1] stats graphics grDevices utils datasets methods base
loaded via a namespace (and not attached):
[1] tools_2.14.0
--
Hervé Pagès
Program in Computational Biology
Division of Public Health Sciences
Fred Hutchinson Cancer Research Center
1100 Fairview Ave. N, M1-B514
P.O. Box 19024
Seattle, WA 98109-1024
E-mail: hpages at fhcrc.org
Phone: (206) 667-5791
Fax: (206) 667-1319
Barry Rowlingson
2011-Dec-07 10:17 UTC
[Rd] bug in rank(), order(), is.unsorted() on character vector
2011/12/7 Hervé Pagès <hpages at fhcrc.org>:
> rank(xa)

See help(Comparison), specifically:

"Beware of making _any_ assumptions about the collation order"

followed by

"Collation of non-letters (spaces, punctuation signs, hyphens, fractions and so on) is even more problematic."

Barry
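A small sketch of the point made in help(Comparison): the same characters can sort differently depending on the collation in effect. The vector `chars` is just an illustration, and `method = "radix"` (which sorts character vectors in C-locale byte order) assumes R >= 3.3.0, so it is not available in the R 2.14.0 sessions shown in this thread.

```r
# Which collation is the current session using?
Sys.getlocale("LC_COLLATE")

# Hypothetical test vector mixing punctuation, digits and letters.
chars <- c("_", "1", "9", "a", "A")

# Locale-dependent: the result varies with LC_COLLATE, so no output is shown.
sort(chars)

# Locale-independent: radix sort compares bytes, as LC_COLLATE=C would.
c_sorted <- sort(chars, method = "radix")
c_sorted
# [1] "1" "9" "A" "_" "a"
```

In byte order "_" (0x5F) falls between the upper-case and lower-case letters, which is rarely what a language-aware locale does with it.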
Joris Meys
2011-Dec-07 14:48 UTC
[Rd] bug in rank(), order(), is.unsorted() on character vector
@Barry: regardless of whether '_' comes before or after '1', it should be consistent. Adding an 'a' shouldn't shift '_' from before '1' to between '1' and '2'; that's clearly an error. The help files don't say anything about that. The only thing I can imagine is that '_' gets ignored (in that case 19a would rank before 1a).

That said, I can't reproduce:

> x <- c("_1_", "1_9", "2_9")
> xa <- paste(x,'a',sep='')
> rank(x)
[1] 1 2 3
> rank(xa)
[1] 1 2 3
> sessionInfo()
R version 2.14.0 Patched (2006-00-00 r00000)
Platform: i386-pc-mingw32/i386 (32-bit)
locale:
[1] LC_COLLATE=English_United States.1252 LC_CTYPE=English_United States.1252 LC_MONETARY=English_United States.1252
[4] LC_NUMERIC=C LC_TIME=English_United States.1252
attached base packages:
[1] grDevices datasets splines graphics stats tcltk utils methods base
other attached packages:
[1] svSocket_0.9-51 TinnR_1.0.3 R2HTML_2.2 Hmisc_3.8-3 survival_2.36-9
loaded via a namespace (and not attached):
[1] cluster_1.14.1 grid_2.14.0 lattice_0.19-33 svMisc_0.9-63 tools_2.14.0
--
Joris Meys
Statistical consultant
Ghent University
Faculty of Bioscience Engineering
Department of Mathematical Modelling, Statistics and Bio-Informatics
tel : +32 9 264 59 87
Joris.Meys at Ugent.be
-------------------------------
Disclaimer : http://helpdesk.ugent.be/e-maildisclaimer.php
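The "'_' gets ignored" guess above can be checked by hand. This is only a sketch under assumptions: it simulates ignoring '_' by deleting it, and it breaks the remaining comparison in byte order with `method = "radix"` (R >= 3.3.0). The helpers `strip()` and `c_rank()` are made up for the illustration. It says nothing definite about what the en_CA.UTF-8 collation actually does internally.

```r
x  <- c("_1_", "1_9", "2_9")
xa <- paste(x, "a", sep = "")

# Hypothetical helper: pretend the collation ignores '_' entirely.
strip <- function(s) gsub("_", "", s, fixed = TRUE)

# Hypothetical helper: ranks in byte order, independent of LC_COLLATE
# (valid here because the inputs contain no ties).
c_rank <- function(s) order(order(s, method = "radix"))

strip(x)           # "1"  "19"  "29"
c_rank(strip(x))
# [1] 1 2 3

strip(xa)          # "1a" "19a" "29a"
c_rank(strip(xa))
# [1] 2 1 3
```

With the underscores gone, "19a" sorts before "1a" because '9' precedes 'a', which matches the 2 1 3 reported from the Linux session. That is consistent with the hypothesis, though it does not prove it.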
Roebuck,Paul L
2011-Dec-07 18:29 UTC
[Rd] bug in rank(), order(), is.unsorted() on character vector
Do this first and try again.
R> Sys.setlocale("LC_COLLATE", "C")
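For anyone trying this, a minimal sketch of the suggestion that also puts the previous collation back afterwards. The variable names `old`, `r_x` and `r_xa` are only for illustration.

```r
x  <- c("_1_", "1_9", "2_9")
xa <- paste(x, "a", sep = "")

# Remember the current collation so it can be restored.
old <- Sys.getlocale("LC_COLLATE")
Sys.setlocale("LC_COLLATE", "C")

# In the C locale strings compare byte by byte:
# "1" (0x31) < "2" (0x32) < "_" (0x5F).
r_x  <- rank(x)
r_xa <- rank(xa)
r_x
# [1] 3 1 2
r_xa
# [1] 3 1 2

# Restore the original collation.
Sys.setlocale("LC_COLLATE", old)
```

The ranks are not the 1 2 3 from the original report, since '_' now sorts after the digits, but they no longer change when the "a" is appended.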
Hervé Pagès
2011-Dec-08 18:26 UTC
[Rd] bug in rank(), order(), is.unsorted() on character vector
Hi Barry,

Hope you don't mind if I put this back on the list.

On 11-12-08 05:50 AM, Barry Rowlingson wrote:
> 2011/12/8 Hervé Pagès <hpages at fhcrc.org>:
>
>> A naive question: wouldn't everything be simpler if LC_COLLATE=C
>> was the default for everybody?
>
> Yet when we Brits suggest everything would be simpler if the whole
> world spoke the Queen's English it causes all sorts of trouble... :-)

Sure, I see your point. But it's a programming language here, used by a lot of researchers. And having the result of an analysis depend on a crazy collate is causing all sorts of trouble too.

Note that trying to strike back at the Empire is a lost battle anyway. When you use R (as a user or a developer), any function name you type (sort, rank, print, summary, etc.) is in Queen's English. And so are their man pages.

Also note that I was just talking about the *default*. AFAIK other very serious projects like Python or SQLite *by default* use a collating sequence that behaves like LC_COLLATE=C on strings that contain ASCII chars only. And they let you change that if you want. Are they being imperialist?

Most R users/developers are in research or academia, where I suspect consistency and reproducibility are an even bigger deal than in the Python or SQLite community.

Cheers,
H.

>
> Barry