Søren Højsgaard
2011-Jan-24 21:44 UTC
[R] Strange result from sort: sort(c("aa", "ff")) gives "ff" "aa" with R.2.12.1 on windows 7
Dear list,

Please consider the following call of sort

> sort(c("a","f"))
[1] "a" "f"
> sort(c("f","a"))
[1] "a" "f"
>
> sort(c("aa","ff"))
[1] "ff" "aa"
> sort(c("ff","aa"))
[1] "ff" "aa"

The last two results look strange to me. Is that a bug???

The result seems to come from calls to order:

> order(c("a","f"))
[1] 1 2
> order(c("f","a"))
[1] 2 1
>
> order(c("aa","ff"))
[1] 2 1
> order(c("ff","aa"))
[1] 1 2

I get the same results on R.2.12.1, R.2.11.1 and R.2.13.0 on Windows 7. However on Linux, I get the "right answer" (the answer I expected). From the help pages I get the impression that there might be an issue about locale, but I didn't understand the details.

Can anyone tell me what goes on here, please

Regards
Søren

> sessionInfo()
R version 2.12.1 Patched (2010-12-27 r53883)
Platform: i386-pc-mingw32/i386 (32-bit)

locale:
[1] LC_COLLATE=Danish_Denmark.1252  LC_CTYPE=Danish_Denmark.1252
[3] LC_MONETARY=Danish_Denmark.1252 LC_NUMERIC=C
[5] LC_TIME=Danish_Denmark.1252

attached base packages:
[1] stats graphics grDevices utils datasets methods base

other attached packages:
[1] SHDtools_1.0

> sessionInfo()
R version 2.12.1 (2010-12-16)
Platform: i686-pc-linux-gnu (32-bit)

locale:
 [1] LC_CTYPE=en_DK.utf8       LC_NUMERIC=C
 [3] LC_TIME=en_DK.utf8        LC_COLLATE=en_DK.utf8
 [5] LC_MONETARY=C             LC_MESSAGES=en_DK.utf8
 [7] LC_PAPER=en_DK.utf8       LC_NAME=C
 [9] LC_ADDRESS=C              LC_TELEPHONE=C
[11] LC_MEASUREMENT=en_DK.utf8 LC_IDENTIFICATION=C

attached base packages:
[1] stats graphics grDevices utils datasets methods base
Prof Brian Ripley
2011-Jan-24 22:18 UTC
[R] Strange result from sort: sort(c("aa", "ff")) gives "ff" "aa" with R.2.12.1 on windows 7
On Mon, 24 Jan 2011, Søren Højsgaard wrote:

> Dear list,
>
> Please consider the following call of sort
>
>> sort(c("a","f"))
> [1] "a" "f"
>> sort(c("f","a"))
> [1] "a" "f"
>>
>> sort(c("aa","ff"))
> [1] "ff" "aa"
>> sort(c("ff","aa"))
> [1] "ff" "aa"
>
> The last two results look strange to me. Is that a bug???

It seems that you and your OS disagree about Danish, and I'm in no position to know which is correct. But this is not an R issue: the sorting is done by OS services.

> The result seems to come from calls to order:
>
>> order(c("a","f"))
> [1] 1 2
>> order(c("f","a"))
> [1] 2 1
>>
>> order(c("aa","ff"))
> [1] 2 1
>> order(c("ff","aa"))
> [1] 1 2
>
> I get the same results on R.2.12.1, R.2.11.1 and R.2.13.0 on Windows
> 7. However on Linux, I get the "right answer" (the answer I
> expected). From the help pages I get the impression that there might
> be an issue about locale, but I didn't understand the details.
>
> Can anyone tell me what goes on here, please

I recall that 'aa' used to sort at the end of the alphabet in Danish telephone books, so it seems the sort used on Windows thinks so too. See ?Comparison for some further details.

What I don't understand is that someone resident in Denmark finds this strange ....

I get exactly the same in a Danish locale on Mac OS X, for example:

  > sort(c("aa","ff"))
  [1] "ff" "aa"

and also on my Linux box (Fedora 14 with LC_COLLATE=da_DK.utf8)

  > sort(c("aa","ff"))
  [1] "ff" "aa"

en_DK is not a Danish locale (it is English in Denmark).
If you want an English sort, try an English locale for LC_COLLATE (there may well be several, hence 'an').

> Regards
> Søren

--
Brian D. Ripley, ripley at stats.ox.ac.uk
Professor of Applied Statistics, http://www.stats.ox.ac.uk/~ripley/
University of Oxford, Tel: +44 1865 272861 (self)
1 South Parks Road, +44 1865 272866 (PA)
Oxford OX1 3TG, UK Fax: +44 1865 272595
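
The advice above (sorting follows LC_COLLATE, so choose a different collation locale to get a different order) can be sketched roughly as follows. This is only an illustration, not something posted in the thread: sort_in_locale is a hypothetical helper name, and the "C" locale is used here as an assumption because it should be available on every platform, whereas the names of English or Danish locales differ between Windows, Linux and Mac OS X.

```r
# Hypothetical helper (not part of base R): sort a character vector under a
# chosen LC_COLLATE setting and restore the previous setting on exit.
sort_in_locale <- function(x, locale = "C") {
  old <- Sys.getlocale("LC_COLLATE")
  on.exit(Sys.setlocale("LC_COLLATE", old), add = TRUE)
  # Sys.setlocale() returns "" (and warns) when the request cannot be honoured.
  new <- suppressWarnings(Sys.setlocale("LC_COLLATE", locale))
  if (identical(new, "")) stop("locale not available: ", locale)
  sort(x)
}

x <- c("ff", "aa")
Sys.getlocale("LC_COLLATE")  # the collation locale currently in effect
sort(x)                      # depends on that locale, as discussed above
sort_in_locale(x, "C")       # C locale compares bytes, so "aa" comes before "ff"
```

Passing a Danish locale name instead (for example "da_DK.utf8" on Linux or "Danish_Denmark.1252" on Windows, as in the sessions quoted above) should reproduce the "ff" "aa" ordering, provided that locale is installed on the machine.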