The following is not what I expected in sorting characters (single letters
and the same letters with preceding spaces).
Can someone enlighten me as to why the following might be a correct result
for sorting?

; x <- c(LETTERS[1:3], paste(" ", LETTERS[1:3], sep=""))
; x
[1] "A" "B" "C" " A" " B" " C"
; sort(x)
[1] "A" " A" "B" " B" "C" " C"
; sort(x, method="shell")
[1] "A" " A" "B" " B" "C" " C"
; sort(x, method="quick")
[1] "A" " A" "B" " B" "C" " C"

I would expect the result to be " A" " B" " C" "A" "B" "C" instead,
going by ASCII codes (and a quick check with S-Plus 6.2 shows that this is
what S-Plus thinks the sorted sequence is).

Thanks,

Andreas Krause

PS. Version specs:

; version
         _
platform i686-pc-linux-gnu
arch     i686
os       linux-gnu
system   i686, linux-gnu
status
major    1
minor    9.1
year     2004
month    06
day      21
language R
It is documented to depend on your locale. I get

> sort(x)
[1] " A" " B" " C" "A" "B" "C"

in the C locale. The help page does say so:

     The sort order for character vectors will depend on the collating
     sequence of the locale in use: see 'Comparison'.

The default collation sequences for standard locales in Linux distros are
quite unintuitive (and are not character-by-character either). If you want
ASCII, ask for it by LC_COLLATE=C.

On Thu, 19 Aug 2004 andreas.krause at pharma.novartis.com wrote:

> I would expect the result to be " A" " B" " C" "A" "B" "C" instead,
> going by ASCII codes (and a quick check with S-Plus 6.2 shows that this is
> what S-Plus thinks the sorted sequence is).

That explicitly says it uses ASCII. I believe that is a deficiency they
plan to correct.

--
Brian D. Ripley,                  ripley at stats.ox.ac.uk
Professor of Applied Statistics,  http://www.stats.ox.ac.uk/~ripley/
University of Oxford,             Tel:  +44 1865 272861 (self)
1 South Parks Road,                     +44 1865 272866 (PA)
Oxford OX1 3TG, UK                Fax:  +44 1865 272595
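As a minimal sketch of the advice to ask for LC_COLLATE=C, the collation order can also be switched from inside a running R session with Sys.setlocale(). This assumes the "C" locale is available on the system; the variable names `old` and `sorted_c` are illustrative only and do not come from the thread.

```r
x <- c(LETTERS[1:3], paste(" ", LETTERS[1:3], sep = ""))

# Remember the current collation setting so it can be restored afterwards.
old <- Sys.getlocale("LC_COLLATE")

# Switch collation to the C locale, where strings compare by byte value,
# so a leading space (code 32) sorts before "A" (code 65).
Sys.setlocale("LC_COLLATE", "C")
sorted_c <- sort(x)
print(sorted_c)

# Put the previous collation setting back.
Sys.setlocale("LC_COLLATE", old)
```

Saving and restoring the old value keeps the change local to this one sort, which may matter if other code in the session relies on locale-aware ordering.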
Sorting depends on the locale. For example, I get

> x <- c(LETTERS[1:3], paste(" ", LETTERS[1:3], sep=""))
> x
[1] "A" "B" "C" " A" " B" " C"
> sort(x)
[1] " A" " B" " C" "A" "B" "C"
>

On Linux (Fedora Core 1) I set LANG=C, which is necessary for some other
(non-R) things to work.

-roger
I do get (R-1.9.1 on WinXPPro):

> x <- c(LETTERS[1:3], paste(" ", LETTERS[1:3], sep=""))
> sort(x)
[1] " A" " B" " C" "A" "B" "C"

The `right' sequence is locale dependent, and I wonder if that's the
discrepancy here.

Andy
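To check whether two machines disagree because of their locale, one can print the collation setting on each of them. The sketch below also uses method = "radix", which is an assumption about the reader's setup: it exists only in much newer versions of R (3.3.0 and later, not the R 1.9.1 discussed here), where it is documented to sort character vectors in C-locale order regardless of the session locale. The name `sorted_radix` is illustrative.

```r
x <- c(LETTERS[1:3], paste(" ", LETTERS[1:3], sep = ""))

# Show which collation locale this session uses, e.g. "C" or "en_US.UTF-8".
print(Sys.getlocale("LC_COLLATE"))

# Locale-dependent sort: the result can differ between machines.
print(sort(x))

# Radix sort does not respect the locale and uses C-locale (byte) order.
# Assumes R >= 3.3.0, where method = "radix" is available.
sorted_radix <- sort(x, method = "radix")
print(sorted_radix)
```

If the two sort() results differ on a given machine, the session is using a collation sequence other than C.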
Thank you very much to Brian Ripley, Roger Peng, and Andy Liaw.
Everyone pointed out the same solution. Setting LC_COLLATE=C did it.
This default setting is indeed odd to me.

Thanks again!

Andreas Krause