The following is not what I expected in sorting characters (single letters
and the same letters with preceding spaces).
Can someone enlighten me as to why the following might be a correct result
for sorting?
; x <- c(LETTERS[1:3], paste(" ", LETTERS[1:3], sep=""))
; x
[1] "A" "B" "C" " A" " B" " C"
; sort(x)
[1] "A" " A" "B" " B" "C" " C"
; sort(x, method="shell")
[1] "A" " A" "B" " B" "C" " C"
; sort(x, method="quick")
[1] "A" " A" "B" " B" "C" " C"
I would expect the result to be " A" " B" " C" "A" "B" "C" instead,
going by ASCII codes (and a quick check with S-Plus 6.2 shows that this is
what S-Plus thinks the sorted sequence is).
Thanks,
Andreas Krause
PS. Version specs:
; version
_
platform i686-pc-linux-gnu
arch i686
os linux-gnu
system i686, linux-gnu
status
major 1
minor 9.1
year 2004
month 06
day 21
language R
It is documented to depend on your locale. I get

> sort(x)
[1] " A" " B" " C" "A" "B" "C"

in the C locale. The help page does say so:

    The sort order for character vectors will depend on the collating
    sequence of the locale in use: see 'Comparison'.

The default collation sequences for standard locales in Linux distros are
quite unintuitive (and are not character-by-character either). If you want
ASCII, ask for it by LC_COLLATE=C.

On Thu, 19 Aug 2004 andreas.krause at pharma.novartis.com wrote:

> I would expect the result to be " A" " B" " C" "A" "B" "C" instead,
> going by ASCII codes (and a quick check with S-Plus 6.2 shows that this is
> what S-Plus thinks the sorted sequence is).

That explicitly says it uses ASCII. I believe that is a deficiency they plan
to correct.

--
Brian D. Ripley,                  ripley at stats.ox.ac.uk
Professor of Applied Statistics,  http://www.stats.ox.ac.uk/~ripley/
University of Oxford,             Tel:  +44 1865 272861 (self)
1 South Parks Road,                     +44 1865 272866 (PA)
Oxford OX1 3TG, UK                Fax:  +44 1865 272595
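For anyone who wants to try the LC_COLLATE=C suggestion from inside a running session, here is a minimal sketch. It assumes `Sys.getlocale()` and `Sys.setlocale()` are available in your R version; the variable names `old` and `sorted_c` are only illustrative and do not come from the thread.

```r
# Sketch: compare sorting under the current collation and under the C locale.
x <- c(LETTERS[1:3], paste(" ", LETTERS[1:3], sep = ""))

old <- Sys.getlocale("LC_COLLATE")   # remember the current collation setting
sort(x)                              # this result depends on the locale in use

Sys.setlocale("LC_COLLATE", "C")     # switch to plain byte-order collation
sorted_c <- sort(x)
print(sorted_c)                      # the space-prefixed strings now come first

Sys.setlocale("LC_COLLATE", old)     # restore the previous collation
```

Saving and restoring the old value keeps the change from leaking into the rest of the session, which matters if other code relies on locale-aware ordering.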
Sorting depends on the locale. For example, I get
> x <- c(LETTERS[1:3], paste(" ", LETTERS[1:3], sep=""))
> x
[1] "A" "B" "C" " A" " B" " C"
> sort(x)
[1] " A" " B" " C" "A" "B" "C"
On Linux (Fedora Core 1) I set LANG=C, which is necessary for some
other (non-R) things to work.
-roger
> ______________________________________________
> R-help at stat.math.ethz.ch mailing list
> https://stat.ethz.ch/mailman/listinfo/r-help
> PLEASE do read the posting guide! http://www.R-project.org/posting-guide.html
>
I do get (R-1.9.1 on WinXPPro):

> x <- c(LETTERS[1:3], paste(" ", LETTERS[1:3], sep=""))
> sort(x)
[1] " A" " B" " C" "A" "B" "C"

The `right' sequence is locale dependent, and I wonder if that's the
discrepancy here.

Andy
Thank you very much to Brian Ripley, Roger Peng, and Andy Liaw. Everyone
pointed out the same solution.
Setting
LC_COLLATE=C
did it. This default setting is indeed odd to me.
Thanks again!
Andreas Krause
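A note for readers on much newer R versions than the 1.9.1 discussed in this thread: as far as I know, since R 3.3.0 `sort()` accepts `method = "radix"`, which is documented to collate as in the C locale regardless of LC_COLLATE. The sketch below assumes such a version; `sorted_radix` is just an illustrative name.

```r
# Assumes R >= 3.3.0, where sort() accepts method = "radix".
x <- c(LETTERS[1:3], paste(" ", LETTERS[1:3], sep = ""))

# Radix sorting is documented to follow C-locale collation,
# whatever LC_COLLATE is currently set to.
sorted_radix <- sort(x, method = "radix")
print(sorted_radix)

# Cross-check against the character codes: a space is 32, "A" is 65,
# so the space-prefixed strings sort first in byte order.
utf8ToInt(" ")
utf8ToInt("A")
```

This leaves the session locale untouched, so it may be handy when only one particular sort needs ASCII ordering.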