thr3ads.net - R help - [R] Sorting of character vectors [Nov 2016]

Pascal A. Niklaus

2016-Nov-08 12:18 UTC

[R] Sorting of character vectors

I just got caught by the way in which character vectors are sorted.

It seems that on my machine "sort" (and related functions like "order")
only consider characters related to punctuation (at least here the "+"
and "-") when there is no difference in the remaining characters:

 > x1 <- c("-A","+A")
 > x2 <- c("+A","-A")
 > sort(x1)    # sorting is according to "-" and "+"
[1] "-A" "+A"
 > sort(x2)
[1] "-A" "+A"

 > x3 <- c("-Aa","-Ab")
 > x4 <- c("-Aa","+Ab")
 > x5 <- c("+Aa","-Ab")
 > sort(x3)
[1] "-Aa" "-Ab" # here the "+" and "-" are ignored
 > sort(x4)
[1] "-Aa" "+Ab"
 > sort(x5)
[1] "+Aa" "-Ab"

I understand from the help that this depends on how characters are 
collated, and that this scheme follows the multi-level comparison in 
unicode (http://www.unicode.org/reports/tr10/).

However, what I need is a strict left-to-right comparison of the sort 
provided by strcmp or wcscmp in glibc. The particular ordering of 
special characters is not so important, but there should be no 
"multi-level" aspect to the sorting.

Is there a way to achieve this in R?

Thanks for your help

Pascal

peter dalgaard

2016-Nov-08 13:36 UTC

On 08 Nov 2016, at 13:18, Pascal A. Niklaus <pascal.niklaus at ieu.uzh.ch> wrote:
> I just got caught by the way in which character vectors are sorted.
> 
> It seems that on my machine "sort" (and related functions like "order")
> only consider characters related to punctuation (at least here the "+"
> and "-") when there is no difference in the remaining characters:
> 
> > x1 <- c("-A","+A")
> > x2 <- c("+A","-A")
> > sort(x1)    # sorting is according to "-" and "+"
> [1] "-A" "+A"
> > sort(x2)
> [1] "-A" "+A"
> 
> > x3 <- c("-Aa","-Ab")
> > x4 <- c("-Aa","+Ab")
> > x5 <- c("+Aa","-Ab")
> > sort(x3)
> [1] "-Aa" "-Ab" # here the "+" and "-" are ignored
> > sort(x4)
> [1] "-Aa" "+Ab"
> > sort(x5)
> [1] "+Aa" "-Ab"
> 
> I understand from the help that this depends on how characters are
> collated, and that this scheme follows the multi-level comparison in
> unicode (http://www.unicode.org/reports/tr10/).
> 
> However, what I need is a strict left-to-right comparison of the sort
> provided by strcmp or wcscmp in glibc. The particular ordering of
> special characters is not so important, but there should be no
> "multi-level" aspect to the sorting.
> 
> Is there a way to achieve this in R?
> 
I'd try one of two ways (the above is not happening for me, so I cannot test):

(1) Temporarily set the locale to "C": Sys.setlocale("LC_COLLATE", "C").
That should work as long as you stay in good ol' ASCII.
(2) Figure out (don't look at me!) how to diddle the ICU settings for your
system; icuSetCollate() is claimed to be your friend.
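
As a minimal sketch of option (1), assuming plain ASCII input: the helper name sort_c_locale below is hypothetical and not part of R. It switches LC_COLLATE to "C" only for the duration of one sort and then restores the previous setting. The last line shows an alternative not mentioned above: sort(method = "radix"), available since R 3.3.0, is documented to always order character vectors as in the C locale, whatever the session's collation setting is.

```r
# Hypothetical helper: sort with LC_COLLATE temporarily set to "C".
sort_c_locale <- function(x) {
  old <- Sys.getlocale("LC_COLLATE")
  # Restore the previous collation locale even if sort() fails.
  on.exit(Sys.setlocale("LC_COLLATE", old), add = TRUE)
  Sys.setlocale("LC_COLLATE", "C")
  sort(x)
}

x <- c("-Ab", "+Aa", "-A", "+A")

# Strict left-to-right comparison: "+" (code 43) sorts before "-" (code 45).
sort_c_locale(x)
# [1] "+A"  "+Aa" "-A"  "-Ab"

# Radix sort ignores the collation locale for character vectors.
sort(x, method = "radix")
# [1] "+A"  "+Aa" "-A"  "-Ab"
```

The radix variant avoids touching global state, which may matter inside package code or long-running sessions.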

-pd

> Thanks for your help
> 
> Pascal
> 
-- 
Peter Dalgaard, Professor,
Center for Statistics, Copenhagen Business School
Solbjerg Plads 3, 2000 Frederiksberg, Denmark
Phone: (+45)38153501
Office: A 4.23
Email: pd.mes at cbs.dk  Priv: PDalgd at gmail.com

Rui Barradas

2016-Nov-08 13:43 UTC

Hello,

What is your sessionInfo()?
With me it works as expected:

 > sort(c("-", "+"))
[1] "-" "+"
 > sort(c("+", "-"))
[1] "-" "+"
 > x5 <- c("+Aa","-Ab")
 > sort(x5)
[1] "-Ab" "+Aa"

 > sessionInfo()
R version 3.3.2 (2016-10-31)
Platform: x86_64-w64-mingw32/x64 (64-bit)
Running under: Windows 7 x64 (build 7601) Service Pack 1

locale:
[1] LC_COLLATE=Portuguese_Portugal.1252  LC_CTYPE=Portuguese_Portugal.1252
[3] LC_MONETARY=Portuguese_Portugal.1252 LC_NUMERIC=C
[5] LC_TIME=Portuguese_Portugal.1252

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base

Hope this helps,

Rui Barradas



Pascal A. Niklaus

2016-Nov-08 14:52 UTC

Thanks for all suggestions.

With my build (from the CRAN repo) I don't get ICU support, and setting 
LC_COLLATE to "C" did not help.

 > capabilities("ICU")
   ICU
FALSE

 > sessionInfo()
R version 3.3.2 (2016-10-31)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: Ubuntu 14.04.5 LTS

locale:
 [1] LC_CTYPE=en_DK.UTF-8       LC_NUMERIC=C
 [3] LC_TIME=en_DK.UTF-8        LC_COLLATE=C
 [5] LC_MONETARY=en_DK.UTF-8    LC_MESSAGES=en_DK.UTF-8
 [7] LC_PAPER=en_DK.UTF-8       LC_NAME=C
 [9] LC_ADDRESS=C               LC_TELEPHONE=C
[11] LC_MEASUREMENT=en_DK.UTF-8 LC_IDENTIFICATION=C

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base

loaded via a namespace (and not attached):
[1] compiler_3.3.2 tools_3.3.2



However, using stringi::stri_sort did the trick!
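
The message does not say how stri_sort was called, so the following is only a sketch of what such a call might look like, assuming the stringi package is installed. The locale "en" and the test vector are illustrative choices. stringi bundles its own ICU collator, so it does not depend on capabilities("ICU"). The alternate_shifted option of stri_opts_collator() controls whether punctuation is skipped at the first comparison levels, and it is off by default.

```r
# Assumes stringi is installed: install.packages("stringi")
x <- c("+Aa", "-Ab", "-Aa", "+Ab")

if (requireNamespace("stringi", quietly = TRUE)) {
  # Default collator options: punctuation is compared like any other
  # character, so "+" and "-" take part in the left-to-right ordering.
  print(stringi::stri_sort(x, locale = "en"))

  # For contrast: with alternate_shifted = TRUE punctuation is ignored
  # at the first comparison levels, giving a multi-level ordering.
  opts <- stringi::stri_opts_collator(locale = "en", alternate_shifted = TRUE)
  print(stringi::stri_sort(x, opts_collator = opts))
}
```

Note that ICU's ordering of the punctuation characters themselves need not match their byte order in the C locale. The original question states that this particular ordering is not important.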
