Dear Peter,
Thanks for the feedback on the locale. Is there a better alternative to
the C locale, one that yields a consistent and stable sort order
independent of the R version and OS?
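
One direction that might work is to avoid locale collation altogether and
order the strings by their UTF-8 bytes. The sketch below is only an
illustration of that idea: `utf8_byte_order()` is a hypothetical helper, not
part of git2rdata or base R, and the resulting order is not expected to match
any of the outputs quoted below. It assumes every string can be converted to
UTF-8.

```r
# Hypothetical helper: order strings by their UTF-8 byte sequence.
utf8_byte_order <- function(x) {
  key <- vapply(
    enc2utf8(x),
    function(s) {
      if (is.na(s)) return(NA_character_)
      # two hex digits per byte, so comparing keys compares the bytes
      paste(as.character(charToRaw(s)), collapse = "")
    },
    character(1),
    USE.NAMES = FALSE
  )
  # the keys are plain ASCII; method = "radix" ignores the session locale
  order(key, method = "radix", na.last = TRUE)
}

x <- c("a", "\u00e9", "&", "\u00e0", "\u00b5", "\u00e7", "\u20ac", "|", "@", NA)
sorted <- x[utf8_byte_order(x)]
print(sorted)
```

Because the sort key is built from the bytes and contains only the characters
0-9 and a-f, the result should not depend on LC_COLLATE or on how a given R
version treats non-ASCII strings in the C locale.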
Best regards,
Thierry
ir. Thierry Onkelinx
Statisticus / Statistician
Vlaamse Overheid / Government of Flanders
INSTITUUT VOOR NATUUR- EN BOSONDERZOEK / RESEARCH INSTITUTE FOR NATURE AND
FOREST
Team Biometrie & Kwaliteitszorg / Team Biometrics & Quality Assurance
thierry.onkelinx at inbo.be
Havenlaan 88 bus 73, 1000 Brussel
www.inbo.be
///////////////////////////////////////////////////////////////////////////////////////////
To call in the statistician after the experiment is done may be no more
than asking him to perform a post-mortem examination: he may be able to say
what the experiment died of. ~ Sir Ronald Aylmer Fisher
The plural of anecdote is not data. ~ Roger Brinner
The combination of some data and an aching desire for an answer does not
ensure that a reasonable answer can be extracted from a given body of data.
~ John Tukey
///////////////////////////////////////////////////////////////////////////////////////////
On Tue, 19 Jan 2021 at 13:20, Peter Dalgaard <pdalgd at gmail.com> wrote:
> Not sure what happened between 4.0.2 and -devel, but you are using C
> collation, which assumes 7-bit single-byte characters, to sort multi-byte
> 8-bit encoded characters, which looks a bit risky.
>
> -pd
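
To make the point about bytes concrete: the same character is stored as
different bytes depending on the encoding, so a byte-wise comparison can give
different results for UTF-8 and latin1 input. A minimal sketch, using "é"
(U+00E9) as an arbitrary example character:

```r
x <- "\u00e9"  # é

# two bytes in UTF-8, one byte in latin1
utf8_bytes <- charToRaw(enc2utf8(x))
latin1_bytes <- charToRaw(iconv(x, from = "UTF-8", to = "latin1"))
print(utf8_bytes)    # c3 a9
print(latin1_bytes)  # e9

# the octal escape "\351" is the same single byte as hexadecimal e9
octal_byte <- as.raw(strtoi("351", base = 8L))
print(octal_byte)    # e9
```

The escapes "\265", "\340", "\347" and "\351" in the R-devel output quoted
below are single bytes of this kind.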
>
> > On 19 Jan 2021, at 10:10, Thierry Onkelinx via R-devel <r-devel at r-project.org> wrote:
> >
> > Dear all,
> >
> > My git2rdata package relies on stable sorting. I've noticed that
> > some characters get a different position under R-devel on Windows
> > 10. This is why the unit tests of my package fail only in this
> > combination
> > (https://cran.r-project.org/web/checks/check_results_git2rdata.html).
> >
> > Below is a minimal example to illustrate the problem.
> >
> > Best regards,
> >
> > Thierry
> >
> > data <- readLines(
> >   "https://raw.githubusercontent.com/ropensci/git2rdata/master/tests/testthat/test_b_special.R",
> >   encoding = "UTF-8", n = 15)
> > eval(parse(text = paste(tail(data, -3), collapse = "")))
> > ds$a <- enc2utf8(ds$a)
> > print(ds$a) # input
> > Sys.setlocale(locale = "C")
> > print(sort(ds$a)) # sorted
> > print(order(ds$a)) # order
> > print(sessionInfo())
> >
> > # input
> > ## Win 10 R 4.0.2
> >  [1] "a" "a b" "a\tb" "a\tb\tc" "\ta" "a\t" "a\nb"
> >  [8] "a\nb\nc" "\na" "a\n" "a\"b" "a\"b\"c" "\"b" "a\""
> > [15] "\"b\"" "a'b" "a'b'c" "'b" "a'" "'b'" "a b c"
> > [22] "\"NA\"" "'NA'" NA "é" "&" "à" "µ"
> > [29] "ç" "\200" "|" "#" "@" "$"
> > ## Win 10 R devel
> >  [1] "a" "a b" "a\tb" "a\tb\tc" "\ta" "a\t" "a\nb"
> >  [8] "a\nb\nc" "\na" "a\n" "a\"b" "a\"b\"c" "\"b" "a\""
> > [15] "\"b\"" "a'b" "a'b'c" "'b" "a'" "'b'" "a b c"
> > [22] "\"NA\"" "'NA'" NA "é" "&" "à" "µ"
> > [29] "ç" "\200" "|" "#" "@" "$"
> > ## Ubuntu 18.04 R 4.0.3
> >  [1] "a" "a b" "a\tb" "a\tb\tc" "\ta" "a\t" "a\nb"
> >  [8] "a\nb\nc" "\na" "a\n" "a\"b" "a\"b\"c" "\"b" "a\""
> > [15] "\"b\"" "a'b" "a'b'c" "'b" "a'" "'b'" "a b c"
> > [22] "\"NA\"" "'NA'" NA "é" "&" "à" "µ"
> > [29] "ç" "€" "|" "#" "@" "$"
> >
> > # sorted
> > ## Win 10 R 4.0.2
> >  [1] "\ta" "\na" "\"NA\"" "\"b" "\"b\"" "#" "$"
> >  [8] "&" "'NA'" "'b" "'b'" "<U+00B5>" "<U+00E0>" "<U+00E7>"
> > [15] "<U+00E9>" "<U+20AC>" "@" "a" "a\t" "a\tb" "a\tb\tc"
> > [22] "a\n" "a\nb" "a\nb\nc" "a b" "a b c" "a\"" "a\"b"
> > [29] "a\"b\"c" "a'" "a'b" "a'b'c" "|"
> > ## Win 10 R devel
> >  [1] "\ta" "\na" "\"NA\"" "\"b" "\"b\"" "#" "$"
> >  [8] "&" "'NA'" "'b" "'b'" "@" "a" "a\t"
> > [15] "a\tb" "a\tb\tc" "a\n" "a\nb" "a\nb\nc" "a b" "a b c"
> > [22] "a\"" "a\"b" "a\"b\"c" "a'" "a'b" "a'b'c" "|"
> > [29] "\200" "\265" "\340" "\347" "\351"
> > ## Ubuntu 18.04 R 4.0.3
> >  [1] "\ta" "\na" "\"NA\"" "\"b" "\"b\"" "#" "$"
> >  [8] "&" "'NA'" "'b" "'b'" "<U+00B5>" "<U+00E0>" "<U+00E7>"
> > [15] "<U+00E9>" "<U+20AC>" "@" "a" "a\t" "a\tb" "a\tb\tc"
> > [22] "a\n" "a\nb" "a\nb\nc" "a b" "a b c" "a\"" "a\"b"
> > [29] "a\"b\"c" "a'" "a'b" "a'b'c" "|"
> >
> > # order
> > ## Win 10 R 4.0.2
> >  [1]  5  9 22 13 15 32 34 26 23 18 20 28 27 29 25 30 33  1  6  3  4 10  7  8  2
> > [26] 21 14 11 12 19 16 17 31 24
> > ## Win 10 R devel
> >  [1]  5  9 22 13 15 32 34 26 23 18 20 33  1  6  3  4 10  7  8  2 21 14 11 12 19
> > [26] 16 17 31 30 28 27 29 25 24
> > ## Ubuntu 18.04 R 4.0.3
> >  [1]  5  9 22 13 15 32 34 26 23 18 20 28 27 29 25 30 33  1  6  3  4 10  7  8  2
> > [26] 21 14 11 12 19 16 17 31 24
> >
> > R version 4.0.2 (2020-06-22)
> > Platform: x86_64-w64-mingw32/x64 (64-bit)
> > Running under: Windows 10 x64 (build 18363)
> >
> > Matrix products: default
> >
> > locale:
> > [1] C
> > system code page: 1252
> >
> > attached base packages:
> > [1] stats graphics grDevices utils datasets methods base
> >
> > loaded via a namespace (and not attached):
> > [1] compiler_4.0.2 fortunes_1.5-4
> >
> > R Under development (unstable) (2021-01-13 r79826)
> > Platform: x86_64-w64-mingw32/x64 (64-bit)
> > Running under: Windows 10 x64 (build 18363)
> >
> > Matrix products: default
> >
> > locale:
> > [1] C
> >
> > attached base packages:
> > [1] stats graphics grDevices utils datasets methods base
> >
> > loaded via a namespace (and not attached):
> > [1] compiler_4.1.0
> >
> > R version 4.0.3 (2020-10-10)
> > Platform: x86_64-pc-linux-gnu (64-bit)
> > Running under: Ubuntu 18.04.5 LTS
> >
> > Matrix products: default
> > BLAS: /usr/lib/x86_64-linux-gnu/blas/libblas.so.3.7.1
> > LAPACK: /usr/lib/x86_64-linux-gnu/lapack/liblapack.so.3.7.1
> >
> > locale:
> > [1] LC_CTYPE=C LC_NUMERIC=C
> > [3] LC_TIME=C LC_COLLATE=C
> > [5] LC_MONETARY=C LC_MESSAGES=nl_BE.UTF-8
> > [7] LC_PAPER=nl_BE.UTF-8 LC_NAME=C
> > [9] LC_ADDRESS=C LC_TELEPHONE=C
> > [11] LC_MEASUREMENT=nl_BE.UTF-8 LC_IDENTIFICATION=C
> >
> > attached base packages:
> > [1] stats graphics grDevices utils datasets methods base
> >
> > loaded via a namespace (and not attached):
> > [1] compiler_4.0.3 fortunes_1.5-4
> >
> > ______________________________________________
> > R-devel at r-project.org mailing list
> > https://stat.ethz.ch/mailman/listinfo/r-devel
>
> --
> Peter Dalgaard, Professor,
> Center for Statistics, Copenhagen Business School
> Solbjerg Plads 3, 2000 Frederiksberg, Denmark
> Phone: (+45)38153501
> Office: A 4.23
> Email: pd.mes at cbs.dk Priv: PDalgd at gmail.com