Dear R users,

I'm a bit perplexed by the effect sort has here, as it is different on Windows vs. linux. It makes my factor levels and subsequent plots different on the two systems.

Given:

types <- c("PC-D-Euro-0", "PC-D-Euro-1", "PC-D-Euro-2", "PC-D-Euro-3",
"PC-D-Euro-4", "PC-D-Euro-5", "PC-D-Euro-6", "LCV-D-Euro-0",
"LCV-D-Euro-1", "LCV-D-Euro-2", "LCV-D-Euro-3", "LCV-D-Euro-4",
"LCV-D-Euro-5", "LCV-D-Euro-6", "HGV-D-Euro-0", "HGV-D-Euro-I",
"HGV-D-Euro-II", "HGV-D-Euro-III", "HGV-D-Euro-IV EGR", "HGV-D-Euro-IV SCR",
"HGV-D-Euro-IV SCRb", "HGV-D-Euro-V EGR", "HGV-D-Euro-V SCR",
"HGV-D-Euro-V SCRb", "HGV-D-Euro-VI", "HGV-D-Euro-VIb")

On linux, sort does:

sort(types)
 [1] "HGV-D-Euro-0"       "HGV-D-Euro-I"       "HGV-D-Euro-II"
 [4] "HGV-D-Euro-III"     "HGV-D-Euro-IV EGR"  "HGV-D-Euro-IV SCR"
 [7] "HGV-D-Euro-IV SCRb" "HGV-D-Euro-V EGR"   "HGV-D-Euro-VI"
[10] "HGV-D-Euro-VIb"     "HGV-D-Euro-V SCR"   "HGV-D-Euro-V SCRb"
[13] "LCV-D-Euro-0"       "LCV-D-Euro-1"       "LCV-D-Euro-2"
[16] "LCV-D-Euro-3"       "LCV-D-Euro-4"       "LCV-D-Euro-5"
[19] "LCV-D-Euro-6"       "PC-D-Euro-0"        "PC-D-Euro-1"
[22] "PC-D-Euro-2"        "PC-D-Euro-3"        "PC-D-Euro-4"
[25] "PC-D-Euro-5"        "PC-D-Euro-6"

And on Windows:

sort(types)
 [1] "HGV-D-Euro-0"       "HGV-D-Euro-I"       "HGV-D-Euro-II"
 [4] "HGV-D-Euro-III"     "HGV-D-Euro-IV EGR"  "HGV-D-Euro-IV SCR"
 [7] "HGV-D-Euro-IV SCRb" "HGV-D-Euro-V EGR"   "HGV-D-Euro-V SCR"
[10] "HGV-D-Euro-V SCRb"  "HGV-D-Euro-VI"      "HGV-D-Euro-VIb"
[13] "LCV-D-Euro-0"       "LCV-D-Euro-1"       "LCV-D-Euro-2"
[16] "LCV-D-Euro-3"       "LCV-D-Euro-4"       "LCV-D-Euro-5"
[19] "LCV-D-Euro-6"       "PC-D-Euro-0"        "PC-D-Euro-1"
[22] "PC-D-Euro-2"        "PC-D-Euro-3"        "PC-D-Euro-4"
[25] "PC-D-Euro-5"        "PC-D-Euro-6"

Session info for both systems is below. The order I actually want is the Windows one, but looking at it, the linux order is perhaps more intuitive. However, the problem is that the order is inconsistent between the two systems. Any suggestions?
sessionInfo()
R version 2.11.0 (2010-04-22)
x86_64-pc-linux-gnu

locale:
 [1] LC_CTYPE=en_GB.utf8       LC_NUMERIC=C
 [3] LC_TIME=en_GB.utf8        LC_COLLATE=en_GB.utf8
 [5] LC_MONETARY=en_GB.utf8    LC_MESSAGES=en_GB.utf8
 [7] LC_PAPER=en_GB.utf8       LC_NAME=en_GB.utf8
 [9] LC_ADDRESS=en_GB.utf8     LC_TELEPHONE=en_GB.utf8
[11] LC_MEASUREMENT=en_GB.utf8 LC_IDENTIFICATION=en_GB.utf8

attached base packages:
[1] stats graphics grDevices utils datasets methods base

other attached packages:
[1] rkward_0.5.3

loaded via a namespace (and not attached):
[1] tools_2.11.0

> sessionInfo()
R version 2.11.0 (2010-04-22)
x86_64-pc-mingw32

locale:
[1] LC_COLLATE=English_United Kingdom.1252
[2] LC_CTYPE=English_United Kingdom.1252
[3] LC_MONETARY=English_United Kingdom.1252
[4] LC_NUMERIC=C
[5] LC_TIME=English_United Kingdom.1252

attached base packages:
[1] stats graphics grDevices utils datasets methods base

Dr David Carslaw
King's College London
Environmental Research Group
Franklin Wilkins Building
150 Stamford Street
London SE1 9NH

--
View this message in context: http://r.789695.n4.nabble.com/difference-in-sort-order-linux-Windows-R-2-11-0-tp2234251p2234251.html
Sent from the R help mailing list archive at Nabble.com.
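Since the concern in the post is factor levels and plot order, one way to sidestep sort() altogether is to spell the level order out by hand. The following is only a minimal sketch using a handful of the types from the post; the names x, wanted and f are hypothetical and chosen for illustration.

```r
# A few of the vehicle types from the post, in arbitrary order
x <- c("HGV-D-Euro-VI", "HGV-D-Euro-V SCR", "PC-D-Euro-0", "HGV-D-Euro-V EGR")

# Desired level order, written out explicitly instead of relying on sort()
# (hypothetical choice: the "Windows" order from the post)
wanted <- c("HGV-D-Euro-V EGR", "HGV-D-Euro-V SCR", "HGV-D-Euro-VI", "PC-D-Euro-0")

# factor() keeps the levels in the order given, whatever the locale
f <- factor(x, levels = wanted)
levels(f)
```

Because the levels are given explicitly, the collation order of the current locale should play no part in the result, so the same code should give the same factor on both systems.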
On 28-May-10 08:17:49, carslaw wrote:
> Dear R users,
>
> I'm a bit perplexed with the effect sort has here, as it is different
> on Windows vs. linux.
> It makes my factor levels and subsequent plots different on the two
> systems.
> ...
> Any suggestions?
> ...

I suspect the result (in Linux, I can't test this on Windows) may be related to the following phenomenon:

sort(c("AB CD","ABCD"))
# [1] "ABCD" "AB CD"
sort(c("AB CD","ABCD "))
# [1] "AB CD" "ABCD "

I.e. "ABCD" precedes "AB CD" apparently because it is shorter, despite the fact that it would come later in an alphabetical sort.

If I use the Linux 'sort' command (on the same machine) I get:

sort << EOT
"AB CD"
"ABCD"
EOT
"AB CD"
"ABCD"

sort << EOT
"AB CD"
"ABCD "
EOT
"AB CD"
"ABCD "

I.e. the same result for either case. In my view the R result is anomalous!

In ?Comparison it is stated that characters are translated to UTF8 before comparison is done; so a possible explanation could be that the UTF8 encoding for SPACE (for all I know) may be greater than that for the letters of the alphabet (as opposed to ASCII, where -- I do know -- it is less). And, if that is the case, why doesn't it apply also in Windows?

This strikes me as a nasty little trap!

Ted.
--------------------------------------------------------------------
E-Mail: (Ted Harding) <Ted.Harding at manchester.ac.uk>
Fax-to-email: +44 (0)870 094 0861
Date: 28-May-10 Time: 10:55:33
------------------------------ XFMail ------------------------------
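The conjecture above about the UTF-8 encoding of SPACE can be checked directly from R. The sketch below uses only base R (utf8ToInt and charToRaw); the variable names are hypothetical. For ASCII characters UTF-8 uses the same values as ASCII, so SPACE still comes out below the letters, which suggests the encoding alone does not explain the ordering.

```r
# Code point of SPACE and of the letters A-D
space_cp  <- utf8ToInt(" ")      # 32
letter_cp <- utf8ToInt("ABCD")   # 65 66 67 68

# SPACE is lower than every letter, exactly as in ASCII
all(space_cp < letter_cp)

# The raw bytes of the test string, shown in hex
charToRaw("AB CD")               # 41 42 20 43 44
```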
Thanks Ted,

Indeed, there is a difference between the systems on your much-simplified example (thanks). So, linux:

sort(c("AB CD","ABCD"))
[1] "ABCD" "AB CD"

Windows:

sort(c("AB CD","ABCD"))
[1] "AB CD" "ABCD"

Regards,
David
Duncan Murdoch
2010-May-28 10:47 UTC
[R] difference in sort order linux/Windows (R.2.11.0)
carslaw wrote:
> Dear R users,
>
> I'm a bit perplexed with the effect sort has here, as it is different on
> Windows vs. linux.
> It makes my factor levels and subsequent plots different on the two systems.
> ...

You are using different collation orders. On Linux, your sessionInfo shows

en_GB.utf8

while Windows shows

English_United Kingdom.1252

so you should be prepared for differences. That said, it certainly looks as though the string comparison is wrong on Linux. Using Ted Harding's examples, I get these results:

> "AB CD" > "ABCD"
[1] FALSE
> "AB CD" > "ABCD "
[1] FALSE

on Windows in the English_Canada.1252 locale and on Linux in the C locale. However, when I use the locale that's default on our system, en_US.UTF-8, I get

> "AB CD" > "ABCD"
[1] TRUE
> "AB CD" > "ABCD "
[1] FALSE

as Ted did, and that certainly looks wrong.

Duncan Murdoch
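The C-locale comparison reported above can be reproduced on any system by switching LC_COLLATE temporarily. This is a minimal sketch with base R's Sys.getlocale and Sys.setlocale; the names old and res_c are hypothetical, and the previous collation setting is restored at the end.

```r
# Remember the current collation locale so it can be restored
old <- Sys.getlocale("LC_COLLATE")

# Switch string comparison to the C locale (plain byte order)
invisible(Sys.setlocale("LC_COLLATE", "C"))

# Ted Harding's two comparisons; both are FALSE in the C locale
res_c <- c("AB CD" > "ABCD", "AB CD" > "ABCD ")
res_c

# Put the original collation locale back
invisible(Sys.setlocale("LC_COLLATE", old))
```

Running the same two comparisons before the switch shows what the session's own locale does, which is the part that differs between systems.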
Pretty obvious: you are using different locales (collation). What happens if you use the same one on both machines?

Cheers
Joris

On Fri, May 28, 2010 at 10:17 AM, carslaw <david.carslaw@kcl.ac.uk> wrote:
> Dear R users,
>
> I'm a bit perplexed with the effect sort has here, as it is different on
> ...
> the linux order is perhaps more intuitive. However, the problem is the
> order is inconsistent between
> the two systems. Any suggestions?
> ...

--
Joris Meys
Statistical Consultant

Ghent University
Faculty of Bioscience Engineering
Department of Applied mathematics, biometrics and process control

Coupure Links 653
B-9000 Gent

tel : +32 9 264 59 87
Joris.Meys@Ugent.be
-------------------------------
Disclaimer : http://helpdesk.ugent.be/e-maildisclaimer.php
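Using the same collation on both machines could be sketched as a small wrapper that sorts under the C locale and then restores the previous setting. The helper name sort_c is hypothetical, not an existing R function. The last line relies on an assumption about the R version: method = "radix", which sorts character vectors in C-locale order, is only available in R 3.3.0 and later, not in the R 2.11.0 used in this thread.

```r
# Hypothetical helper: sort a character vector using C-locale collation
sort_c <- function(x) {
  old <- Sys.getlocale("LC_COLLATE")
  # Restore the caller's collation locale even if sort() fails
  on.exit(Sys.setlocale("LC_COLLATE", old), add = TRUE)
  Sys.setlocale("LC_COLLATE", "C")
  sort(x)
}

hgv <- c("HGV-D-Euro-VI", "HGV-D-Euro-V SCR", "HGV-D-Euro-VIb",
         "HGV-D-Euro-V EGR", "HGV-D-Euro-V SCRb")

# SPACE sorts before "I" in the C locale, giving the "Windows" order
sorted_c <- sort_c(hgv)
sorted_c

# Assumption: R >= 3.3.0, where radix sorting always uses C-locale order
sorted_radix <- sort(hgv, method = "radix")
```

Either form should give the same order on Linux and Windows, because neither depends on the session's LC_COLLATE.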
Steven Lembark
2010-May-30 15:20 UTC
[R] difference in sort order linux/Windows (R.2.11.0)
On Fri, 28 May 2010 01:17:49 -0700 (PDT) carslaw <david.carslaw at kcl.ac.uk> wrote:

> [4] "HGV-D-Euro-III" "HGV-D-Euro-IV EGR" "HGV-D-Euro-IV SCR"
> [4] "HGV-D-Euro-III" "HGV-D-Euro-IV EGR" "HGV-D-Euro-IV SCR"

> [7] "HGV-D-Euro-IV SCRb" "HGV-D-Euro-V EGR" "HGV-D-Euro-VI"
> [7] "HGV-D-Euro-IV SCRb" "HGV-D-Euro-V EGR" "HGV-D-Euro-V SCR"

This is a lexical sort. Depending on the locale the items may not sort in ASCII order. For example, a European-latin locale may have some letters in different places than ASCII. You have to check what is being sorted (e.g., map the stuff to UTF8 binary).

You might also find that input generated on Windows has "smart spaces" in it from the generating program (e.g., Excel) that are something like \xA0 instead of the \x20 (decimal 32) used for ASCII spaces.

Suggestion: Validate the data with something like "od -cx" on linux so you know what you are sorting. Then dump it out as hex in R [sorry, I have no idea how to do that] and see if what you are sorting matches. After that validate the LOCALE setting on both sides.

If all of those turn up the same raw data then you've found a bug in R -- or at least need to read some fine print in the lexical sort docs.

--
Steven Lembark
85-09 90th St.
Workhorse Computing
Woodhaven, NY, 11421
lembark at wrkhors.com
+1 888 359 3508
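For the hex dump in R mentioned above, base R's charToRaw prints the bytes of a string in hex, and utf8ToInt gives its code points. The sketch below is illustrative only: the string s_nbsp is a made-up example containing a non-breaking space, and has_nbsp is a hypothetical helper, not an R built-in.

```r
# One of the types from the post, with an ordinary ASCII space
s_ascii <- "HGV-D-Euro-V SCR"

# Made-up variant with a non-breaking space (U+00A0) in place of the space
s_nbsp <- "HGV-D-Euro-V\u00a0SCR"

# Hex dump of the bytes; the ordinary space shows up as 20
charToRaw(s_ascii)
charToRaw(enc2utf8(s_nbsp))

# Hypothetical helper: TRUE if a single string contains U+00A0 (decimal 160)
has_nbsp <- function(x) 160L %in% utf8ToInt(enc2utf8(x))

has_nbsp(s_ascii)
has_nbsp(s_nbsp)
```

Applying has_nbsp to each element, for example with sapply(types, has_nbsp), would flag any entries whose spaces are not plain ASCII.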