[Ricardo Rodriguez] Your XEN ICT Team
2008-Apr-15 14:38 UTC
[R] a question of alphabetical order
Hi all,

In Spanish, accented vowels such as á, é, ... do not affect the alphabetical order of a vector of strings. I mean, whether a letter is a or á does not matter for establishing the alphabetical order.

Nevertheless, while working with R's order(), here is what I get.

Given a file transport.txt:

    medio#variable
    avión#34
    barco#33
    bicicleta#3
    ángulo#37
    camión#54
    coche#23
    tren#67

    > toPlot <- read.csv("~/Desktop/Workplace/transport.txt",header=TRUE,sep="#")
    > toPlot[order(toPlot$medio),]
          medio variable
    1     avión       34
    2     barco       33
    3 bicicleta        3
    5    camión       54
    6     coche       23
    7      tren       67
    4    ángulo       37

I expected ángulo to appear in first place, since n (in ángulo) goes before v (in avión) and á/a doesn't matter for alphabetical order. But ángulo appears in the last position.

Here is my environment:

    > sessionInfo()
    R version 2.7.0 beta (2008-04-12 r45280)
    i386-apple-darwin9.2.2

    locale:
    es_ES.UTF-8/es_ES.UTF-8/C/C/es_ES.UTF-8/es_ES.UTF-8

    attached base packages:
    [1] stats graphics grDevices utils datasets methods base

    > version
                   _
    platform       i386-apple-darwin9.2.2
    arch           i386
    os             darwin9.2.2
    system         i386, darwin9.2.2
    status         beta
    major          2
    minor          7.0
    year           2008
    month          04
    day            12
    svn rev        45280
    language       R
    version.string R version 2.7.0 beta (2008-04-12 r45280)

Is it not possible to get this data frame ordered correctly in Spanish? Other programs (Excel, for instance) do order it correctly.

Thanks for your help,

Ricardo

--
Ricardo Rodríguez
Your XEN ICT Team
This is a known Mac OS X bug, nothing to do with R which uses the system functions (strcoll/wcscoll) for such things.

If you look at the help for sort, it refers you to ?Comparison. Which says

    Comparison of strings in character vectors is lexicographic within
    the strings using the collating sequence of the locale in use: see
    'locales'. The collating sequence of locales such as 'en_US' is
    normally different from 'C' (which should use ASCII) and can be
    surprising. Beware of making _any_ assumptions about the collation
    order: e.g. in Estonian 'Z' comes between 'S' and 'T', and collation
    is not necessarily character-by-character - in Danish 'aa' sorts as
    a single letter, after 'z'. Some platforms may not respect the
    locale and always sort in ASCII. (String comparison is always for
    the part of the string up to the first nul if there are embedded
    nuls.)

Mac OS X (more specifically, 10.5.2 on i386) is one of those disrespectful platforms.

    > x <- intToUtf8(c(32:127, 160:255), multiple=T)
    > order(x)
      [1]   1   2   3   4   5   6   7   8   9  10  11  12  13  14  15  16  17  18
     [19]  19  20  21  22  23  24  25  26  27  28  29  30  31  32  33  34  35  36
     [37]  37  38  39  40  41  42  43  44  45  46  47  48  49  50  51  52  53  54
     [55]  55  56  57  58  59  60  61  62  63  64  65  66  67  68  69  70  71  72
     [73]  73  74  75  76  77  78  79  80  81  82  83  84  85  86  87  88  89  90
     [91]  91  92  93  94  95  96  97  98  99 100 101 102 103 104 105 106 107 108
    [109] 109 110 111 112 113 114 115 116 117 118 119 120 121 122 123 124 125 126
    [127] 127 128 129 130 131 132 133 134 135 136 137 138 139 140 141 142 143 144
    [145] 145 146 147 148 149 150 151 152 153 154 155 156 157 158 159 160 161 162
    [163] 163 164 165 166 167 168 169 170 171 172 173 174 175 176 177 178 179 180
    [181] 181 182 183 184 185 186 187 188 189 190 191 192

which is quite different from Linux or Solaris. This may not come out, but paste(sort(x), collapse="") includes

    aA???????????????bBcC??dDeE????????

on Linux in es_ES.utf8.

Platforms are a lot worse at sorting in UTF-8 than 8-bit encodings.
Mac OS X has es_ES.ISO8859-15, and that does do a reasonable job including a??????? .
--
Brian D. Ripley,                  ripley at stats.ox.ac.uk
Professor of Applied Statistics,  http://www.stats.ox.ac.uk/~ripley/
University of Oxford,             Tel:  +44 1865 272861 (self)
1 South Parks Road,                     +44 1865 272866 (PA)
Oxford OX1 3TG, UK                Fax:  +44 1865 272595
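
For readers who hit the same problem on a platform whose UTF-8 collation cannot be trusted, one possible workaround is to sort on an accent-free key instead of relying on the locale. The following is only a minimal sketch, not something proposed in the thread: strip_accents is a hypothetical helper (it is not part of base R), the data frame is retyped from the question rather than read from transport.txt, and method = "radix" assumes R 3.3.0 or later, where it orders character vectors as in the C locale.

```r
# Hypothetical helper (not in base R): map accented Spanish vowels to
# their plain counterparts so they no longer influence the sort order.
strip_accents <- function(x) {
  chartr("\u00e1\u00e9\u00ed\u00f3\u00fa\u00fc\u00c1\u00c9\u00cd\u00d3\u00da\u00dc",
         "aeiouuAEIOUU", x)
}

# Data retyped from the question; \u escapes keep the source ASCII-only.
toPlot <- data.frame(
  medio = c("avi\u00f3n", "barco", "bicicleta", "\u00e1ngulo",
            "cami\u00f3n", "coche", "tren"),
  variable = c(34, 33, 3, 37, 54, 23, 67),
  stringsAsFactors = FALSE
)

# Build a lower-case, accent-free key and order on it byte-wise,
# so the result does not depend on the platform's collation tables.
key <- tolower(strip_accents(toPlot$medio))
sorted <- toPlot[order(key, method = "radix"), ]
print(sorted)
```

With this key, ángulo sorts before avión as the original poster expected. The sketch deliberately leaves ñ untouched; in a byte-wise sort it lands after z, so words containing ñ would need extra handling to follow Spanish dictionary order.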
[Ricardo Rodriguez] Your XEN ICT Team
2008-Apr-16 12:52 UTC
[R] a question of alphabetical order
Hans-Joerg Bibiko wrote:
> Hello,
>
> Many thanks!
> This is new to me. I learnt Spanish a bit - well - 20 years ago ;)
> But this simplifies it.

This change happened just 14 years ago! So you are not guilty!

> Regards
>
> Hans

Kind regards! "Read" you in Spanish whenever you want!

Ricardo

--
Ricardo Rodríguez
Your XEN ICT Team