Dear Ivan,

this is a very helpful explanation! I think it is important to make the output of localeToCharset() more predictable. My problem is essentially not how to set the locale so that things work after all; the problem is that you see unexpected results. I guess I owe a suggestion for how to improve the code, but your suggestion looks like a very good starting point.

Andreas

On 31.01.22, 12:32, "Ivan Krylov" <krylov.r00t at gmail.com> wrote:

    On Mon, 31 Jan 2022 09:56:27 +0000
    "Blätte, Andreas" <andreas.blaette at uni-due.de> wrote:

    > After starting R with a re-defined locale (`env LC_CTYPE=en_US.UTF-8
    > R`), the output of `localeToCharset()` is:
    > [1] "UTF-8" "ISO8859-1"

    > why ISO8859-1 might be a fallback option here?

    ISO8859-1 seems to be offered because it covers the alphabet of
    American English. Obviously, this doesn't guarantee that the guess is
    correct. For example, I could symlink the ru_RU.KOI8-R locale on my
    system to name it "ru_RU", and localeToCharset() would return
    "ISO8859-5", not knowing the correct answer. ??????, anyone?

    > Part of my analysis of the code of `localeToCharset()` is that it
    > targets special scenarios on Windows and macOS, but not on Linux.

    Well, it almost does the right thing. GNU/Linux locales are typically
    named like <language>_<country>.<encoding>, and localeToCharset()
    respects the <encoding> part, but only if the language and the country
    are specified. A quick fix for that would be to add one final case:

    Index: src/library/utils/R/iconv.R
    ===================================================================
    --- src/library/utils/R/iconv.R (revision 81596)
    +++ src/library/utils/R/iconv.R (working copy)
    @@ -135,6 +135,7 @@
                 if(enc == "utf8") return(c("UTF-8", guess(ll)))
                 else return(guess(ll))
             }
    +        if (enc == "utf8") return("UTF-8") # fallback for ???.UTF-8
             return(NA_character_)
         }
     }

    (Non-UTF-8 encodings on POSIX are handled above, in the if(nzchar(enc)
    && enc != "utf8") branch.)

    Maybe a better fix would be to restructure the code a bit, to always
    take the encoding hint and then also try to guess if the locale looks
    like it provides a language code.

    --
    Best regards,
    Ivan
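To make the "always take the encoding hint" idea concrete, here is a minimal sketch in R. It is not the code in src/library/utils/R/iconv.R and not a proposed patch: the function name encoding_hint() is made up for illustration, and the language-based guess that localeToCharset() performs is left out entirely. It only assumes that the locale name follows the common <language>_<country>.<encoding>[@modifier] shape, or at least carries a ".<encoding>" suffix.

```r
# Hypothetical helper (not part of R): return the encoding named in a
# locale string, regardless of whether a language code precedes it.
encoding_hint <- function(locale) {
    stopifnot(is.character(locale), length(locale) == 1L)
    locale <- sub("@.*$", "", locale)            # drop any "@modifier"
    if (!grepl(".", locale, fixed = TRUE))
        return(NA_character_)                    # no encoding part at all
    enc <- sub("^[^.]*\\.", "", locale)          # text after the first "."
    # Treat "UTF-8", "utf8", "utf-8" and similar spellings alike
    key <- tolower(gsub("-", "", enc, fixed = TRUE))
    if (key == "utf8") "UTF-8" else enc
}

encoding_hint("C.UTF-8")
encoding_hint("en_US.UTF-8")
encoding_hint("ru_RU.KOI8-R")
encoding_hint("ru_RU")        # NA: the name carries no hint
```

A restructured localeToCharset() could put such a hint first and append a language-based guess only when the name appears to contain a language code; whether that is the right design is for R core to judge.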
Andreas,

The output is very predictable, so this is not about predictability. Note that C.UTF-8 is technically an invalid locale by the semantic rules (see below). Also note that the C locale is "C": it is not allowed to have any string after the C (or else it is not the C locale), so what you have is NOT a C locale (see POSIX 7.2).

The issue here is that the POSIX standard provides no semantic rules: locale names can be arbitrary, and the only defined one is C (and its synonym POSIX). All others are arbitrary locales that can do whatever they want. Later, some systems introduced semantic guidelines such as the <language>_<territory>.<codeset> convention, and that is what localeToCharset() expects so that it can try to guess the charset for that language. Since C.UTF-8 is such an aberration (not in the standard form), localeToCharset() doesn't know about it and returns NA since it can't guess the language.

Long story short, C.UTF-8 breaks all common rules and has been introduced fairly recently on some Linux systems, so R does not know about it yet. Ivan's patch fixes that. That aside, locale names have no official provision to provide the charset, so all you get is a guess, assuming the system follows the common rules.

Cheers,
Simon

> On Feb 1, 2022, at 00:38, Blätte, Andreas <andreas.blaette at uni-due.de> wrote:
>
> Dear Ivan,
>
> this is a very helpful explanation! I think it is important to make output of localeToCharset() more predictable. My problem is essentially not to set the locale such that things will work after all. I think the problem is that you see unexpected results. I guess I owe a suggestion how to improve the code, but your suggestion looks like a very good starting point.
>
> Andreas
>
> ______________________________________________
> R-devel at r-project.org mailing list
> https://stat.ethz.ch/mailman/listinfo/r-devel
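As a rough illustration of the naming convention discussed in this thread, the following R sketch checks whether a locale name has the <language>_<territory>[.<codeset>][@<modifier>] shape. The function name follows_convention() and the regular expression are assumptions made for this example; the pattern is deliberately simplified (two- or three-letter lowercase language, two-letter uppercase territory) and is not what localeToCharset() uses internally.

```r
# Hypothetical helper: TRUE if a locale name looks like
# <language>_<territory>[.<codeset>][@<modifier>].
# perl = TRUE keeps the character ranges independent of the current locale.
follows_convention <- function(x) {
    grepl("^[a-z]{2,3}_[A-Z]{2}(\\.[^@]+)?(@.+)?$", x, perl = TRUE)
}

locales <- c("C", "POSIX", "C.UTF-8", "en_US.UTF-8", "ru_RU",
             "de_DE.ISO8859-15@euro")
data.frame(locale = locales, conventional = follows_convention(locales))
# "C", "POSIX" and "C.UTF-8" give FALSE; the other three give TRUE
```

Under this simplified pattern, "C.UTF-8" is neither the plain C locale nor a conventional name, so there is no language part from which a charset could be guessed.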
Hi Andreas,

is there still any higher-level problem left that you need to solve? Ideally one wouldn't need to query what the native encoding is, but would directly use iconv(), or indirectly other R functions, to convert the data from/to the native encoding. iconv() will find out internally what the native encoding is (via data that is also available through l10n_info(), but with care for differences between OSes).

Best
Tomas

On 1/31/22 12:38, Blätte, Andreas wrote:
> Dear Ivan,
>
> this is a very helpful explanation! I think it is important to make output of localeToCharset() more predictable. My problem is essentially not to set the locale such that things will work after all. I think the problem is that you see unexpected results. I guess I owe a suggestion how to improve the code, but your suggestion looks like a very good starting point.
>
> Andreas
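A small sketch of converting without ever asking for the name of the native encoding: iconv() accepts "" for the encoding of the current locale, so the locale name never has to be parsed. The wrapper names to_native() and from_native() are invented for this example, and sub = "byte" is just one possible choice for characters the native encoding cannot represent.

```r
# Hypothetical wrappers around iconv(); "" stands for the encoding of the
# current locale, so localeToCharset() is never consulted.
to_native <- function(x) iconv(x, from = "UTF-8", to = "", sub = "byte")
from_native <- function(x) iconv(x, from = "", to = "UTF-8")

x <- "na\u00efve"           # a UTF-8 string with one non-ASCII character
y <- to_native(x)           # result depends on the session's locale
from_native(y)

# What R itself knows about the session's encoding
str(l10n_info())
```

enc2native() and enc2utf8() cover the same two directions for strings whose encoding is already marked.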