Dear all,

packages for processing text may need information on the charset of the R session. In my packages RcppCWB and polmineR, I extract this information from the locale using `localeToCharset()`. But when running cross-platform checks (GitHub Actions and Docker), I repeatedly encounter unexpected behavior of `localeToCharset()`.

As a reproducible example, I suggest using a local Fedora (latest) container, started as follows:

    docker pull fedora:latest
    docker run -it fedora:latest /bin/bash

After installing R (`yum install -y R`) and starting R, `localeToCharset()` returns `NA`. However, the locale part of `sessionInfo()` is as follows:

     [1] LC_CTYPE=C.UTF-8           LC_NUMERIC=C
     [3] LC_TIME=C.UTF-8            LC_COLLATE=C.UTF-8
     [5] LC_MONETARY=C.UTF-8        LC_MESSAGES=C.UTF-8
     [7] LC_PAPER=C.UTF-8           LC_NAME=C
     [9] LC_ADDRESS=C               LC_TELEPHONE=C
    [11] LC_MEASUREMENT=C.UTF-8     LC_IDENTIFICATION=C

If I run R CMD check on an arbitrary package in this environment at this stage, I see:

    * using session charset: UTF-8

The documentation, however, says: “In the C locale the answer will be "ASCII".” Why not UTF-8 in this case?

The `localeToCharset()` function also confuses me when I explicitly re-define the locale. In my fresh Fedora Docker container, I need to install English-language locales first:

    dnf install langpacks-en

After starting R with a re-defined locale (`env LC_CTYPE=en_US.UTF-8 R`), the output of `localeToCharset()` is:

    [1] "UTF-8"     "ISO8859-1"

The “Value” section of the documentation says: “A character vector naming an encoding and possibly a fallback single-encoding, NA if unknown.” But I do not understand why ISO8859-1 would be a fallback option here.

I do not know whether this is just a matter of documentation. My intuition is that `localeToCharset()` should work differently. At the moment, I have to rely on a few workarounds to cope with behavior I do not understand. (Or is there a better function to detect the encoding of the R session?)
Part of my analysis of the code of `localeToCharset()` is that it targets special scenarios on Windows and macOS, but not on Linux.

Kind regards
Andreas

-- 
Prof. Dr. Andreas Blaette
Professor of Public Policy and Regional Politics
University of Duisburg-Essen
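On the question of whether another function can report the session encoding: base R's `l10n_info()` returns flags for UTF-8 and Latin-1 sessions, which can be checked before falling back to `localeToCharset()`. The following is only a minimal sketch of such a workaround; the helper name `session_charset_sketch` is hypothetical, and its result in the Fedora container described above has not been verified here.

```r
# Hypothetical helper, not part of R or of any package mentioned above.
# It asks l10n_info() first and only then falls back to localeToCharset().
session_charset_sketch <- function() {
    info <- l10n_info()

    # l10n_info() reports logical flags named "UTF-8" and "Latin-1".
    if (isTRUE(info[["UTF-8"]])) return("UTF-8")
    if (isTRUE(info[["Latin-1"]])) return("ISO8859-1")

    # Otherwise use the first element of the locale-based guess,
    # which may be NA if the locale name is not recognised.
    utils::localeToCharset()[1L]
}

session_charset_sketch()
```

The helper always returns a character vector of length one, so callers only have to handle the `NA` case.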
Dear Andreas,

I think the R session in your Fedora Docker container is able to handle Unicode correctly, because I get the same output from `localeToCharset()` on my similarly configured ArchLinux system. I found some notes on setting the locale in Fedora [1].

On ArchLinux, I have long set my locale globally in /etc/locale.conf:

    LANG=en_GB.UTF-8
    LC_CTYPE="en_GB.UTF-8"
    LC_NUMERIC="en_GB.UTF-8"
    LC_TIME="en_DK.UTF-8"
    LC_COLLATE="en_GB.UTF-8"
    LC_MONETARY="nb_NO.UTF-8"
    LC_PAPER="nb_NO.UTF-8"
    LC_NAME="nb_NO.UTF-8"
    LC_ADDRESS="nb_NO.UTF-8"
    LC_TELEPHONE="nb_NO.UTF-8"
    LC_MEASUREMENT="nb_NO.UTF-8"
    LC_IDENTIFICATION="nb_NO.UTF-8"

On FreeBSD, it is customary to set the locale per user in ~/.login_conf:

    me:\
        :charset=UTF-8:\
        :lang=en_GB.UTF-8:\
        :setenv=LC_COLLATE=C:

OpenBSD, I think, has UTF-8 enabled for everything all of the time, having come from a state not long ago where it was largely unavailable ...

Best,
Rasmus

[1] https://docs.fedoraproject.org/en-US/Fedora/26/html/System_Administrators_Guide/ch-System_Locale_and_Keyboard_Configuration.html
On Mon, 31 Jan 2022 09:56:27 +0000
"Blätte, Andreas" <andreas.blaette at uni-due.de> wrote:

> After starting R with a re-defined locale (`env LC_CTYPE=en_US.UTF-8
> R`), the output of `localeToCharset()` is:
> [1] "UTF-8" "ISO8859-1"

> why ISO8859-1 might be a fallback option here?

ISO8859-1 seems to be offered because it covers the alphabet of American English. Obviously, this doesn't guarantee that the guess is correct. For example, I could symlink the ru_RU.KOI8-R locale on my system to name it "ru_RU", and localeToCharset() would return "ISO8859-5", not knowing the correct answer. ??????, anyone?

> Part of my analysis of the code of `localeToCharset()` is that it
> targets special scenarios on Windows and macOS, but not on Linux.

Well, it almost does the right thing. GNU/Linux locales are typically named like <language>_<country>.<encoding>, and localeToCharset() respects the <encoding> part, but only if the language and the country are specified. A quick fix for that would be to add one final case:

Index: src/library/utils/R/iconv.R
===================================================================
--- src/library/utils/R/iconv.R	(revision 81596)
+++ src/library/utils/R/iconv.R	(working copy)
@@ -135,6 +135,7 @@
             if(enc == "utf8") return(c("UTF-8", guess(ll)))
             else return(guess(ll))
         }
+        if (enc == "utf8") return("UTF-8") # fallback for ???.UTF-8
         return(NA_character_)
     }
 }

(Non-UTF-8 encodings on POSIX are handled above, in the if(nzchar(enc) && enc != "utf8") branch.)

Maybe a better fix would be to restructure the code a bit, to always take the encoding hint and then also try to guess if the locale looks like it provides a language code.

-- 
Best regards,
Ivan
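The restructuring suggested in the last paragraph (always take the encoding hint, then add a language-based guess if the name provides a language code) could look roughly like the sketch below. It is not the code in src/library/utils/R/iconv.R: the function name `charset_sketch` is hypothetical, the `guesses` table is a tiny illustrative subset and not R's real `guess()` table, and only POSIX-style locale names are considered.

```r
# Hypothetical helper, not part of R: "take the encoding hint first,
# then guess from the language code if there is one".
charset_sketch <- function(locale = Sys.getlocale("LC_CTYPE")) {
    if (locale %in% c("C", "POSIX")) return("ASCII")

    # Illustrative subset only (assumption); the real table is larger.
    guesses <- c(en = "ISO8859-1", de = "ISO8859-1", ru = "ISO8859-5")

    # Split "<name>.<encoding>[@modifier]" into its parts.
    parts <- strsplit(locale, ".", fixed = TRUE)[[1L]]
    name <- sub("@.*$", "", parts[1L])
    enc <- if (length(parts) == 2L) sub("@.*$", "", parts[2L]) else ""

    # Normalise the common spellings of UTF-8 (UTF-8, utf-8, utf8, UTF8).
    if (toupper(gsub("-", "", enc, fixed = TRUE)) == "UTF8") enc <- "UTF-8"

    # Language-based guess, only if the name looks like "ll_CC".
    fallback <- character()
    if (grepl("^[[:alpha:]]{2}_", name)) {
        ll <- tolower(substr(name, 1L, 2L))
        if (ll %in% names(guesses)) fallback <- guesses[[ll]]
    }

    # Encoding hint first, guess second; NA if neither is available.
    result <- unique(c(if (nzchar(enc)) enc, fallback))
    if (length(result)) result else NA_character_
}

charset_sketch("C.UTF-8")      # "UTF-8"
charset_sketch("en_US.UTF-8")  # "UTF-8" "ISO8859-1"
charset_sketch("ru_RU")        # "ISO8859-5"
```

With this ordering, a locale such as C.UTF-8 yields the encoding from its name even though it carries no language code, while a bare "ru_RU" still yields only the guess, which can be wrong, as in the symlink example above.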