thr3ads.net - R devel - [Rd] localeToCharset() [Jan 2022]

Blätte, Andreas

2022-Jan-31 09:56 UTC

[Rd] localeToCharset()

Dear all,

packages for processing text may need information on the charset of the R
session. In my packages RcppCWB and polmineR, I extract this information from
the locale using `localeToCharset()`. But when running cross-platform checks
(GitHub Actions and Docker), I repeatedly encounter unexpected behavior of
`localeToCharset()`.

As a reproducible example, I suggest using a local Fedora (latest) container,
started as follows:

docker pull fedora:latest
docker run -it fedora:latest /bin/bash

After installing R (`yum install -y R`) and starting R, `localeToCharset()`
returns `NA`. However, the locale part of `sessionInfo()` reads as follows:
 [1] LC_CTYPE=C.UTF-8       LC_NUMERIC=C           LC_TIME=C.UTF-8
 [4] LC_COLLATE=C.UTF-8     LC_MONETARY=C.UTF-8    LC_MESSAGES=C.UTF-8
 [7] LC_PAPER=C.UTF-8       LC_NAME=C              LC_ADDRESS=C
[10] LC_TELEPHONE=C         LC_MEASUREMENT=C.UTF-8 LC_IDENTIFICATION=C

If I run R CMD check on an arbitrary package in this environment at this stage,
I see:
* using session charset: UTF-8

However, the documentation says: "In the C locale the answer will be
"ASCII"." Why not UTF-8 in this case?

The `localeToCharset()` function also confuses me when I explicitly
redefine the locale. In my fresh Fedora Docker container, I need to install
English-language locales first:
dnf install langpacks-en

After starting R with a redefined locale (`env LC_CTYPE=en_US.UTF-8 R`), the
output of `localeToCharset()` is:
[1] "UTF-8"     "ISO8859-1"

The "Value" section of the documentation says: "A character vector naming an
encoding and possibly a fallback single-encoding, NA if unknown." But I do not
understand why ISO8859-1 might be a fallback option here.

I do not know whether this is just a matter of documentation. My intuition is
that `localeToCharset()` should work differently. At the moment, I need to rely
on a few workarounds to cope with the behavior I do not understand. (Or is
there a better function to detect the encoding of the R session?)

Part of my analysis of the code of `localeToCharset()` is that it targets
special scenarios on Windows and macOS, but not on Linux.
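
On the question of a better function: one candidate is `l10n_info()`, which reports what the running session itself detected instead of parsing the locale name. The following is only a minimal sketch. `session_charset()` is a hypothetical helper (it is not part of R), and the `codeset` element of `l10n_info()` is assumed to be present only on some platforms and R versions, so the code treats it as optional.

```r
# Hypothetical helper (not part of R): best-effort name of the session charset.
session_charset <- function() {
  info <- l10n_info()

  # l10n_info() documents logical flags for the two most common cases.
  if (isTRUE(info[["UTF-8"]]))   return("UTF-8")
  if (isTRUE(info[["Latin-1"]])) return("ISO8859-1")

  # Assumption: some platforms also report the codeset by name.
  # `[[` returns NULL if the element is absent, so this is safe either way.
  codeset <- info[["codeset"]]
  if (is.character(codeset) && length(codeset) == 1L && nzchar(codeset)) {
    return(codeset)
  }

  # Otherwise fall back to the guess derived from the locale name,
  # which may still be NA.
  utils::localeToCharset()[1L]
}

print(session_charset())
```

In the `C.UTF-8` container described above, this would be expected to take the first branch, provided `l10n_info()` reports the session as UTF-8 there.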

Kind regards
Andreas

--
Prof. Dr. Andreas Blaette
Professor of Public Policy and Regional Politics
University of Duisburg-Essen




Rasmus Liland

2022-Jan-31 10:35 UTC

Dear Andreas,

I think your R session is able to use
Unicode correctly in your Fedora Docker
container, because I get the same output
from `localeToCharset()` on my similarly
configured Arch Linux system.

I found some notes on setting the locale in
Fedora [1].

On Arch Linux, I set my locale globally
in /etc/locale.conf a long time ago:

	LANG=en_GB.UTF-8
	LC_CTYPE="en_GB.UTF-8"
	LC_NUMERIC="en_GB.UTF-8"
	LC_TIME="en_DK.UTF-8"
	LC_COLLATE="en_GB.UTF-8"
	LC_MONETARY="nb_NO.UTF-8"
	LC_PAPER="nb_NO.UTF-8"
	LC_NAME="nb_NO.UTF-8"
	LC_ADDRESS="nb_NO.UTF-8"
	LC_TELEPHONE="nb_NO.UTF-8"
	LC_MEASUREMENT="nb_NO.UTF-8"
	LC_IDENTIFICATION="nb_NO.UTF-8"

On FreeBSD, it is customary to set the
locale per user in your ~/.login_conf:

	me:\
		:charset=UTF-8:\
		:lang=en_GB.UTF-8:\
		:setenv=LC_COLLATE=C:

OpenBSD has UTF-8 enabled for everything
all of the time (I think), coming
from a state not long ago where it was
largely unavailable ...

Best,
Rasmus

[1]
https://docs.fedoraproject.org/en-US/Fedora/26/html/System_Administrators_Guide/ch-System_Locale_and_Keyboard_Configuration.html

Ivan Krylov

2022-Jan-31 11:32 UTC

On Mon, 31 Jan 2022 09:56:27 +0000
"Blätte, Andreas" <andreas.blaette at uni-due.de> wrote:
> After starting R with a re-defined locale (`env LC_CTYPE=en_US.UTF-8
> R`), the output of `localeToCharset()` is:
> [1] "UTF-8"     "ISO8859-1"
> why ISO8859-1 might be a fallback option here?
ISO8859-1 seems to be offered because it covers the alphabet of
American English. Obviously, this doesn't guarantee that the guess is
correct. For example, I could symlink the ru_RU.KOI8-R locale on my
system to name it "ru_RU", and localeToCharset() would return
"ISO8859-5", not knowing the correct answer. ??????, anyone?
> Part of my analysis of the code of `localeToCharset()` is that it
> targets special scenarios on Windows and macOS, but not on Linux.
Well, it almost does the right thing. GNU/Linux locales are typically
named like <language>_<country>.<encoding>, and localeToCharset()
respects the <encoding> part, but only if the language and the country
are specified. A quick fix for that would be to add one final case:

Index: src/library/utils/R/iconv.R
===================================================================
--- src/library/utils/R/iconv.R (revision 81596)
+++ src/library/utils/R/iconv.R (working copy)
@@ -135,6 +135,7 @@
             if(enc == "utf8") return(c("UTF-8", guess(ll)))
             else return(guess(ll))
         }
+        if (enc == "utf8") return("UTF-8") # fallback for ???.UTF-8
         return(NA_character_)
     }
 }
         return(NA_character_)
     }
 }

(Non-UTF-8 encodings on POSIX are handled above, in the if(nzchar(enc)
&& enc != "utf8") branch.)

Maybe a better fix would be to restructure the code a bit, to always
take the encoding hint and then also try to guess if the locale looks
like it provides a language code.
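
The restructuring idea (always honour the encoding part of the locale name, whatever precedes it) can be sketched in R roughly as follows. `charset_from_locale_name()` is a hypothetical helper, not part of R and not the patch proposed in this thread. It only parses the string and makes no attempt at the language-based guess.

```r
# Hypothetical helper (not part of R): take the encoding hint from a
# POSIX-style locale name such as "C.UTF-8" or "ru_RU.KOI8-R".
charset_from_locale_name <- function(locale = Sys.getlocale("LC_CTYPE")) {
  # Drop a trailing modifier such as "@euro".
  x <- sub("@.*$", "", locale)

  # No "." means the name carries no encoding hint at all.
  if (!grepl(".", x, fixed = TRUE)) return(NA_character_)

  # Keep everything after the first ".".
  enc <- sub("^[^.]*\\.", "", x)

  # Normalise the common spellings "UTF-8", "utf8", "UTF8".
  if (tolower(gsub("-", "", enc)) == "utf8") return("UTF-8")
  enc
}

# Compare the hint with what localeToCharset() returns for the same names.
for (loc in c("C", "C.UTF-8", "en_US.UTF-8", "ru_RU.KOI8-R")) {
  cat(sprintf("%-14s hint: %-8s localeToCharset: %s\n",
              loc,
              charset_from_locale_name(loc),
              paste(utils::localeToCharset(loc), collapse = ", ")))
}
```

The loop passes the locale names as plain strings, so it should not require the locales to be installed. The `localeToCharset()` column will differ between platforms and R versions, which is why no expected output is shown.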

-- 
Best regards,
Ivan
