Dear Tomas,
thanks a lot. I understand Simon's explanation now; I was not aware of the
standardization issue. My conclusion is that I should rely on another approach
to detect the session charset, and your suggestions are my first option.
My final thought: for users who do not know the POSIX standards and recent
aberrations, a warning might be helpful, something such as:
if (startsWith(locale, "C."))
    warning(sprintf("%s is a non-standard locale", locale))
As far as I am concerned, I take away a lot from this discussion! Thank you!
Kind regards
Andreas
On 31.01.22, 13:32, "Tomas Kalibera" <tomas.kalibera at gmail.com> wrote:
Hi Andreas,
is there still any higher-level problem left that you need to solve? Ideally
one wouldn't need to query what the native encoding is, but would directly use
iconv(), or indirectly other R functions, to convert the data from/to the
native encoding. iconv() finds out internally what the native encoding is
(via data that is also available from l10n_info(), but with care for
differences between OSes).
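
A minimal sketch of that approach, assuming the input is known to be UTF-8 (the sample string and the variable names are made up for illustration). Passing "" as the encoding name to iconv() refers to the encoding of the current locale, so the native encoding never has to be named explicitly:

```r
# Example input (an assumption for illustration): a string known to be UTF-8.
x <- "Bl\u00e4tte"

# to = "" (the default) means "the encoding of the current locale".
# With the default sub = NA, the result is typically NA if the native
# encoding cannot represent the text.
native <- iconv(x, from = "UTF-8", to = "")

# from = "" (the default) likewise means the native encoding.
back <- iconv(native, from = "", to = "UTF-8")

# l10n_info() exposes what R knows about the native encoding, for inspection.
is_utf8_session <- isTRUE(l10n_info()[["UTF-8"]])
```

Whether the round trip preserves the text depends on the session locale, so checking the result for NA before using it seems prudent.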
Best
Tomas
On 1/31/22 12:38, Blätte, Andreas wrote:
> Dear Ivan,
>
> this is a very helpful explanation! I think it is important to make the
> output of localeToCharset() more predictable. My problem is essentially not
> how to set the locale such that things will work after all; the problem is
> that you see unexpected results. I guess I owe a suggestion on how to improve
> the code, but your suggestion looks like a very good starting point.
>
> Andreas
>
> On 31.01.22, 12:32, "Ivan Krylov" <krylov.r00t at gmail.com> wrote:
>
> On Mon, 31 Jan 2022 09:56:27 +0000
> "Blätte, Andreas" <andreas.blaette at uni-due.de> wrote:
>
> > After starting R with a re-defined locale
> > (`env LC_CTYPE=en_US.UTF-8 R`), the output of `localeToCharset()` is:
> > [1] "UTF-8" "ISO8859-1"
>
> > why ISO8859-1 might be a fallback option here?
>
> ISO8859-1 seems to be offered because it covers the alphabet of
> American English. Obviously, this doesn't guarantee that the guess is
> correct. For example, I could symlink the ru_RU.KOI8-R locale on my
> system to name it "ru_RU", and localeToCharset() would return
> "ISO8859-5", not knowing the correct answer. ??????, anyone?
>
> > Part of my analysis of the code of `localeToCharset()` is that it
> > targets special scenarios on Windows and macOS, but not on Linux.
>
> Well, it almost does the right thing. GNU/Linux locales are typically
> named like <language>_<country>.<encoding>, and localeToCharset()
> respects the <encoding> part, but only if the language and the country
> are specified. A quick fix for that would be to add one final case:
>
> Index: src/library/utils/R/iconv.R
> ==================================================================
> --- src/library/utils/R/iconv.R (revision 81596)
> +++ src/library/utils/R/iconv.R (working copy)
> @@ -135,6 +135,7 @@
> if(enc == "utf8") return(c("UTF-8", guess(ll)))
> else return(guess(ll))
> }
> + if (enc == "utf8") return("UTF-8") # fallback for ???.UTF-8
> return(NA_character_)
> }
> }
>
> (Non-UTF-8 encodings on POSIX are handled above, in the
> if(nzchar(enc) && enc != "utf8") branch.)
>
> Maybe a better fix would be to restructure the code a bit, to always
> take the encoding hint and then also try to guess if the locale looks
> like it provides a language code.
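
The restructuring suggested above could look roughly like the following sketch. It is not the code in src/library/utils/R/iconv.R: the function name locale_charsets() and the three-entry language table are assumptions made for illustration. The encoding hint is always taken first, and a language-based guess is appended only if the locale name provides a known language code:

```r
# Hypothetical sketch, not R's localeToCharset() implementation.
# Parses a locale name of the form <language>_<country>.<encoding>@<modifier>.
locale_charsets <- function(locale) {
  # Deliberately tiny, illustrative language -> charset table (an assumption).
  guesses <- c(en = "ISO8859-1", de = "ISO8859-1", ru = "ISO8859-5")

  name <- sub("@.*$", "", locale)          # drop an optional "@modifier"
  enc  <- if (grepl(".", name, fixed = TRUE)) sub("^[^.]*\\.", "", name) else ""
  lang <- sub("[_.].*$", "", name)         # text before the first "_" or "."

  hint <- character()
  if (nzchar(enc)) {
    # Normalise common spellings of UTF-8; keep any other hint as given.
    is_utf8 <- tolower(gsub("-", "", enc, fixed = TRUE)) == "utf8"
    hint <- if (is_utf8) "UTF-8" else enc
  }
  guess <- if (lang %in% names(guesses)) unname(guesses[lang]) else character()

  # Encoding hint first, language-based guess second, NA if neither exists.
  out <- unique(c(hint, guess))
  if (length(out)) out else NA_character_
}

locale_charsets("en_US.UTF-8")  # "UTF-8" "ISO8859-1"
locale_charsets("C.UTF-8")      # "UTF-8"
locale_charsets("ru_RU")        # "ISO8859-5"
```

With this ordering, a name without a language or country part still yields its encoding hint, and a name without an encoding part falls back to the guess alone.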
>
> --
> Best regards,
> Ivan
>
> ______________________________________________
> R-devel at r-project.org mailing list
> https://stat.ethz.ch/mailman/listinfo/r-devel