thr3ads.net - R devel - [Rd] localeToCharset() [Jan 2022]

Blätte, Andreas

2022-Jan-31 11:38 UTC

[Rd] localeToCharset()

Dear Ivan,

This is a very helpful explanation! I think it is important to make the output of
localeToCharset() more predictable. My concern is not how to set the locale so
that things work after all; the problem is that users see unexpected results.
I guess I owe a suggestion for how to improve the code, but your suggestion
looks like a very good starting point.

Andreas 

On 31.01.22, 12:32, "Ivan Krylov" <krylov.r00t at gmail.com> wrote:

    On Mon, 31 Jan 2022 09:56:27 +0000
    "Blätte, Andreas" <andreas.blaette at uni-due.de> wrote:

    > After starting R with a re-defined locale (`env LC_CTYPE=en_US.UTF-8
    > R`), the output of `localeToCharset()` is:
    > [1] "UTF-8"     "ISO8859-1"

    > why ISO8859-1 might be a fallback option here?

    ISO8859-1 seems to be offered because it covers the alphabet of
    American English. Obviously, this doesn't guarantee that the guess is
    correct. For example, I could symlink the ru_RU.KOI8-R locale on my
    system to name it "ru_RU", and localeToCharset() would return
    "ISO8859-5", not knowing the correct answer. ??????, anyone?

    > Part of my analysis of the code of `localeToCharset()` is that it
    > targets special scenarios on Windows and macOS, but not on Linux.

    Well, it almost does the right thing. GNU/Linux locales are typically
    named like <language>_<country>.<encoding>, and localeToCharset()
    respects the <encoding> part, but only if the language and the country
    are specified. A quick fix for that would be to add one final case:

    Index: src/library/utils/R/iconv.R
    ===================================================================
    --- src/library/utils/R/iconv.R (revision 81596)
    +++ src/library/utils/R/iconv.R (working copy)
    @@ -135,6 +135,7 @@
                 if(enc == "utf8") return(c("UTF-8", guess(ll)))
                 else return(guess(ll))
             }
    +        if (enc == "utf8") return("UTF-8") # fallback for ???.UTF-8
             return(NA_character_)
         }
     }

    (Non-UTF-8 encodings on POSIX are handled above, in the if(nzchar(enc)
    && enc != "utf8") branch.)

    Maybe a better fix would be to restructure the code a bit, to always
    take the encoding hint and then also try to guess if the locale looks
    like it provides a language code.

    -- 
    Best regards,
    Ivan
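
For readers who cannot patch R itself, the same fallback idea can be sketched at user level. This is only an illustration, not the patch above: the helper name charset_with_fallback() is hypothetical, and it assumes the locale name uses the <name>.<codeset>[@modifier] layout discussed in the thread.

```r
# Hypothetical helper, not part of R: use localeToCharset() when it gives an
# answer, otherwise fall back to an explicit UTF-8 codeset suffix.
charset_with_fallback <- function(locale = Sys.getlocale("LC_CTYPE")) {
  guess <- utils::localeToCharset(locale)
  if (length(guess) > 0L && !anyNA(guess)) {
    return(guess)
  }

  # Split "<name>.<codeset>[@modifier]" on the dot (assumed layout).
  parts <- strsplit(locale, ".", fixed = TRUE)[[1L]]
  if (length(parts) != 2L) {
    return(NA_character_)
  }
  codeset <- sub("@.*$", "", parts[2L])

  # Accept the common spellings "UTF-8", "utf-8", "UTF8" and "utf8".
  if (toupper(gsub("-", "", codeset, fixed = TRUE)) == "UTF8") {
    "UTF-8"
  } else {
    NA_character_
  }
}

# A locale name without a language part, as in the fallback case of the patch.
charset_with_fallback("C.UTF-8")
```

The wrapper only recognises UTF-8 in the fallback branch, mirroring the single case the patch adds; other codesets are left to localeToCharset() itself.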

Simon Urbanek

2022-Jan-31 12:16 UTC

[Rd] localeToCharset()

Andreas,

The output is very predictable, so this is not about predictability. Note that
C.UTF-8 is technically an invalid locale by the semantic rules (see below).
Also note that the C locale is "C": it is not allowed to have any string after
the C (or else it is not the C locale), so what you have is NOT a C locale
(see POSIX 7.2).

The issue here is that the POSIX standard provides no semantic rules: locale
names can be arbitrary, and the only defined one is C (and its synonym POSIX).
All others are arbitrary locales that can do whatever they want. Later, some
systems introduced semantic guidelines such as the
<language>_<territory>.<codeset> convention, and that is what
localeToCharset() expects so that it can try to guess the charset for that
language. Since C.UTF-8 is such an aberration (not in the standard form),
localeToCharset() doesn't know about it and returns NA since it can't
guess the language.

Long story short, C.UTF-8 breaks all common rules and has been introduced fairly
recently on some Linux systems, so R does not know about it yet.
Ivan's patch fixes that. That aside, locale names have no official provision
for specifying the charset, so all you get is a guess that assumes the system
follows the common rules.

Cheers,
Simon
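
To make the naming convention concrete, here is a small sketch that splits a locale name into the <language>_<territory>.<codeset> parts and reports when a name does not fit. The function split_locale_name() and its regular expression are assumptions made for illustration; this is not how localeToCharset() is implemented.

```r
# Hypothetical helper: split a locale name that follows the common
# <language>_<territory>.<codeset>[@modifier] convention.
split_locale_name <- function(locale) {
  pattern <- "^([[:alpha:]]{2,3})_([[:alnum:]]+)(\\.([^@]+))?(@.*)?$"
  if (!grepl(pattern, locale)) {
    return(NULL)  # the name does not follow the convention
  }
  list(
    language  = sub(pattern, "\\1", locale),
    territory = sub(pattern, "\\2", locale),
    codeset   = sub(pattern, "\\4", locale)  # "" when no codeset is given
  )
}

split_locale_name("en_US.UTF-8")  # language "en", territory "US", codeset "UTF-8"
split_locale_name("ru_RU")        # codeset is "", so the charset must be guessed
split_locale_name("C.UTF-8")      # NULL: no language or territory to guess from
```

The last call shows why a name such as C.UTF-8 gives a language-based guesser nothing to work with.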

> ______________________________________________
> R-devel at r-project.org mailing list
> https://stat.ethz.ch/mailman/listinfo/r-devel

Tomas Kalibera

2022-Jan-31 12:32 UTC

[Rd] localeToCharset()

Hi Andreas,

Is there still any higher-level problem left that you need to solve? Ideally
one wouldn't need to query what the native encoding is, but would directly use
iconv(), or indirectly other R functions, to convert the data from/to the
native encoding. iconv() will find out internally what the native
encoding is (via data that is also available from l10n_info(), but with care
for differences between OSes).

Best
Tomas
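
As a rough illustration of that advice, the sketch below converts a string to and from the native encoding without ever asking for its name. The sample string and the sub = "?" choice are assumptions made for the example; in a locale that cannot represent the character, the substitution appears in the output.

```r
x <- "Bl\u00e4tte"   # UTF-8 string; the escape keeps this source file ASCII

# to = "" asks iconv() for the current native encoding; sub = "?" replaces
# bytes that the native encoding cannot represent instead of returning NA.
native <- iconv(x, from = "UTF-8", to = "", sub = "?")

# And back again, still without naming the native encoding explicitly.
utf8 <- iconv(native, from = "", to = "UTF-8")

# l10n_info() reports what R knows about the native encoding.
info <- l10n_info()
info[["UTF-8"]]      # TRUE when the session runs in a UTF-8 locale
```

enc2native() and enc2utf8() cover the same round trip for strings whose encoding is already marked.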
