thr3ads.net - R devel - [Rd] R 3.5.3 and 3.6.0 alpha Windows bug: UTF-8 characters in code are simplified to wrong ones [Apr 2019]


Tomáš Bořil

2019-Apr-10 08:22 UTC

[Rd] R 3.5.3 and 3.6.0 alpha Windows bug: UTF-8 characters in code are simplified to wrong ones

Hello,

There is a long-standing problem with processing UTF-8 source code in R
on Windows. Because Windows does not have a "UTF-8" locale and R passes
source code through the OS before executing it, some characters are
"simplified" by the OS before processing, leading to undesirable
changes.

Minimalistic example:
Let's type "ř" (LATIN SMALL LETTER R WITH CARON) in RGui console:
> "ř"
[1] "r"

Let's assume the following script:
# file [script.R]
if ("ř" != "\U00159") {
    stop("Problem: Unexpected character conversion.")
} else {
    cat("o.k.\n")
}

Problem:
source("script.R", encoding = "UTF-8")

OK (see
https://stackoverflow.com/questions/5031630/how-to-source-r-file-saved-using-utf-8-encoding):
eval(parse("script.R", encoding = "UTF-8"))

Although the script is in UTF-8, the characters are replaced by
"simplified" substitutes uncontrollably (depending on OS locale). The
same goes with simply entering the code statements in R Console.

The problem does not occur on OS with UTF-8 locale (Mac OS, Linux...)
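
A minimal sketch for reproducing the comparison without depending on how an editor saves the file. It writes the script above to a temporary file as explicit UTF-8 bytes and evaluates it through parse(), the route reported as working. The names code, script, exprs and out are illustrative and not part of the original report.

```r
# Build the test script from escapes so this file itself stays ASCII.
# "\u0159" is LATIN SMALL LETTER R WITH CARON; "\\U00159" is written to
# the file as the literal text \U00159.
code <- c(
  'if ("\u0159" != "\\U00159") {',
  '    stop("Problem: Unexpected character conversion.")',
  '} else {',
  '    cat("o.k.\\n")',
  '}'
)
script <- tempfile(fileext = ".R")   # hypothetical temporary path
writeLines(enc2utf8(code), script, useBytes = TRUE)

# parse() marks the string literals as UTF-8 but does not re-encode them.
exprs <- parse(script, encoding = "UTF-8")
out <- capture.output(eval(exprs))
print(out)
```

The source(script, encoding = "UTF-8") variant is left out on purpose, because its outcome depends on the native encoding of the session.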

Best regards
Tomas Boril
> R.version
               _
platform       x86_64-w64-mingw32
arch           x86_64
os             mingw32
system         x86_64, mingw32
status         alpha
major          3
minor          6.0
year           2019
month          04
day            07
svn rev        76333
language       R
version.string R version 3.6.0 alpha (2019-04-07 r76333)
nickname
> Sys.getlocale()
[1] "LC_COLLATE=English_United States.1252;LC_CTYPE=English_United States.1252;LC_MONETARY=English_United States.1252;LC_NUMERIC=C;LC_TIME=English_United States.1252"

Tomas Kalibera

2019-Apr-10 11:10 UTC


On 4/10/19 10:22 AM, Tomáš Bořil wrote:
> Hello,
>
> There is a long-lasting problem with processing UTF-8 source code in R
> on Windows OS. As Windows do not have "UTF-8" locale and R passes
> source code through OS before executing it, some characters are
> "simplified" by the OS before processing, leading to undesirable
> changes.
>
> Minimalistic example:
> Let's type "ř" (LATIN SMALL LETTER R WITH CARON) in RGui console:
>> "ř"
> [1] "r"
>
> Let's assume the following script:
> # file [script.R]
> if ("ř" != "\U00159") {
>      stop("Problem: Unexpected character conversion.")
> } else {
>      cat("o.k.\n")
> }
>
> Problem:
> source("script.R", encoding = "UTF-8")
>
> OK (see
> https://stackoverflow.com/questions/5031630/how-to-source-r-file-saved-using-utf-8-encoding):
> eval(parse("script.R", encoding = "UTF-8"))

On my system with your example,

>  source("t.r")
Error in eval(ei, envir) : Problem: Unexpected character conversion.
>  source("/Users/tomas/t.r", encoding="UTF-8")
Error in eval(ei, envir) : Problem: Unexpected character conversion.
>  eval(parse("t.r", encoding="UTF-8"))
o.k.

Which is expected, unfortunately. As per documentation of ?source, the
"encoding" argument tells source() that the input is in UTF-8, so that
source() can convert it to the native encoding. Again as documented,
parse() uses its encoding argument to mark the encoding of the strings,
but it does not re-encode, and the character strings in the parsed
result will as documented have the encoding mark (UTF-8 in this case).

> Although the script is in UTF-8, the characters are replaced by
> "simplified" substitutes uncontrollably (depending on OS locale). The
> same goes with simply entering the code statements in R Console.
>
> The problem does not occur on OS with UTF-8 locale (Mac OS, Linux...)
Yes. By default, Windows uses "best fit" when translating characters to
the native encoding. This could be changed in principle, but could break
existing applications that may depend on it, and it won't really help 
because such characters cannot be represented anyway. You can find more 
in ?Encoding, but yes, it is a known problem frequently encountered by 
users and unless Windows starts supporting UTF-8 as native encoding, 
there is no easy fix (a version from Windows 10 Insider preview supports 
it, so maybe that is not completely hopeless). In theory you can 
carefully read the documentation and use only functions that can work 
with UTF-8 without converting to native encoding, but pragmatically, if 
you want to work with UTF-8 files in R, it is best to use a non-Windows 
platform.
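
A small sketch of the encoding mark and the native translation described above, using only base R functions. The variable names are illustrative, and the value of survived is an assumption-free observation of the current session: it depends on the native encoding, so no particular result is claimed here.

```r
# A string literal written with a \u escape is marked as UTF-8.
x <- "\u0159"                       # LATIN SMALL LETTER R WITH CARON
mark <- Encoding(x)                 # encoding mark of the string
code_point <- utf8ToInt(x)          # 345, i.e. 0x159

# enc2native() performs the translation to the native encoding that many
# functions apply internally; compare code points before and after.
native <- enc2native(x)
survived <- identical(utf8ToInt(enc2utf8(native)), code_point)
cat("mark:", mark, " survived native translation:", survived, "\n")
```

In a session whose native encoding cannot represent the character, survived is expected to be FALSE, which is the situation discussed in this thread.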

Best
Tomas

Jeroen Ooms

2019-Apr-10 11:14 UTC


On Wed, Apr 10, 2019 at 12:19 PM Tomáš Bořil <borilt at gmail.com> wrote:
>
> Minimalistic example:
> Let's type "ř" (LATIN SMALL LETTER R WITH CARON) in RGui console:
> > "ř"
> [1] "r"
>
> Although the script is in UTF-8, the characters are replaced by
> "simplified" substitutes uncontrollably (depending on OS locale). The
> same goes with simply entering the code statements in R Console.
>
> The problem does not occur on OS with UTF-8 locale (Mac OS, Linux...)
I think this is a "feature" of win_iconv that is bundled with base R
on Windows (./src/extra/win_iconv). The character from your example is
not part of the latin1 (iso-8859-1) set; however, win_iconv seems to
convert it anyway:
> x <- "\U00159"
> print(x)
[1] "ř"
> iconv(x, 'UTF-8', 'iso-8859-1')
[1] "r"

On MacOS, iconv tells us this character cannot be represented as latin1:
> x <- "\U00159"
> print(x)
[1] "ř"
> iconv(x, 'UTF-8', 'iso-8859-1')
[1] NA

I'm actually not sure why base-R needs win_iconv (but I'm not an
encoding expert at all). Perhaps we could try to unbundle it and use
the standard libiconv provided by the Rtools toolchain bundle to get
more consistent results.
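
One way to detect a lossy conversion regardless of whether iconv returns NA or a substitute is a round-trip check. The following is only a sketch; converts_losslessly is a hypothetical helper name, not a base R function.

```r
# TRUE only if x survives conversion to `to` and back to UTF-8 unchanged.
# A "best fit" substitute such as "r" fails the comparison, and an NA
# result from iconv() is treated as a failure as well.
converts_losslessly <- function(x, to = "latin1") {
  x <- enc2utf8(x)
  converted <- iconv(x, from = "UTF-8", to = to)
  back <- iconv(converted, from = to, to = "UTF-8")
  !is.na(back) & back == x
}

print(converts_losslessly("\u00e9"))   # e with acute, present in latin1
print(converts_losslessly("\u0159"))   # r with caron, absent from latin1
```

The check does not rely on the platform's iconv signalling failure, so it should give the same answer whether the conversion yields NA or a substituted character.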

Tomas Kalibera

2019-Apr-10 11:26 UTC


On 4/10/19 1:14 PM, Jeroen Ooms wrote:
> On Wed, Apr 10, 2019 at 12:19 PM Tomáš Bořil <borilt at gmail.com> wrote:
>> Minimalistic example:
>> Let's type "ř" (LATIN SMALL LETTER R WITH CARON) in RGui console:
>>> "ř"
>> [1] "r"
>>
>> Although the script is in UTF-8, the characters are replaced by
>> "simplified" substitutes uncontrollably (depending on OS locale). The
>> same goes with simply entering the code statements in R Console.
>>
>> The problem does not occur on OS with UTF-8 locale (Mac OS, Linux...)
> I think this is a "feature" of win_iconv that is bundled with base R
> on Windows (./src/extra/win_iconv). The character from your example is
> not part of the latin1 (iso-8859-1) set, however, win-iconv seems to
> do so anyway:
>
>> x <- "\U00159"
>> print(x)
> [1] "ř"
>> iconv(x, 'UTF-8', 'iso-8859-1')
> [1] "r"
>
> On MacOS, iconv tells us this character cannot be represented as latin1:
>
>> x <- "\U00159"
>> print(x)
> [1] "ř"
>> iconv(x, 'UTF-8', 'iso-8859-1')
> [1] NA
>
> I'm actually not sure why base-R needs win_iconv (but I'm not an
> encoding expert at all). Perhaps we could try to unbundle it and use
> the standard libiconv provided by the Rtools toolchain bundle to get
> more consistent results.
win_iconv just calls into the Windows API to do the conversion. It is
technically easy to disable the "best fit" conversion, but I think it
won't be a good idea. In some cases, perhaps rare, the best fit is good,
actually including the conversion from "ř" to "r" which makes perfect
sense. But more importantly, changing the behavior could affect users
who expect the substitution to happen because it has been happening for
many years, and it won't help others much.

Tomas

Tomáš Bořil

2019-Apr-10 16:13 UTC


Yes, again in a script sourced by source(encoding = ...). But also by
typing it directly in the R console.

Most of the time, I use RStudio as a front-end. For this experiment, I
also verified it in Rgui. Both front-ends behave in exactly the same
way.

An optional parameter to the source() function which would translate all
UTF-8 characters in string literals to their "\Uxxxx" codes sounds like
a great idea (and I hope it would fix 99.9% of the problems I have,
because that is how I work around these problems nowadays), and the
same behaviour on the command line...

Tomas
> What do you mean it is "converted before"? Under what context? Again a
> script sourced by source(encoding=) ?
>
> And, are you using Rgui as front-end?
>>   The only problem is that I
>> cannot simply use enc2utf8("œ") - it is converted to "o" before
>> executing the function. Instead of that, I have to explicitly type
>> "\U00159" throughout my code.

On Wed, Apr 10, 2019 at 5:29 PM Tomas Kalibera <tomas.kalibera at gmail.com> wrote:
>
> On 4/10/19 3:02 PM, Tomáš Bořil wrote:
> > The thing is, I would rather prefer R (in those rare occasions where an
> > old function does not support anything but ANSI encoding) throwing an
> > error:
> > "Unicode encoding not supported, please change the string in your
> > code" instead of silently converting some characters to different ones
> > without any warning.
> In principle it probably could be optional as Yihui Xie asks on R-devel,
> we will discuss that internally. If the Windows "best fit" is a big
> problem on its own, this is something that could be done quickly, if
> optional. We could turn into error only conversions that we have control
> of (inside R code), indeed, but that should be most.
> > I understand that there are some functions which are not
> > Unicode-compatible yet but according to the Stackoverflow discussion I
> > cited before, in many cases (90% or more?) everything works right with
> > Encoding("\U00159") == "UTF-8" (in my scripts, I have not found any
> > problem with explicit UTF-8 coding yet).
>
> Well there has been a lot of effort invested to make that possible, so
> that many internal string functions do not convert unnecessarily into
> UTF-8, mostly by Duncan Murdoch, but much more needs to be done and
> there is the problem with packages. Of course if you find a concrete R
> function that unnecessarily converts (source() is debatable, I know
> about it, so some other), you are welcome to report, I or someone can
> fix. A common problem is I/O (connections) and there the fix won't be
> easy, it would have to be re-designed. The problem is that when we have
> something typed "char *" inside R, it needs to be always in native
> encoding, any mix would lead to total chaos.
>
> The full solution would however only be fully switching to UTF-8
> internally on Windows (and then char * would always mean UTF-8), we have
> discussed this many times inside R Core (and many times before I
> joined), I am sure it will be discussed again at some point and we are
> aware of course of the problem. Please trust us it is hard to do - we
> know the code as we (collectively) have written it. People contributing
> to SO are users and package developers, not developers of the core. You
> can get more correct information from people on R-devel (package
> developers and sometimes core developers).
>
> >   The only problem is that I
> > cannot simply use enc2utf8("œ") - it is converted to "o" before
> > executing the function. Instead of that, I have to explicitly type
> > "\U00159" throughout my code.
>
> What do you mean it is "converted before"? Under what context? Again a
> script sourced by source(encoding=) ?
>
> And, are you using Rgui as front-end?
>
> > In my lectures, I have Czech, Russian and English students and it is
> > also impossible to create a script that works for everyone. In fact, I
> > know that Czech "ř" can be translated to my native (Czech) encoding. I
> > have just chosen the example as it is reproducible in English locale.
>
>
> > Originally, I had a problem with an IPA character (phonetic symbol) "œ",
> > i.e. "\U00153". In Czech locale, it is translated to "o". In English,
> > it is not converted - it remains "œ". But if I use "\U00153" in Czech
> > locale, nothing is converted and everything works right.
>
> Yes, the \u* sequence I hear is commonly used to represent UTF-8 string
> literals in something that is not UTF-8 itself. Note if you have a
> package, you can have R source files with UTF-8 encoded literal strings
> if you declare Encoding: UTF-8 in the DESCRIPTION file (see Writing R
> Extensions for details), even though sometimes people run into
> trouble/bugs as well.
>
> You probably know none of these problems exist on Linux nor macOS, where
> UTF-8 is the native encoding.
>
> Tomas
>
> >
> > Tomas
> >
> >
> >
> > On Wed, Apr 10, 2019 at 2:37 PM Tomas Kalibera <tomas.kalibera at gmail.com> wrote:
> >> On 4/10/19 2:06 PM, Tomáš Bořil wrote:
> >>
> >> Thank you for the explanation but I just do not understand one thing -
> >> why would R need to be rewritten from scratch to work with Unicode
> >> internally?
> >>
> >> If I call the script with
> >> eval(parse("script.R", encoding = "UTF-8"))
> >> it works perfectly - it looks like R functions already support
> >> Unicode. When I type "\U00159", R also has no problem with that.
> >>
> >> Well there is support for unicode, but the problem is that at some
> >> point translation to native encoding is needed. The parser does not do
> >> that, nothing you call in your example script does it, but many other
> >> functions do. Note that you can use UTF-8 without problems as long as
> >> you only have characters that can be represented also in the current
> >> native encoding. So, if you run in a Czech locale, Czech characters in
> >> UTF-8 will work fine, just they will sometimes be translated to
> >> corresponding Czech characters in your native encoding.
> >>
> >> If you want to learn more about encodings in R, look at ?Encoding,
> >> Writing R Extensions, etc. In principle, every R object representing a
> >> string has a flag whether the string is in UTF-8, in latin1, or in
> >> current native encoding. But C structures typed "char *" almost always
> >> are in current native encoding, any mixture would lead to chaos. Most
> >> functions operating on strings have to specially handle UTF-8, MBCS
> >> encodings, ASCII, etc. All of that would have to be rewritten. Many
> >> Windows API calls are still using the native encoding version (some
> >> can use UTF16-LE via conversion from UTF-8 or other encodings).
> >>
> >> In principle, it should work to have UTF-8 coded string constants in R
> >> programs, and definitely so if you use \uxxxx (see Writing R
> >> Extensions for details). But you should always run in a native
> >> encoding where these characters can be represented, otherwise it may
> >> or may not work, depending on which functions you call.
> >>
> >> Tomas
> >>
> >>
> >> Thanks,
> >> Tomas
> >>
> >> On Wed, Apr 10, 2019 at 13:52, Tomas Kalibera <tomas.kalibera at gmail.com> wrote:
> >>> On 4/10/19 1:35 PM, Tomáš Bořil wrote:
> >>>> Which users make their code depending on an automatic conversion which
> >>>> behaves differently in each European country, but only on Windows?
> >>> I meant the "best fit". The same R scripts for the same data sets would
> >>> be returning different results, people capture existing behavior without
> >>> necessarily knowing about it. Removing the "best fit" would not remove
> >>> the translation to native encoding, you would get NA or some escape
> >>> sequence/character code number instead of the "best fit" character. It
> >>> would not solve the problem.
> >>>
> >>> The real problem is that the conversion to native encoding happens. This
> >>> question has been discussed many times before, but in short, it would
> >>> take probably many 1000s of hours of developer time to rewrite R to use
> >>> UTF-8 internally, but convert to UTF16-LE in all Windows API calls. It
> >>> will cause changes to documented behavior. What may not be obvious,
> >>> there is a problem with package code written in C/C++ that ignores
> >>> encoding flags (that is almost all native code in packages). That code
> >>> will stop working and there will be no way to test - because the input
> >>> data in the contributed examples/tests are ASCII.
> >>>
> >>> If Windows starts supporting UTF-8 as native encoding, the fix will be a
> >>> lot easier (I hope ~100 hours), and without the compatibility problems -
> >>> just users who would wish to use UTF-8 as native encoding will be
> >>> affected, and things will probably work for them even with poorly
> >>> written packages.
> >>>
> >>> Tomas
> >>>
> >>>
> >>>> If someone needs the explicit conversion, he can call the iconv() function.
> >>>>
> >>>> Many more people using R for text processing are frustrated they can
> >>>> code only in ASCII (0-255), even though their code is saved in
> >>>> Unicode.
> >>>>
> >>>> Tomas
> >>>>
> >>>>
> >>>>
> >>>>

Tomas Kalibera

2019-Apr-11 06:10 UTC


On 4/10/19 6:13 PM, Tom?? Bo?il wrote:
> An optional parameter to source() function which would translate all
> UTF-8 characters in string literals to their "\Uxxxx" codes sounds as
> a great idea (and I hope it would fix 99.9% of problems I have -
> because that is the way I overcome these problems nowadays) - and the
> same behaviour in command line...
I was not suggesting to convert to \Uxxxx in source(). Some users do it
in their programs by hand or with an external utility. source() in
principle could be made to work similarly to eval(parse(file,encoding=))
with respect to encodings, via other means. We will consider that, but
there are many remaining places where the conversion happens - a trivial
one is that currently you cannot print the result of the parse() from
your example properly. Maybe you don't trigger such problems in your
scripts in obvious ways, but as I said before, if you want to work
reliably with characters not representable in the current native
encoding, in the current or a near-future version of R, use Linux or
macOS.
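
The kind of utility mentioned above, one that rewrites non-ASCII characters as \U escapes, could look roughly like the sketch below. escape_non_ascii is a hypothetical helper written for illustration, not an existing R function, and it handles a character vector of string contents rather than a whole source file.

```r
# Replace every non-ASCII character in each string with a \UXXXXXXXX escape.
escape_non_ascii <- function(x) {
  vapply(enc2utf8(x), function(s) {
    cp <- utf8ToInt(s)                         # Unicode code points
    chars <- intToUtf8(cp, multiple = TRUE)    # one string per character
    paste0(ifelse(cp < 128L, chars, sprintf("\\U%08X", cp)), collapse = "")
  }, character(1), USE.NAMES = FALSE)
}

escaped <- escape_non_ascii("a\u0159")
cat(escaped, "\n")

# Parsing the escaped text as a string literal restores the original.
restored <- eval(parse(text = paste0('"', escaped, '"')))
```

The sketch does not escape quotes or backslashes already present in the input, so pasting its output into source code is only safe for strings without those characters.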

Tomas
