thr3ads.net - R devel - [Rd] R 3.5.3 and 3.6.0 alpha Windows bug: UTF-8 characters in code are simplified to wrong ones [Apr 2019]

Duncan Murdoch

2019-Apr-10 15:45 UTC

[Rd] R 3.5.3 and 3.6.0 alpha Windows bug: UTF-8 characters in code are simplified to wrong ones

On 10/04/2019 10:29 a.m., Yihui Xie wrote:
> Since it is "technically easy" to disable the best fit conversion and
> the best fit is rarely good, how about providing an option for
> code/package authors to disable it? I'm asking because this is one of
> the most painful issues in packages that may need to source() code
> containing UTF-8 characters that are not representable in the Windows
> native encoding. Examples include knitr/rmarkdown and shiny. Basically
> users won't be able to knit documents or run Shiny apps correctly when
> the code contains characters that cannot be represented in the native
> encoding.

Wouldn't things be worse with it disabled than currently?  I'd expect
the line containing the "ř" to end up as NA instead of converting to
"r".

Of course, it would be best to be able to declare source files as UTF-8 
and avoid any conversion at all, but as Tomas said, that's a lot harder.

Duncan Murdoch
> 
> Regards,
> Yihui
> --
> https://yihui.name
> 
> On Wed, Apr 10, 2019 at 6:36 AM Tomas Kalibera <tomas.kalibera at gmail.com> wrote:
>>
>> On 4/10/19 1:14 PM, Jeroen Ooms wrote:
>>> On Wed, Apr 10, 2019 at 12:19 PM Tomáš Bořil <borilt at gmail.com> wrote:
>>>> Minimalistic example:
>>>> Let's type "ř" (LATIN SMALL LETTER R WITH CARON) in RGui console:
>>>>> "ř"
>>>> [1] "r"
>>>>
>>>> Although the script is in UTF-8, the characters are replaced by
>>>> "simplified" substitutes uncontrollably (depending on OS locale). The
>>>> same goes with simply entering the code statements in R Console.
>>>>
>>>> The problem does not occur on OS with UTF-8 locale (Mac OS, Linux...)
>>> I think this is a "feature" of win_iconv that is bundled with base R
>>> on Windows (./src/extra/win_iconv). The character from your example is
>>> not part of the latin1 (iso-8859-1) set, however, win-iconv seems to
>>> do so anyway:
>>>
>>>> x <- "\U00159"
>>>> print(x)
>>> [1] "ř"
>>>> iconv(x, 'UTF-8', 'iso-8859-1')
>>> [1] "r"
>>>
>>> On MacOS, iconv tells us this character cannot be represented as latin1:
>>>
>>>> x <- "\U00159"
>>>> print(x)
>>> [1] "ř"
>>>> iconv(x, 'UTF-8', 'iso-8859-1')
>>> [1] NA
>>>
>>> I'm actually not sure why base-R needs win_iconv (but I'm not an
>>> encoding expert at all). Perhaps we could try to unbundle it and use
>>> the standard libiconv provided by the Rtools toolchain bundle to get
>>> more consistent results.
>>
>> win_iconv just calls into Windows API to do the conversion, it is
>> technically easy to disable the "best fit" conversion, but I think it
>> won't be a good idea. In some cases, perhaps rare, the best fit is good,
>> actually including the conversion from "ř" to "r" which makes perfect
>> sense. But more importantly, changing the behavior could affect users
>> who expect the substitution to happen because it has been happening for
>> many years, and it won't help others much.
>>
>> Tomas
>>
>>>
>>> ______________________________________________
>>> R-devel at r-project.org mailing list
>>> https://stat.ethz.ch/mailman/listinfo/r-devel

Jeroen Ooms

2019-Apr-10 16:32 UTC

On Wed, Apr 10, 2019 at 5:45 PM Duncan Murdoch <murdoch.duncan at gmail.com> wrote:
>
> On 10/04/2019 10:29 a.m., Yihui Xie wrote:
> > Since it is "technically easy" to disable the best fit conversion and
> > the best fit is rarely good, how about providing an option for
> > code/package authors to disable it? I'm asking because this is one of
> > the most painful issues in packages that may need to source() code
> > containing UTF-8 characters that are not representable in the Windows
> > native encoding. Examples include knitr/rmarkdown and shiny. Basically
> > users won't be able to knit documents or run Shiny apps correctly when
> > the code contains characters that cannot be represented in the native
> > encoding.
>
> Wouldn't things be worse with it disabled than currently?  I'd expect
> the line containing the "ř" to end up as NA instead of converting to "r".

I don't think it would be worse, because in this case R would not
implicitly convert strings to (best fit) latin1 on Windows, but
instead keep the (correct) string in its UTF-8 encoding. The NA only
appears if the user explicitly forces a conversion to latin1, which is
not the problem here I think.

The original problem that I can reproduce in RGui is that if you enter
"ř" in RGui, R opportunistically converts this to latin1, because it
can. However if you enter text which can definitely not be represented
in latin1, R encodes the string correctly in UTF-8 form.
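
The two outcomes discussed in this thread (NA on macOS/Linux, a best-fit "r" on Windows) can both be detected with a round-trip check. The following is a minimal sketch using only base R; the helper name `is_lossy` is hypothetical and not part of R. It converts a string to the target encoding and back, and reports whether the original survived.

```r
# Hypothetical helper (not part of base R): TRUE if converting `x`
# from UTF-8 to the encoding `to` loses information.
is_lossy <- function(x, to = "ISO-8859-1") {
  x <- enc2utf8(x)
  y <- iconv(x, from = "UTF-8", to = to)
  # NA means iconv refused the conversion (the macOS result quoted earlier).
  if (is.na(y)) return(TRUE)
  # Otherwise convert back; a best-fit substitute will not round-trip.
  back <- iconv(y, from = to, to = "UTF-8")
  !identical(back, x)
}

x <- "\u0159"        # LATIN SMALL LETTER R WITH CARON
is_lossy(x)          # lossy whether iconv returns NA or a substitute
is_lossy("\u00e9")   # e with acute accent is part of latin1
```

A check like this only reports that a conversion would lose information; it does not prevent R from performing the implicit conversion in the first place.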

Duncan Murdoch

2019-Apr-10 16:46 UTC

On 10/04/2019 12:32 p.m., Jeroen Ooms wrote:
> On Wed, Apr 10, 2019 at 5:45 PM Duncan Murdoch <murdoch.duncan at gmail.com> wrote:
>>
>> On 10/04/2019 10:29 a.m., Yihui Xie wrote:
>>> Since it is "technically easy" to disable the best fit conversion and
>>> the best fit is rarely good, how about providing an option for
>>> code/package authors to disable it? I'm asking because this is one of
>>> the most painful issues in packages that may need to source() code
>>> containing UTF-8 characters that are not representable in the Windows
>>> native encoding. Examples include knitr/rmarkdown and shiny. Basically
>>> users won't be able to knit documents or run Shiny apps correctly when
>>> the code contains characters that cannot be represented in the native
>>> encoding.
>>
>> Wouldn't things be worse with it disabled than currently?  I'd expect
>> the line containing the "ř" to end up as NA instead of converting to "r".
>
> I don't think it would be worse, because in this case R would not
> implicitly convert strings to (best fit) latin1 on Windows, but
> instead keep the (correct) string in its UTF-8 encoding. The NA only
> appears if the user explicitly forces a conversion to latin1, which is
> not the problem here I think.
>
> The original problem that I can reproduce in RGui is that if you enter
> "ř" in RGui, R opportunistically converts this to latin1, because it
> can. However if you enter text which can definitely not be represented
> in latin1, R encodes the string correctly in UTF-8 form.
>
I think the pathways for text in RGui and text being sourced are 
different.  I agree fixing RGui in that way would make sense, but Yihui 
was talking about source().

Duncan Murdoch

Tomas Kalibera

2019-Apr-11 06:25 UTC

On 4/10/19 6:32 PM, Jeroen Ooms wrote:
> On Wed, Apr 10, 2019 at 5:45 PM Duncan Murdoch <murdoch.duncan at gmail.com> wrote:
>> On 10/04/2019 10:29 a.m., Yihui Xie wrote:
>>> Since it is "technically easy" to disable the best fit conversion and
>>> the best fit is rarely good, how about providing an option for
>>> code/package authors to disable it? I'm asking because this is one of
>>> the most painful issues in packages that may need to source() code
>>> containing UTF-8 characters that are not representable in the Windows
>>> native encoding. Examples include knitr/rmarkdown and shiny. Basically
>>> users won't be able to knit documents or run Shiny apps correctly when
>>> the code contains characters that cannot be represented in the native
>>> encoding.
>> Wouldn't things be worse with it disabled than currently?  I'd expect
>> the line containing the "ř" to end up as NA instead of converting to "r".
> I don't think it would be worse, because in this case R would not
> implicitly convert strings to (best fit) latin1 on Windows, but
> instead keep the (correct) string in its UTF-8 encoding. The NA only
> appears if the user explicitly forces a conversion to latin1, which is
> not the problem here I think.
>
> The original problem that I can reproduce in RGui is that if you enter
> "ř" in RGui, R opportunistically converts this to latin1, because it
> can. However if you enter text which can definitely not be represented
> in latin1, R encodes the string correctly in UTF-8 form.

Rgui is a "Windows Unicode" application (it uses UTF-16LE), but it needs
to convert the input to the native encoding before passing it to R, which
is based on locales. However, R passes that string on to the parser, and
Rgui takes advantage of this by converting non-representable characters
to their \uxxxx escapes, which the parser understands. Using this trick,
Unicode characters can get to the parser from Rgui (but of course they
are then still at risk of conversion later when the program runs). Rgui
only escapes characters that cannot be represented; unfortunately, the
standard C99 API for that, as implemented on Windows, does the best fit.
This could be fixed in Rgui by calling a special Windows API function,
but with the mentioned risk that it would break existing uses that
depend on the existing behavior.

This is the only place I know of where removing best fit would lead to
correct representation of UTF-8 characters. In other places it would
give NA or some other escapes, or the code would fail to parse (e.g.
"incomplete string", which one can get easily with source()).

Tomas
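
The escaping trick described in the message above can be imitated in plain R. This is a minimal sketch, not Rgui's actual implementation: the function name `escape_non_ascii` is hypothetical, and unlike Rgui it escapes every non-ASCII character rather than only those the native encoding cannot represent. It also assumes that non-ASCII characters occur only inside string literals, where the parser accepts \uxxxx escapes.

```r
# Hypothetical helper: rewrite non-ASCII characters in a line of R code
# as \uxxxx (or \Uxxxxxxxx) escapes, so the text handed to the parser is
# pure ASCII and is unaffected by native-encoding conversion.
escape_non_ascii <- function(code) {
  cps <- utf8ToInt(enc2utf8(code))           # Unicode code points
  chars <- intToUtf8(cps, multiple = TRUE)   # one string per character
  esc <- cps > 127L
  if (any(esc)) {
    # Code points above U+FFFF need the 8-digit \U form.
    fmt <- ifelse(cps[esc] > 0xFFFF, "\\U%08x", "\\u%04x")
    chars[esc] <- sprintf(fmt, cps[esc])
  }
  paste(chars, collapse = "")
}

line <- "x <- \"\u0159\""      # code with U+0159 inside a string literal
safe <- escape_non_ascii(line)
cat(safe, "\n")                # the escaped, ASCII-only source text
eval(parse(text = safe))       # the parser rebuilds the character
utf8ToInt(x) == 0x159
```

As noted in the message, this only gets the character safely to the parser; the resulting string can still be converted to the native encoding later, when the program runs.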
