thr3ads.net - R devel - [Rd] R 3.5.3 and 3.6.0 alpha Windows bug: UTF-8 characters in code are simplified to wrong ones [Apr 2019]


Tomáš Bořil

2019-Apr-11 07:10 UTC

[Rd] R 3.5.3 and 3.6.0 alpha Windows bug: UTF-8 characters in code are simplified to wrong ones

Or, if this cannot be done easily, please disable the "utf-8" value
of the encoding argument in the source() function in R on Windows:
source(..., encoding = "utf-8")
-> error: "utf-8" does not work right on Windows.
-> (or, at least) warning: "utf-8" is handled by "best fit" on Windows
and some characters in string literals may be automatically changed.

Because, in its current state, the UTF-8 encoding of R source files on
Windows is a fake Unicode, as in reality it can handle only 256 different
ANSI characters.
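
A check along the lines of the requested warning can be approximated in user code. The following is a minimal sketch, not something R itself provides: `not_representable()` and `source_utf8_checked()` are hypothetical helper names, and the check assumes that a string is representable in an encoding exactly when it survives a round trip through that encoding unchanged. Under that assumption it flags both lines whose conversion fails outright and lines that a best-fit conversion would silently alter.

```r
# Hypothetical helper: TRUE for each string that does not survive the round
# trip UTF-8 -> `to` -> UTF-8 unchanged. to = "" means the native encoding.
not_representable <- function(x, to = "") {
  x <- enc2utf8(x)
  there <- iconv(x, from = "UTF-8", to = to)
  back <- iconv(there, from = to, to = "UTF-8")
  # A failed conversion gives NA; a best-fit conversion gives a different string.
  is.na(back) | back != x
}

# Hypothetical wrapper around source(): warn about affected lines, then
# source the file as usual.
source_utf8_checked <- function(file, ...) {
  lines <- readLines(file, encoding = "UTF-8", warn = FALSE)
  bad <- which(not_representable(lines))
  if (length(bad) > 0L) {
    warning("lines not representable in the native encoding: ",
            paste(bad, collapse = ", "))
  }
  source(file, encoding = "UTF-8", ...)
}
```

In a session whose native encoding is UTF-8 the wrapper should never warn, because every line survives the round trip.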

Thanks,
Tomas


On Thu, Apr 11, 2019 at 8:53 AM Tomáš Bořil <borilt at gmail.com> wrote:
>
> For me, this would be a perfect solution.
>
> I.e., do not use the "best" fit and leave it to the user's competence:
> a) in some functions, utf-8 works
> b) in others -> an error is thrown (e.g., incomplete string, NA, etc.)
> => the user has to change the code with his/her intentional "best fit
> string literal substitute" or use another function that can handle utf-8.
>
> Making R code work right only on some platforms / trying to keep
> backward compatibility meaning "the code does not do what you want and the
> behaviour differs depending on each and every locale, but at least it does
> not throw an error" is generally not a good idea - it is dangerous. Users /
> coders should know that there is something wrong with their strings and
> that some characters are "eaten alive".
>
> Tomas
>
> On Thu, Apr 11, 2019 at 8:26, Tomas Kalibera <tomas.kalibera at gmail.com> wrote:
>>
>> On 4/10/19 6:32 PM, Jeroen Ooms wrote:
>> > On Wed, Apr 10, 2019 at 5:45 PM Duncan Murdoch <murdoch.duncan at gmail.com> wrote:
>> >> On 10/04/2019 10:29 a.m., Yihui Xie wrote:
>> >>> Since it is "technically easy" to disable the best fit conversion and
>> >>> the best fit is rarely good, how about providing an option for
>> >>> code/package authors to disable it? I'm asking because this is one of
>> >>> the most painful issues in packages that may need to source() code
>> >>> containing UTF-8 characters that are not representable in the Windows
>> >>> native encoding. Examples include knitr/rmarkdown and shiny. Basically
>> >>> users won't be able to knit documents or run Shiny apps correctly when
>> >>> the code contains characters that cannot be represented in the native
>> >>> encoding.
>> >> Wouldn't things be worse with it disabled than currently? I'd expect
>> >> the line containing the "?" to end up as NA instead of converting to "r".
>> > I don't think it would be worse, because in this case R would not
>> > implicitly convert strings to (best fit) latin1 on Windows, but
>> > instead keep the (correct) string in its UTF-8 encoding. The NA only
>> > appears if the user explicitly forces a conversion to latin1, which is
>> > not the problem here I think.
>> >
>> > The original problem that I can reproduce in RGui is that if you enter
>> >   "?" in RGui, R opportunistically converts this to latin1, because it
>> > can. However if you enter text which can definitely not be represented
>> > in latin1, R encodes the string correctly in UTF-8 form.
>>
>> Rgui is a "Windows Unicode" application (uses UTF16-LE) but it needs to
>> convert the input to native encoding before passing it to R, which is
>> based on locales. However, that string is passed by R to the parser,
>> which Rgui takes advantage of and converts non-representable characters
>> to their \uxxxx escapes which are understood by the parser. Using this
>> trick, Unicode characters can get to the parser from Rgui (but of course
>> then still in risk of conversion later when the program runs). Rgui only
>> escapes characters that cannot be represented, unfortunately, the
>> standard C99 API for that implemented on Windows does the best fit. This
>> could be fixed in Rgui by calling a special Windows API function and
>> could be done, but with the mentioned risk that it would break existing
>> uses that capture the existing behavior.
>>
>> This is the only place I know of where removing best fit would lead to
>> correct representation of UTF-8 characters. Other places will give NA,
>> some other escapes, code will fail to parse (e.g. "incomplete string",
>> one can get that easily with source()).
>>
>> Tomas
>>
>> ______________________________________________
>> R-devel at r-project.org mailing list
>> https://stat.ethz.ch/mailman/listinfo/r-devel
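
The \uxxxx escaping that Kalibera describes for Rgui can be imitated by hand for the contents of string literals. The sketch below is an illustration only: `escape_non_ascii()` is a hypothetical helper, and unlike Rgui (which, per the message above, escapes only characters that cannot be represented natively) it escapes every non-ASCII character. The escaped text is plain ASCII, so it reaches the parser unchanged whatever the native encoding is; as noted in the quoted message, the resulting string may still be converted later when the program runs.

```r
# Hypothetical helper: replace every non-ASCII character in each string by
# its \uxxxx (or \Uxxxxxxxx) escape, leaving ASCII characters untouched.
escape_non_ascii <- function(x) {
  escape_one <- function(s) {
    if (!nzchar(s)) return("")
    cp <- utf8ToInt(s)                     # Unicode code points of the string
    piece <- ifelse(
      cp < 128L,
      intToUtf8(cp, multiple = TRUE),      # ASCII: keep the character itself
      ifelse(cp <= 0xFFFF,
             sprintf("\\u%04x", cp),       # Basic Multilingual Plane
             sprintf("\\U%08x", cp))       # code points above U+FFFF
    )
    paste(piece, collapse = "")
  }
  vapply(enc2utf8(x), escape_one, character(1), USE.NAMES = FALSE)
}

# Build an all-ASCII string literal and let the parser decode the escapes.
literal <- paste0('"', escape_non_ascii("Tom\u00e1\u0161 Bo\u0159il"), '"')
value <- eval(parse(text = literal))
```

Here `literal` contains only ASCII characters, and `value` is the string the parser reconstructs from the escapes.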

Tomas Kalibera

2019-Apr-11 07:54 UTC

On 4/11/19 9:10 AM, Tomáš Bořil wrote:
> Or, if this cannot be done easily, please disable the "utf-8" value
> of the encoding argument in the source() function in R on Windows:
> source(..., encoding = "utf-8")
> -> error: "utf-8" does not work right on Windows.
> -> (or, at least) warning: "utf-8" is handled by "best fit" on Windows
> and some characters in string literals may be automatically changed.
>
> Because, in its current state, the UTF-8 encoding of R source files on
> Windows is a fake Unicode, as in reality it can handle only 256 different
> ANSI characters.

This is not a fair statement. source(, encoding="UTF-8") works as
documented: it translates from (full) UTF-8 to the current native
encoding. I believe the authors who made these design decisions over a
decade ago, under different circumstances, and who carefully implemented,
tested, and documented the code for you to use for free, deserve to be
addressed with some respect. It is not their responsibility to read the
documentation for you, and if you had read and understood it, you would
not have used source(, encoding="UTF-8") with characters not representable
in the current native encoding on Windows. The authors should not be
blamed because the design does not seem perfect _today_ for _today's_
systems (and how could they have guessed at that time that Windows would
still not support UTF-8 as a native encoding today?).
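
Whether a given session is affected depends on its native encoding, which can be inspected from R. The snippet below is a small sketch using base functions only; the variable names are illustrative, and the results depend on the platform and locale, so no expected output is shown.

```r
# Is the session's native encoding UTF-8? l10n_info() reports this per session.
info <- l10n_info()
native_is_utf8 <- isTRUE(info[["UTF-8"]])

# A character written as an escape: U+0159, LATIN SMALL LETTER R WITH CARON.
x <- "\u0159"

# How R has marked the string. In a UTF-8 locale this is "UTF-8"; in other
# locales the parser may store the string natively when it is representable.
marked <- Encoding(x)

# enc2native() converts a string to the native encoding. What a character
# that is not representable there turns into is platform dependent.
native <- enc2native(x)
```

If `native_is_utf8` is `TRUE`, translating UTF-8 input to the native encoding cannot lose characters, so the problem discussed in this thread does not arise in that session.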

Tomas

Tomáš Bořil

2019-Apr-11 08:10 UTC

I do not blame anybody and I have huge respect for all the authors of
R. Actually, I like R very much and I would like to thank everyone
who contributes to it. I use R regularly in my work (I moved from Java,
C# and Matlab), I have created the package rPraat for phonetic analyses,
and I think R is a very well designed language which will survive for
decades. I am trying to bring new users (my students at a non-technical
university) to use programming for their everyday problems
(statistics, phonetic analyses, text processing) and they enjoy R. I
am really positive about this (it is hard to express emotions in e-mails
without using emoticons in every sentence). And that is why I would
like to make it even better.

I only suggest adding one line of code (metaphorically) to the source()
function in R for Windows to make it even better and to warn all users
who do not read the whole documentation of each function thoroughly and
carefully.

Tomas


