Tomas Kalibera
2019-Apr-11 06:25 UTC
[Rd] R 3.5.3 and 3.6.0 alpha Windows bug: UTF-8 characters in code are simplified to wrong ones
On 4/10/19 6:32 PM, Jeroen Ooms wrote:
> On Wed, Apr 10, 2019 at 5:45 PM Duncan Murdoch <murdoch.duncan at gmail.com> wrote:
>> On 10/04/2019 10:29 a.m., Yihui Xie wrote:
>>> Since it is "technically easy" to disable the best fit conversion and
>>> the best fit is rarely good, how about providing an option for
>>> code/package authors to disable it? I'm asking because this is one of
>>> the most painful issues in packages that may need to source() code
>>> containing UTF-8 characters that are not representable in the Windows
>>> native encoding. Examples include knitr/rmarkdown and shiny. Basically
>>> users won't be able to knit documents or run Shiny apps correctly when
>>> the code contains characters that cannot be represented in the native
>>> encoding.
>> Wouldn't things be worse with it disabled than currently? I'd expect
>> the line containing the "?" to end up as NA instead of converting to "r".
> I don't think it would be worse, because in this case R would not
> implicitly convert strings to (best fit) latin1 on Windows, but
> instead keep the (correct) string in its UTF-8 encoding. The NA only
> appears if the user explicitly forces a conversion to latin1, which is
> not the problem here I think.
>
> The original problem that I can reproduce in RGui is that if you enter
> "?" in RGui, R opportunistically converts this to latin1, because it
> can. However if you enter text which can definitely not be represented
> in latin1, R encodes the string correctly in UTF-8 form.

Rgui is a "Windows Unicode" application (it uses UTF-16LE), but it needs to convert the input to the native encoding before passing it to R, which is based on locales. However, R passes that string on to the parser, and Rgui takes advantage of this: it converts non-representable characters to their \uxxxx escapes, which the parser understands. Using this trick, Unicode characters can get from Rgui to the parser (but of course they are then still at risk of conversion later, when the program runs).

Rgui only escapes characters that cannot be represented. Unfortunately, the standard C99 API for that conversion, as implemented on Windows, does the best fit. This could be fixed in Rgui by calling a special Windows API function instead, but with the mentioned risk that it would break existing uses that rely on the current behavior.

This is the only place I know of where removing best fit would lead to a correct representation of UTF-8 characters. Other places will give NA or some other escapes, or the code will fail to parse (e.g. "incomplete string"; one can get that easily with source()).

Tomas
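The escaping trick described above can be sketched in R itself. This is only an illustration under stated assumptions, not Rgui's actual implementation (which lives in C): the helper name `escape_unrepresentable` is hypothetical, "latin1" stands in for the Windows native encoding, and U+0159 is just a sample character that latin1 cannot represent. It also assumes that `iconv()` returns NA for a character with no representation in the target encoding, which can depend on the platform's iconv implementation.

```r
# Hypothetical helper (not part of R or Rgui): replace every character that
# cannot be converted to the target encoding with a \uXXXX escape, so that
# the result survives a conversion to that encoding and is still understood
# by the R parser.
escape_unrepresentable <- function(x, to = "latin1") {
  stopifnot(is.character(x), length(x) == 1L)
  # Split into single characters via their Unicode code points.
  chars <- intToUtf8(utf8ToInt(enc2utf8(x)), multiple = TRUE)
  out <- vapply(chars, function(ch) {
    # Assumption: iconv() yields NA when 'ch' has no representation in 'to'.
    if (!is.na(iconv(ch, from = "UTF-8", to = to))) return(ch)
    cp <- utf8ToInt(ch)
    # Code points above U+FFFF need the long \U form.
    if (cp > 0xFFFF) sprintf("\\U%08X", cp) else sprintf("\\u%04x", cp)
  }, character(1), USE.NAMES = FALSE)
  paste(out, collapse = "")
}

# Usage: the escaped text contains only representable characters, and the
# parser turns the escape back into the original character.
escaped <- escape_unrepresentable("Bo\u0159il")
parsed <- eval(parse(text = paste0('"', escaped, '"')))
```

The point of the sketch is the decision per character: escape only when an exact conversion fails. With best fit in the conversion step, that check never fails for characters that have an approximate substitute, so they are silently replaced instead of escaped.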
Tomáš Bořil
2019-Apr-11 06:53 UTC
[Rd] R 3.5.3 and 3.6.0 alpha Windows bug: UTF-8 characters in code are simplified to wrong ones
For me, this would be a perfect solution. I.e., do not use the "best" fit and leave it to the user's competence:
a) in some functions, UTF-8 works
b) in others, an error is thrown (e.g., incomplete string, NA, etc.)
=> the user has to change the code, either with his/her own intentional "best fit" string literal substitute or by using another function that can handle UTF-8.

Making R code that works right only on some platforms, or trying to keep backward compatibility in the sense of "the code does not do what you want and the behaviour differs depending on each and every locale, but at least it does not throw an error", is generally not a good idea: it is dangerous. Users / coders should know that there is something wrong with their strings and that some characters are "eaten alive".

Tomas
Tomáš Bořil
2019-Apr-11 07:10 UTC
[Rd] R 3.5.3 and 3.6.0 alpha Windows bug: UTF-8 characters in code are simplified to wrong ones
Or, if this cannot be done easily, please disable the "utf-8" value in the source() function in R on Windows:

source(..., encoding = "utf-8")
-> error: "utf-8" does not work right on Windows.
-> (or, at least) warning: "utf-8" is handled by "best fit" on Windows and some characters in string literals may be automatically changed.

Because, in its current state, the UTF-8 encoding of R source files on Windows is a fake Unicode, as in reality it can handle only 256 different ANSI characters.

Thanks,
Tomas
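Until something like the warning proposed above exists in source() itself, a user-level check is possible. The following is a minimal sketch, not an existing R facility: the names `unrepresentable_chars` and `source_checked` are hypothetical, "latin1" is an assumed stand-in for the native encoding (passing `to = ""` makes `iconv()` use the current locale's encoding instead), and it relies on `iconv()` returning NA for characters that cannot be converted.

```r
# Hypothetical helper: list the distinct characters in a UTF-8 file that the
# target encoding cannot represent.
unrepresentable_chars <- function(file, to = "latin1") {
  lines <- readLines(file, encoding = "UTF-8", warn = FALSE)
  cps <- unique(unlist(lapply(lines, utf8ToInt)))
  # Drop invalid input (NA) and plain ASCII, which is always representable.
  cps <- cps[!is.na(cps) & cps > 127L]
  chars <- intToUtf8(cps, multiple = TRUE)
  # Assumption: iconv() yields NA for characters with no representation.
  chars[is.na(iconv(chars, from = "UTF-8", to = to))]
}

# Hypothetical wrapper around source(): warn before sourcing a file whose
# string literals might be altered or broken by a later conversion.
source_checked <- function(file, to = "latin1", ...) {
  bad <- unrepresentable_chars(file, to)
  if (length(bad) > 0L) {
    warning(sprintf(
      "%s contains %d distinct character(s) not representable in '%s'",
      file, length(bad), to
    ), call. = FALSE)
  }
  source(file, encoding = "UTF-8", ...)
}

# Usage: write a small UTF-8 script containing U+0159 and check it.
tmp <- tempfile(fileext = ".R")
con <- file(tmp, open = "wb")
writeLines(enc2utf8("x <- \"Bo\u0159il\""), con, useBytes = TRUE)
close(con)
bad <- unrepresentable_chars(tmp)
```

This only makes the problem visible to the user; it does not change how R converts the strings afterwards.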