Tomas Kalibera
2019-Apr-10 11:26 UTC
[Rd] R 3.5.3 and 3.6.0 alpha Windows bug: UTF-8 characters in code are simplified to wrong ones
On 4/10/19 1:14 PM, Jeroen Ooms wrote:
> On Wed, Apr 10, 2019 at 12:19 PM Tomáš Bořil <borilt at gmail.com> wrote:
>> Minimalistic example:
>> Let's type "ř" (LATIN SMALL LETTER R WITH CARON) in RGui console:
>>> "ř"
>> [1] "r"
>>
>> Although the script is in UTF-8, the characters are replaced by
>> "simplified" substitutes uncontrollably (depending on OS locale). The
>> same goes with simply entering the code statements in R Console.
>>
>> The problem does not occur on OS with UTF-8 locale (Mac OS, Linux...)
> I think this is a "feature" of win_iconv that is bundled with base R
> on Windows (./src/extra/win_iconv). The character from your example is
> not part of the latin1 (iso-8859-1) set, however, win_iconv seems to
> convert it anyway:
>
> > x <- "\U00159"
> > print(x)
> [1] "ř"
> > iconv(x, 'UTF-8', 'iso-8859-1')
> [1] "r"
>
> On MacOS, iconv tells us this character cannot be represented as latin1:
>
> > x <- "\U00159"
> > print(x)
> [1] "ř"
> > iconv(x, 'UTF-8', 'iso-8859-1')
> [1] NA
>
> I'm actually not sure why base-R needs win_iconv (but I'm not an
> encoding expert at all). Perhaps we could try to unbundle it and use
> the standard libiconv provided by the Rtools toolchain bundle to get
> more consistent results.

win_iconv just calls into the Windows API to do the conversion. It is
technically easy to disable the "best fit" conversion, but I don't think
that would be a good idea. In some cases, perhaps rare, the best fit is
good, actually including the conversion from "ř" to "r", which makes
perfect sense. But more importantly, changing the behavior could affect
users who expect the substitution to happen because it has been happening
for many years, and it won't help others much.

Tomas

> ______________________________________________
> R-devel at r-project.org mailing list
> https://stat.ethz.ch/mailman/listinfo/r-devel
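The two transcripts quoted above differ only in what iconv() returns, so the effect can be checked without looking at printed output. A minimal sketch, assuming nothing beyond base R's iconv() and enc2utf8(); the helper name `survives_conversion` is made up for illustration and is not part of R:

```r
# Hypothetical helper (not part of base R): TRUE if `x` comes back unchanged
# after a round trip from UTF-8 through the encoding `to` and back.
survives_conversion <- function(x, to = "latin1") {
  x <- enc2utf8(x)
  converted <- iconv(x, from = "UTF-8", to = to)
  back <- iconv(converted, from = to, to = "UTF-8")
  # NA means iconv() refused the conversion; a different string means a
  # substitute character (for example a best-fit "r") was used instead.
  !is.na(back) & back == x
}

survives_conversion("\u0159")  # LATIN SMALL LETTER R WITH CARON
survives_conversion("\u00e9")  # LATIN SMALL LETTER E WITH ACUTE, part of latin1
```

Comparing the round-tripped string with the input treats both outcomes shown in the thread (NA, or a silently substituted "r") as a failed conversion, so the check should not depend on which iconv implementation is in use.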
Yihui Xie
2019-Apr-10 14:29 UTC
[Rd] R 3.5.3 and 3.6.0 alpha Windows bug: UTF-8 characters in code are simplified to wrong ones
Since it is "technically easy" to disable the best fit conversion and
the best fit is rarely good, how about providing an option for
code/package authors to disable it?

I'm asking because this is one of the most painful issues in packages
that may need to source() code containing UTF-8 characters that are not
representable in the Windows native encoding. Examples include
knitr/rmarkdown and shiny. Basically users won't be able to knit
documents or run Shiny apps correctly when the code contains characters
that cannot be represented in the native encoding.

Regards,
Yihui
--
https://yihui.name
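For packages that source() user code, one way to see the problem coming is to scan the script for lines that would not survive conversion to the target encoding. This is only a sketch under stated assumptions: `lines_at_risk` is a hypothetical helper, not something knitr, rmarkdown or shiny provide, and it assumes the script file is stored as UTF-8.

```r
# Hypothetical helper: line numbers of a UTF-8 script that would be altered
# (or lost) if converted to the encoding `to`. The default "" is what iconv()
# treats as the encoding of the current locale.
lines_at_risk <- function(path, to = "") {
  lines <- readLines(path, encoding = "UTF-8", warn = FALSE)
  converted <- iconv(lines, from = "UTF-8", to = to)
  back <- iconv(converted, from = to, to = "UTF-8")
  # A line is at risk if conversion failed (NA) or changed its content.
  which(is.na(back) | back != lines)
}

# Demonstration with a throwaway script; latin1 is chosen explicitly so the
# result does not depend on the locale of the machine running this.
script <- tempfile(fileext = ".R")
writeLines(enc2utf8(c("x <- 1", "y <- \"\u0159\"")), script, useBytes = TRUE)
lines_at_risk(script, to = "latin1")
```

The check only reports which lines are affected. It does not change how source() reads the file, so it is a diagnostic rather than a fix for the behaviour discussed here.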
Duncan Murdoch
2019-Apr-10 15:45 UTC
[Rd] R 3.5.3 and 3.6.0 alpha Windows bug: UTF-8 characters in code are simplified to wrong ones
On 10/04/2019 10:29 a.m., Yihui Xie wrote:
> Since it is "technically easy" to disable the best fit conversion and
> the best fit is rarely good, how about providing an option for
> code/package authors to disable it? I'm asking because this is one of
> the most painful issues in packages that may need to source() code
> containing UTF-8 characters that are not representable in the Windows
> native encoding. Examples include knitr/rmarkdown and shiny. Basically
> users won't be able to knit documents or run Shiny apps correctly when
> the code contains characters that cannot be represented in the native
> encoding.

Wouldn't things be worse with it disabled than currently? I'd expect
the line containing the "ř" to end up as NA instead of converting to "r".

Of course, it would be best to be able to declare source files as UTF-8
and avoid any conversion at all, but as Tomas said, that's a lot harder.

Duncan Murdoch
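The NA outcome mentioned above, and one existing alternative to it, can be seen with iconv()'s `sub` argument. This is a sketch of current base R behaviour only, not of any proposed option; what the strict call returns depends on the iconv implementation, as the transcripts earlier in the thread show.

```r
x <- "\u0159"  # LATIN SMALL LETTER R WITH CARON

# Strict conversion: where iconv() cannot represent a character, the whole
# element becomes NA. (With the best-fit behaviour discussed in this thread,
# Windows builds have returned "r" here instead.)
strict <- iconv(x, from = "UTF-8", to = "latin1")

# With `sub = "byte"`, input bytes that cannot be converted are replaced by
# a visible marker instead of turning the whole string into NA.
marked <- iconv(x, from = "UTF-8", to = "latin1", sub = "byte")

is.na(strict)
marked
```

Neither result preserves the original character, which is consistent with the point that avoiding the conversion altogether would be the better outcome.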