thr3ads.net - R devel - [Rd] R on Windows with UCRT and the system encoding [Dec 2021]

Hiroaki Yutani

2021-Dec-21 14:47 UTC

[Rd] R on Windows with UCRT and the system encoding

Hi Tomas,

Thank you very much for the detailed explanation! I think I now have a
somewhat better understanding of how things work; at least I now know that I
didn't understand the concept of "active code page". I'll follow your
advice when I need to fix packages that need some tweaks to handle
UTF-8 properly.

Sorry, I'd like to ask one more question related to locale. If I copy
the following text and execute `read.csv("clipboard")`, it returns
"uao" instead of "???" (the characters are transliterated).

    "col1","col2"
    "???","???"


While this is probably the status quo (the same behavior on R 4.1) on
Latin-1 encoding, things are worse on CJK locales. If I try,

    "col1","col2"
    "あ","い"

I get the following error:

    > read.csv("clipboard")
    Error in type.convert.default(data[[i]], as.is = as.is[i], dec = dec,  :
      invalid multibyte string at '<82><a0>'

Is this supposed to work? It seems the characters are encoded as CP932
(my system locale) but marked as UTF-8.

    > x <- utils:::readClipboard()
    > x
    [1] "\"col1\",\"col2\""         "\"\x82\xa0\",\"\x82\xa2\""
    > iconv(x, from = "CP932", to = "UTF-8")
    [1] "\"col1\",\"col2\"" "\"あ\",\"い\""

I read the source code of readClipboard() in
src/library/utils/src/windows/util.c, but have no idea if there's
anything that needs to be fixed.
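
To illustrate why bytes such as <82><a0> are rejected once they are treated as UTF-8, here is a minimal sketch of a UTF-8 well-formedness check in C. The function name is_valid_utf8 is made up for this illustration; it is not part of R, and R's own validation may differ in detail. The byte 0x82 has the bit pattern of a continuation byte, so it cannot start a character.

```c
#include <stddef.h>

/* Hypothetical helper: returns 1 if the n bytes at s are well-formed UTF-8,
   0 otherwise. Rejects overlong forms, surrogates, and values > U+10FFFF. */
int is_valid_utf8(const unsigned char *s, size_t n)
{
    size_t i = 0;
    while (i < n) {
        unsigned char b = s[i];
        size_t len;         /* total bytes in this sequence */
        unsigned long cp;   /* decoded code point */
        unsigned long min;  /* smallest code point allowed for this length */

        if (b < 0x80)                { len = 1; cp = b;        min = 0; }
        else if ((b & 0xE0) == 0xC0) { len = 2; cp = b & 0x1F; min = 0x80; }
        else if ((b & 0xF0) == 0xE0) { len = 3; cp = b & 0x0F; min = 0x800; }
        else if ((b & 0xF8) == 0xF0) { len = 4; cp = b & 0x07; min = 0x10000; }
        else return 0;      /* stray continuation byte (e.g. 0x82) or bad lead */

        if (len > n - i) return 0;              /* sequence is truncated */
        for (size_t k = 1; k < len; k++) {
            unsigned char c = s[i + k];
            if ((c & 0xC0) != 0x80) return 0;   /* not a continuation byte */
            cp = (cp << 6) | (c & 0x3F);
        }
        if (cp < min || cp > 0x10FFFF || (cp >= 0xD800 && cp <= 0xDFFF))
            return 0;
        i += len;
    }
    return 1;
}
```

Called on the two CP932 bytes 0x82 0xA0 the check fails, while the three UTF-8 bytes 0xE3 0x81 0x82 (the same character, U+3042) pass.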

Best,
Yutani

On Tue, Dec 21, 2021 at 17:26, Tomas Kalibera <tomas.kalibera at gmail.com> wrote:

>
> Hi Yutani,
>
> On 12/21/21 6:34 AM, Hiroaki Yutani wrote:
> > Hi,
> >
> > I'm more than excited about the announcement about the upcoming UTF-8
> > R on Windows. Let me confirm my understanding. Is R 4.2 supposed to
> > work on Windows with non-UTF-8 encoding as the system locale? I think
> > this blog post indicates so (as this describes the older Windows than
> > the UTF-8 era), but I'm not fully confident if I understand the
> > details correctly.
>
> R 4.2 will automatically use UTF-8 as the active code page (system
> locale), as the C library encoding, and as the R current native encoding
> on systems which allow this (recent Windows 10 and newer, Windows Server
> 2022, etc). There is no way to opt out of that, and of course no
> reason to, either. It does not matter what system locale is set
> in Windows for the whole system: these recent versions of Windows allow
> individual applications to override the system-wide setting with UTF-8,
> which is what R does. Typically the system-wide setting will not be UTF-8,
> because many applications will not work with that.
>
> On older systems, R 4.2 will run in some other system locale, with the
> matching C library encoding and R current native encoding, i.e. the same
> system default that R 4.1 would use on that system. So for some time,
> support for such encodings in R will have to stay, but it will eventually
> be removed. But yes, R 4.2 is still supposed to work on such systems.
>
> > https://developer.r-project.org/Blog/public/2021/12/07/upcoming-changes-in-r-4.2-on-windows/index.html
> >
> > If so, I'm curious what the package authors should do when the locales
> > are different between OS and R. For example (disclaimer: I don't
> > intend to blame processx at all. Just for an example), the CRAN check
> > on the processx package currently fails with this warning on R-devel
> > Windows.
> >
> >>      1. UTF-8 in stdout (test-utf8.R:85:3) - Invalid multi-byte character at end of stream ignored
> > https://cran.r-project.org/web/checks/check_results_processx.html
> >
> > As far as I know, processx launches an external process and captures
> > its output, and I suspect the problem is that the output of the
> > process is encoded in non-UTF-8 while R assumes it's UTF-8. I
> > experienced similar problems with other packages as well, which
> > disappear if I switch the locale to the same one as the OS by
> > Sys.setlocale(). So, I think it would be great if there's some
> > guidance for the package authors on how to handle these properly.
>
> Incidentally I've debugged this case and sent a detailed analysis to the
> maintainer, so he knows about the problem.
>
> In short, you cannot assume in Windows that different applications use
> the same system encoding. That has not been true at least since the
> introduction of fusion manifests, which allow an application to switch to
> UTF-8 as its system encoding, which R does. So, when using an external
> application on Windows, you need to know and respect the specific encoding
> used by that application on input and output.
>
> As an example based on processx, you have an application which prints
> its argument to standard output. If you do it this way:
>
> $ cat pr.c
> #include <stdio.h>
> #include <locale.h>
> #include <string.h>
>
> int main(int argc, char **argv) {
>
>          /* Select the default locale and report which one was chosen. */
>          printf("Locale set to: %s\n", setlocale(LC_ALL, ""));
>          for (int i = 0; i < argc; i++) {
>                  printf("Argument %d\n", i);
>                  printf("%s\n", argv[i]);
>                  /* Dump each byte of the argument in hex and decimal. */
>                  for (int j = 0; j < (int)strlen(argv[i]); j++) {
>                          printf("byte[%d] is %x (%d)\n", j,
>                                 (unsigned char)argv[i][j],
>                                 (unsigned char)argv[i][j]);
>                  }
>          }
>          return 0;
> }
>
> the argument and hence the output will be in the current native encoding of
> pr.c, because that's the encoding in which the argument will be received
> from Windows: by default the system locale encoding, so by default
> not UTF-8 (Latin-1 on my system, as well as on the CRAN check systems).
> One should also only use such programs with characters representable in
> Latin-1 on such systems. When you call such an application from R with
> UTF-8 as native encoding, Windows will automatically convert the
> arguments to Latin-1.
>
> The old Windows way to avoid this problem is to use the wide-character
> API (now UTF-16LE):
>
> $ cat prw.c
> #include <stdio.h>
> #include <locale.h>
> #include <string.h>
> #include <wchar.h>
>
> int wmain(int argc, wchar_t **argv) {
>
>          for (int i = 0; i < argc; i++) {
>                  wprintf(L"Argument %d\n", i);
>                  /* Pass the argument as data, never as the format string. */
>                  wprintf(L"%ls\n", argv[i]);
>                  /* Dump each UTF-16 code unit of the argument in hex. */
>                  for (int j = 0; j < (int)wcslen(argv[i]); j++)
>                          wprintf(L"Word[%d] %x\n", j, (unsigned)argv[i][j]);
>          }
>          return 0;
> }
>
> When you call such a program from R with UTF-8 as native encoding, Windows
> will convert the arguments to UTF-16LE (so all characters will be
> representable). But you need to write Windows-specific code for this.
>
> The new Windows way to avoid this problem is to use UTF-8 as the native
> encoding via the fusion manifest, as R does. You can use the "pr.c" as
> above, but with something like
>
> $ cat pr.rc
> #include <windows.h>
> CREATEPROCESS_MANIFEST_RESOURCE_ID RT_MANIFEST "pr.manifest"
>
> $ cat pr.manifest
> <?xml version="1.0" encoding="UTF-8" standalone="yes"?>
> <assembly xmlns="urn:schemas-microsoft-com:asm.v1" manifestVersion="1.0">
> <assemblyIdentity
>      version="1.0.0.0"
>      processorArchitecture="amd64"
>      name="pr.exe"
>      type="win32"
> />
> <application>
>    <windowsSettings>
>      <activeCodePage xmlns="http://schemas.microsoft.com/SMI/2019/WindowsSettings">UTF-8</activeCodePage>
>    </windowsSettings>
> </application>
> </assembly>
>
> windres.exe -i pr.rc -o pr_rc.o
> gcc -o pr pr.c pr_rc.o
>
> When you build the application this way, it will use UTF-8 as native
> encoding, so when you call it from R (with UTF-8 as native encoding), no
> input conversion will occur. However, when you do this, the output from
> the application will also be in UTF-8.
>
> So, for applications you control, my recommendation would be to make
> them use Unicode in one of these two ways, preferably the new one, with the
> fusion manifest. The wide-character version makes sense only for a
> Windows-only application that has to work on older Windows (but such apps
> are probably not in R packages).
>
> When working with external applications you don't control, it is harder
> - you need to know which encoding they are expecting and producing, in
> whatever interface you use, and convert that, e.g. using iconv(). By the
> interface I mean that e.g., the command-line arguments are converted by
> Windows, but the input/output sent over a file/stream will not be.
>
> Of course, this works the other way around as well. If you were using R
> with some other external applications expecting a different encoding,
> you would need to handle that (by conversions). With applications you
> control, it would make sense to use this opportunity to switch to UTF-8.
> But, in principle, you can use iconv() from R directly or indirectly to
> convert input/output streams to/from a known encoding.
>
> I am happy to give more suggestions if there is interest, but for that
> it would be useful to have a specific example (with processx, it is
> clear what the options are; there the application is controlled by the
> package).
>
> Best
> Tomas
> >
> > Any suggestions?
> >
> > Best,
> > Yutani
> >
> > ______________________________________________
> > R-devel at r-project.org mailing list
> > https://stat.ethz.ch/mailman/listinfo/r-devel
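
The advice above about converting the input/output of external applications from a known encoding can be illustrated for the simplest case. iconv() handles arbitrary encoding pairs; the sketch below only shows what such a conversion does for ISO-8859-1 (Latin-1) input. The helper name latin1_to_utf8 is hypothetical, and it assumes strict Latin-1: the Windows code page 1252 differs from it in the range 0x80-0x9F and would need a lookup table there.

```c
#include <stddef.h>

/* Hypothetical helper: converts n Latin-1 (ISO-8859-1) bytes to UTF-8.
   Returns the number of bytes written to out, or (size_t)-1 if the output
   buffer (capacity cap) is too small. No terminating NUL is added. */
size_t latin1_to_utf8(const unsigned char *in, size_t n,
                      unsigned char *out, size_t cap)
{
    size_t o = 0;
    for (size_t i = 0; i < n; i++) {
        unsigned char b = in[i];
        if (b < 0x80) {
            /* ASCII is encoded identically in both encodings. */
            if (cap - o < 1) return (size_t)-1;
            out[o++] = b;
        } else {
            /* U+0080..U+00FF become two bytes: 110000xx 10xxxxxx. */
            if (cap - o < 2) return (size_t)-1;
            out[o++] = (unsigned char)(0xC0 | (b >> 6));
            out[o++] = (unsigned char)(0x80 | (b & 0x3F));
        }
    }
    return o;
}
```

Every Latin-1 byte maps to exactly one Unicode code point with the same numeric value, which is why no table is needed here; other legacy encodings such as CP932 require one.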

Tomas Kalibera

2021-Dec-21 15:23 UTC

[Rd] R on Windows with UCRT and the system encoding

Hi Yutani,

On 12/21/21 3:47 PM, Hiroaki Yutani wrote:
> Hi Tomas,
>
> Thank you very much for the detailed explanation! I think now I have a
> bit better understanding on how the things work; at least now I know I
> didn't understand the concept of "active code page". I'll follow your
> advice when I need to fix the packages that need some tweaks to handle
> UTF-8 properly.
>
> Sorry, I'd like to ask one more question related to locale. If I copy
> the following text and execute `read.csv("clipboard")`, it returns
> "uao" instead of "???" (the characters are transliterated).
>
>      "col1","col2"
>      "???","???"
>
> While this is probably the status quo (the same behavior on R 4.1) on
> Latin-1 encoding, things are worse on CJK locales. If I try,
>
>      "col1","col2"
>      "あ","い"
>
> I get the following error:
>
>      > read.csv("clipboard")
>      Error in type.convert.default(data[[i]], as.is = as.is[i], dec = dec,  :
>        invalid multibyte string at '<82><a0>'
>
> Is this supposed to work? It seems the characters are encoded as CP932
> (my system locale) but marked as UTF-8.
>
>      > x <- utils:::readClipboard()
>      > x
>      [1] "\"col1\",\"col2\""         "\"\x82\xa0\",\"\x82\xa2\""
>      > iconv(x, from = "CP932", to = "UTF-8")
>      [1] "\"col1\",\"col2\"" "\"あ\",\"い\""
>
> I read the source code of readClipboard() in
> src/library/utils/src/windows/util.c, but have no idea if there's
> anything that needs to be fixed.

Yes, this should work. I can reproduce the problem on my system: the
clipboard apparently contains the Unicode characters, but R does not get
them correctly. From my quick read, it is a bug in R.

My guess is that this is in connections.c, where we call
GetClipboardData(CF_TEXT). Perhaps if we used CF_UNICODETEXT, it would
work (or alternatively CF_TEXT together with CF_LOCALE to find out which
locale is used, but CF_UNICODETEXT seems simpler). See
https://docs.microsoft.com/en-us/windows/win32/dataxchg/standard-clipboard-formats
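
If CF_UNICODETEXT were used, the clipboard text would arrive as UTF-16 and would have to be converted to UTF-8 before the rest of R sees it. Below is a minimal, portable sketch of that conversion step only. The helper utf16_to_utf8 is hypothetical and is not R's code; on Windows the same result could presumably be obtained with WideCharToMultiByte() and CP_UTF8 instead of hand-written code.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: converts n UTF-16 code units to UTF-8.
   Returns the number of bytes written to out, or (size_t)-1 on an unpaired
   surrogate or if out (capacity cap) is too small. No NUL is appended. */
size_t utf16_to_utf8(const uint16_t *in, size_t n,
                     unsigned char *out, size_t cap)
{
    size_t o = 0;
    for (size_t i = 0; i < n; i++) {
        uint32_t cp = in[i];
        if (cp >= 0xD800 && cp <= 0xDBFF) {         /* high surrogate */
            if (i + 1 >= n) return (size_t)-1;
            uint32_t lo = in[i + 1];
            if (lo < 0xDC00 || lo > 0xDFFF) return (size_t)-1;
            cp = 0x10000 + ((cp - 0xD800) << 10) + (lo - 0xDC00);
            i++;                                    /* consumed the pair */
        } else if (cp >= 0xDC00 && cp <= 0xDFFF) {  /* lone low surrogate */
            return (size_t)-1;
        }

        size_t need = cp < 0x80 ? 1 : cp < 0x800 ? 2 : cp < 0x10000 ? 3 : 4;
        if (need > cap - o) return (size_t)-1;

        if (need == 1) {
            out[o++] = (unsigned char)cp;
        } else if (need == 2) {
            out[o++] = (unsigned char)(0xC0 | (cp >> 6));
            out[o++] = (unsigned char)(0x80 | (cp & 0x3F));
        } else if (need == 3) {
            out[o++] = (unsigned char)(0xE0 | (cp >> 12));
            out[o++] = (unsigned char)(0x80 | ((cp >> 6) & 0x3F));
            out[o++] = (unsigned char)(0x80 | (cp & 0x3F));
        } else {
            out[o++] = (unsigned char)(0xF0 | (cp >> 18));
            out[o++] = (unsigned char)(0x80 | ((cp >> 12) & 0x3F));
            out[o++] = (unsigned char)(0x80 | ((cp >> 6) & 0x3F));
            out[o++] = (unsigned char)(0x80 | (cp & 0x3F));
        }
    }
    return o;
}
```

With this approach the result no longer depends on the system locale (CP932 in the report above), because UTF-16 can represent every character the clipboard holds.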

As you started looking at the code, would you like to try 
debugging/fixing this?

Best
Tomas
