Hi Tomas,
Thank you very much for the detailed explanation! I think I now have a
bit better understanding of how things work; at least now I know I
didn't understand the concept of "active code page". I'll follow your
advice when I need to fix the packages that need some tweaks to handle
UTF-8 properly.
Sorry, I'd like to ask one more question related to locale. If I copy
the following text and execute `read.csv("clipboard")`, it returns
"uao" instead of "???" (the characters are transliterated).
"col1","col2"
"???","???"
While this is probably the status quo (the same behavior as R 4.1) for
Latin-1 encodings, things are worse in CJK locales. If I try,
"col1","col2"
"あ","い"
I get the following error:
> read.csv("clipboard")
Error in type.convert.default(data[[i]], as.is = as.is[i], dec = dec, :
invalid multibyte string at '<82><a0>'
Is this supposed to work? It seems the characters are encoded as CP932
(my system locale) but marked as UTF-8.
> x <- utils:::readClipboard()
> x
[1] "\"col1\",\"col2\""          "\"\x82\xa0\",\"\x82\xa2\""
> iconv(x, from = "CP932", to = "UTF-8")
[1] "\"col1\",\"col2\""          "\"あ\",\"い\""
I read the source code of readClipboard() in
src/library/utils/src/windows/util.c, but have no idea if there's
anything that needs to be fixed.
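(For now, converting the raw clipboard text before parsing seems to work
as a rough workaround, assuming the clipboard content really is CP932:

csv <- iconv(utils:::readClipboard(), from = "CP932", to = "UTF-8")
read.csv(text = csv)

but I wonder whether read.csv("clipboard") itself should handle this.)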
Best,
Yutani
On Tue, Dec 21, 2021 at 17:26, Tomas Kalibera <tomas.kalibera at gmail.com> wrote:
>
> Hi Yutani,
>
> On 12/21/21 6:34 AM, Hiroaki Yutani wrote:
> > Hi,
> >
> > I'm more than excited about the announcement about the upcoming UTF-8
> > R on Windows. Let me confirm my understanding. Is R 4.2 supposed to
> > work on Windows with non-UTF-8 encoding as the system locale? I think
> > this blog post indicates so (as this describes the older Windows than
> > the UTF-8 era), but I'm not fully confident if I understand the
> > details correctly.
>
> R 4.2 will automatically use UTF-8 as the active code page (system
> locale) and the C library encoding and the R current native encoding on
> systems which allow this (recent Windows 10 and newer, Windows Server
> 2022, etc.). There is no way to opt out of that, and of course no
> reason to, either. It does not matter what the system locale is set to
> in Windows for the whole system - these recent Windows versions allow
> individual applications to override the system-wide setting to UTF-8, which is what
> R does. Typically the system-wide setting will not be UTF-8, because
> many applications will not work with that.
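>
> You can verify from within R which mode a given session ended up in,
> for example (just a quick check from the R console):
>
>     l10n_info()      # on a UTF-8 build the "UTF-8" element is TRUE
>     Sys.getlocale()  # shows the C library locale actually in use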
>
> On older systems, R 4.2 will run in some other system locale and the
> same C library encoding and R current native encoding - the same system
> default that R 4.1 would use on that system. So for some time, encoding
> support for this in R will have to stay, but eventually it will be removed.
> But yes, R 4.2 is still supposed to work on such systems.
>
> >
> > https://developer.r-project.org/Blog/public/2021/12/07/upcoming-changes-in-r-4.2-on-windows/index.html
> >
> > If so, I'm curious what the package authors should do when the locales
> > are different between OS and R. For example (disclaimer: I don't
> > intend to blame processx at all. Just for an example), the CRAN check
> > on the processx package currently fails with this warning on R-devel
> > Windows.
> >
> >> 1. UTF-8 in stdout (test-utf8.R:85:3) - Invalid multi-byte character at end of stream ignored
> > https://cran.r-project.org/web/checks/check_results_processx.html
> >
> > As far as I know, processx launches an external process and captures
> > its output, and I suspect the problem is that the output of the
> > process is encoded in non-UTF-8 while R assumes it's UTF-8. I
> > experienced similar problems with other packages as well, which
> > disappear if I switch the locale to the same one as the OS by
> > Sys.setlocale(). So, I think it would be great if there's some
> > guidance for the package authors on how to handle these properly.
>
> Incidentally I've debugged this case and sent a detailed analysis to the
> maintainer, so he knows about the problem.
>
> In short, you cannot assume on Windows that different applications use
> the same system encoding. That has not been true at least since the
> introduction of fusion manifests, which allow an application to switch
> to UTF-8 as its system encoding, as R does. So, when using an external application on
> Windows, you need to know and respect a specific encoding used by that
> application on input and output.
>
> As an example based on processx, say you have an application which prints
> its arguments to standard output. If you do it this way:
>
> $ cat pr.c
> #include <stdio.h>
> #include <locale.h>
> #include <string.h>
> int main(int argc, char **argv) {
>
>     printf("Locale set to: %s\n", setlocale(LC_ALL, ""));
>     int i;
>     for(i = 0; i < argc; i++) {
>         printf("Argument %d\n", i);
>         printf("%s\n", argv[i]);
>         for(int j = 0; j < strlen(argv[i]); j++) {
>             printf("byte[%d] is %x (%d)\n", j,
>                    (unsigned char)argv[i][j], (unsigned char)argv[i][j]);
>         }
>     }
>     return 0;
> }
>
> the argument and hence output will be in the current native encoding of
> pr.c, because that's the encoding in which the argument will be received
> from Windows, so by default the system locale encoding, so by default
> not UTF-8 (on my system in Latin-1, as well as on CRAN check systems).
> One should also only use such programs with characters representable in
> Latin-1 on such systems. When you call such an application from R with
> UTF-8 as the native encoding, Windows will automatically convert the
> arguments to Latin-1.
>
> The old Windows way to avoid this problem is to use the wide-character
> API (now UTF-16LE):
>
> $ cat prw.c
> #include <stdio.h>
> #include <locale.h>
> #include <string.h>
>
> int wmain(int argc, wchar_t **argv) {
>
>     int i;
>     for(i = 0; i < argc; i++) {
>         wprintf(L"Argument %d\n", i);
>         wprintf(argv[i]);
>         wprintf(L"\n");
>         for(int j = 0; j < wcslen(argv[i]); j++)
>             wprintf(L"Word[%d] %x\n", j, (unsigned)argv[i][j]);
>     }
>     return 0;
> }
>
> When you call such a program from R with UTF-8 as the native encoding, Windows
> will convert the arguments to UTF-16LE (so all characters will be
> representable). But you need to write Windows-specific code for this.
>
> The new Windows way to avoid this problem is to use UTF-8 as the native
> encoding via the fusion manifest, as R does. You can use the "pr.c" as
> above, but with something like
>
> $ cat pr.rc
> #include <windows.h>
> CREATEPROCESS_MANIFEST_RESOURCE_ID RT_MANIFEST "pr.manifest"
>
> $ cat pr.manifest
> <?xml version="1.0" encoding="UTF-8" standalone="yes"?>
> <assembly xmlns="urn:schemas-microsoft-com:asm.v1" manifestVersion="1.0">
>   <assemblyIdentity
>     version="1.0.0.0"
>     processorArchitecture="amd64"
>     name="pr.exe"
>     type="win32"
>   />
>   <application>
>     <windowsSettings>
>       <activeCodePage xmlns="http://schemas.microsoft.com/SMI/2019/WindowsSettings">UTF-8</activeCodePage>
>     </windowsSettings>
>   </application>
> </assembly>
>
> $ windres.exe -i pr.rc -o pr_rc.o
> $ gcc -o pr pr.c pr_rc.o
>
> When you build the application this way, it will use UTF-8 as native
> encoding, so when you call it from R (with UTF-8 as the native encoding), no
> input conversion will occur. However, when you do this, the output from
> the application will also be in UTF-8.
>
> So, for applications you control, my recommendation would be to make
> them use Unicode in one of these two ways. Preferably the new one, with
> the fusion manifest. Only if it were a Windows-only application that had
> to work on older Windows would I suggest the wide-character version (but
> such apps are probably not in R packages).
>
> When working with external applications you don't control, it is harder
> - you need to know which encoding they are expecting and producing, in
> whatever interface you use, and convert that, e.g. using iconv(). By the
> interface I mean that e.g., the command-line arguments are converted by
> Windows, but the input/output sent over a file/stream will not be.
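>
> For example, capturing output from a program known to write in some
> specific code page could look like this (only a sketch; the program
> name, arguments and encoding are placeholders):
>
>     out <- system2("someapp", args = "--print", stdout = TRUE)
>     out <- iconv(out, from = "CP1252", to = "UTF-8")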
>
> Of course, this works the other way around as well. If you were using R
> with some other external applications expecting a different encoding,
> you would need to handle that (by conversions). With applications you
> control, it would make sense to use this opportunity to switch to UTF-8.
> But, in principle, you can use iconv() from R directly or indirectly to
> convert input/output streams to/from a known encoding.
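>
> For instance, to feed such an application a file in the encoding it
> expects, a connection can do the conversion on the way out (again only
> a sketch, with a made-up file name and encoding):
>
>     con <- file("input.txt", open = "w", encoding = "latin1")
>     writeLines(x, con)    # x in R's native encoding, written out as Latin-1
>     close(con)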
>
> I am happy to give more suggestions if there is interest, but for that
> it would be useful to have a specific example (with processx, it is
> clear what the options are; there the application is controlled by the
> package).
>
> Best
> Tomas
> >
> > Any suggestions?
> >
> > Best,
> > Yutani
> >