Azure
2018-Mar-08 17:54 UTC
[Rd] [Bug report] Chinese characters are not handled correctly in Rterm for Windows
Hello everyone,

I am new to R and have run into some bugs when using Rterm on Windows. Chinese characters in the console output are discarded by Rterm, and trying to type them into the console crashes the Rterm application.

---ENVIRONMENT---
Platform = x86_64-w64-mingw32
OS = Windows 10 Pro 1709 chs
R version = 3.4.3
Active code page = 936 (Simplified Chinese)

---STEPS TO REPRODUCE---
1. Run cmd and start bin\x64\R.exe

2. Note that all Chinese characters in the startup banner are missing

3. > Sys.getlocale()
   [1] "LC_COLLATE=Chinese (Simplified)_China.936;LC_CTYPE=Chinese (Simplified)_China.936;LC_MONETARY=Chinese (Simplified)_China.936;LC_NUMERIC=C;LC_TIME=Chinese (Simplified)_China.936"

4. > print("ABC\u4f60\u597dDEF")
   [1] "ABCDEF"
   (Unicode code points for "你好")

5. Use Microsoft Pinyin IME to type "???" into the console. An error message appeared:
   > invalid multibyte character in mbcs_get_next
   Then the program crashed. My debugger reported a heap corruption, displayed as follows:
   0x00007FFE2F3687BB (ntdll.dll) (Rterm.exe ??)??????????????: 0xC0000374: ??????? (????: 0x00007FFE2F3CC6E0)??
   However, if the text is pasted into the console, it does not crash.

---ADDITIONAL INFO---
Both the 32-bit and 64-bit versions have the same problem.
I attached a debugger to observe Rterm's behavior. The command in step 4 produced the following sequence of calls to the C library function "fputc":

fputc ( 91, 0x00007ffe2d1aea40 ) //'['
fputc ( 49, 0x00007ffe2d1aea40 ) //'1'
fputc ( 93, 0x00007ffe2d1aea40 ) //']'
//fflush ( 0x00007ffe2d1aea40 )
fputc ( 32, 0x00007ffe2d1aea40 ) //' '
fputc ( 34, 0x00007ffe2d1aea40 ) //'\"'
fputc ( 65, 0x00007ffe2d1aea40 ) //'A'
fputc ( 66, 0x00007ffe2d1aea40 ) //'B'
fputc ( 67, 0x00007ffe2d1aea40 ) //'C'
fputc ( 196, 0x00007ffe2d1aea40 ) //FAILED!
fputc ( 227, 0x00007ffe2d1aea40 ) //FAILED!
fputc ( 186, 0x00007ffe2d1aea40 ) //FAILED!
fputc ( 195, 0x00007ffe2d1aea40 ) //FAILED!
fputc ( 68, 0x00007ffe2d1aea40 ) //'D'
fputc ( 69, 0x00007ffe2d1aea40 ) //'E'
fputc ( 70, 0x00007ffe2d1aea40 ) //'F'
fputc ( 34, 0x00007ffe2d1aea40 ) //'\"'
//fflush ( 0x00007ffe2d1aea40 )
fputc ( 10, 0x00007ffe2d1aea40 ) //'\n'

{196, 227, 186, 195}, or {C4 E3 BA C3} in hexadecimal, is the multi-byte encoding of "你好" in GBK (code page 936).
These four calls failed with Windows error code 28 (No space left on device), while the subsequent calls to fputc succeeded.

I then used C++ to implement a terminal front-end with the REmbedded facilities. R output was simply printf-ed to stdout. Everything worked as expected:

Initializing R environment
R version 3.4.3 detected
> print("???????????????????R is great!")
[1] "???????????????????R is great!"
> Sys.getlocale()
[1] "LC_COLLATE=Chinese (Simplified)_China.936;LC_CTYPE=Chinese (Simplified)_China.936;LC_MONETARY=Chinese (Simplified)_China.936;LC_NUMERIC=C;LC_TIME=Chinese (Simplified)_China.936"
>

I hope this information is helpful.

Best regards,
AzureFx
Tomas Kalibera
2018-Apr-05 14:42 UTC
[Rd] [Bug report] Chinese characters are not handled correctly in Rterm for Windows
Thank you for the report and initial debugging. I am not sure what is going wrong; we may have to rely on your help to debug this (I do not have a system to reproduce on). As a workaround for users, I would advise using RGui (Rgui.exe).

Does the problem also exist in R-devel?
https://cran.r-project.org/bin/windows/base/rdevel.html

Your example print("ABC\u4f60\u597dDEF") is printing two Chinese characters, right? The first one is C4E3 in CP936 (4F60 in Unicode) and the second one is BAC3 in CP936 (597D in Unicode)? Could you reproduce the problem when printing just one of the characters, say print("ABC\u4f60DEF")?

As a sanity check: does this display the correct characters in RGui? It should, and does on my system, as RGui uses Unicode internally. By correct I mean the characters shown e.g. here:

https://msdn.microsoft.com/en-us/library/cc194923.aspx
https://msdn.microsoft.com/en-us/library/cc194920.aspx

What is the output of "chcp" in the terminal before you run R.exe? It may be different from what Sys.getlocale() gives in R.

If you take the sequence of "fputc" calls you captured with the debugger and create a trivial console application that just runs them, would the characters display correctly in the same terminal from which you run R.exe?

Thanks
Tomas
Azure
2018-Apr-28 14:53 UTC
[Rd] [Bug report] Chinese characters are not handled correctly in Rterm for Windows
Hi Tomas,

Sorry for the delayed response. I have tested the problem on the latest R-devel build (2018-04-27 r74651), and it still exists. RGui always handles Chinese characters correctly, but some IDEs rely on the CLI version of R (e.g. Visual Studio Code with the R plugin).

>Your example print("ABC\u4f60\u597dDEF") is printing two Chinese characters, right?
Yes. U+4F60, U+597D, or C4E3, BAC3 in CP936.

>Could you reproduce the problem with printing just one of the characters, say print("ABC\u4f60DEF") ?
Yes. The console output is pasted at [ https://paste.ubuntu.com/p/TYgZWhdgXK/ ] (to avoid gibberish in e-mail). The active code page is 936 both before and after running Rterm.

>As a sanity check - does this display the correct characters in RGui?
Yes.

>If you take the sequence of the "fputc" commands you captured by the debugger, and create a trivial console application to just run them - would the characters display correctly in the same terminal from which you run R.exe?
Yes. I created a Win32 Console Application in VS [ https://paste.ubuntu.com/p/h3NFV6nQvs/ ], and all the characters were displayed correctly in both variants. The WriteConsoleA variant uses the current console code page settings, so it should behave like fputc.

I guess that Rterm uses its own console I/O mechanism, so the second parameter of fputc is not stdout's handle. (I tried to read the source but was unable to figure out how it works.) The crash in mbcs_get_next, which I mentioned in my previous post, may be related to this mechanism.

If you need further information, please let me know.

Thanks,
i at azurefx.name