thr3ads.net - R devel - [Rd] [bug report] Cyrillic letter "я" interrupts script execution via R source function [Aug 2017]

Владимир Панфилов

2017-Aug-28 09:27 UTC

[Rd] [bug report] Cyrillic letter "я" interrupts script execution via R source function

Hello,

I do not have an account on R Bugzilla, so I will post my bug report here.
I want to report a very old bug in the base R *source()* function. It relates
to sourcing R scripts in UTF-8 encoding on Windows machines. For some
reason, if a UTF-8 script contains the Cyrillic letter *"я"*, script
execution is interrupted directly at this letter (by the way, the same scripts
source fine when they are encoded in the system's CP1251 encoding).

Let's consider the following script that prints random Russian words:

> print("?????")
> print("????")
> print("???????")
> print("????")

When this script is sourced we get an INCOMPLETE_STRING error:

> source('D:/R code/test_cyr_letter.R', encoding = 'UTF-8', echo=TRUE)
> Error in source("D:/R code/test_cyr_letter.R", encoding = "UTF-8", echo = TRUE) :
>   D:/R code/test_cyr_letter.R:3:7: unexpected INCOMPLETE_STRING
> 2: print("????")
> 3: print("??
>          ^

Note that this bug is not triggered when the same file is executed using
*eval(parse(...))*:

> > eval(parse('D:/R code/test_cyr_letter.R', encoding="UTF-8"))
> [1] "?????"
> [1] "????"
> [1] "???????"
> [1] "????"

I did some research and noticed that the *source* and *parse* functions have
similar code for reading files. After analyzing the code of the *source()*
function, I found that commenting out one line fixes this bug, and
the overridden function works fine. See this part of the *source()* function
code:

> ...
> filename <- file
> file <- file(filename, "r")
> # on.exit(close(file))  #### COMMENT THIS LINE ####
> if (isTRUE(keep.source)) {
>   lines <- scan(file, what="character", encoding = encoding, sep = "\n")
>   on.exit()
>   close(file)
>   srcfile <- srcfilecopy(filename, lines, file.mtime(filename)[1],
>                          isFile = TRUE)
> }
> ...
>

I do not fully understand this weird behaviour, so I ask the R Core
developers for help in fixing this annoying bug, which prevents using Unicode
scripts with Cyrillic text on Windows.
Maybe that part of the *source()* function should read files the way the
*parse()* function does?

*Session and encoding info:*
> > sessionInfo()
> R version 3.4.1 (2017-06-30)
> Platform: x86_64-w64-mingw32/x64 (64-bit)
> Running under: Windows 7 x64 (build 7601) Service Pack 1
> Matrix products: default
> locale:
> [1] LC_COLLATE=Russian_Russia.1251  LC_CTYPE=Russian_Russia.1251
>  LC_MONETARY=Russian_Russia.1251
> [4] LC_NUMERIC=C                    LC_TIME=Russian_Russia.1251
> attached base packages:
> [1] stats     graphics  grDevices utils     datasets  methods   base
> loaded via a namespace (and not attached):
> [1] compiler_3.4.1 tools_3.4.1

> > l10n_info()
> $MBCS
> [1] FALSE
> $`UTF-8`
> [1] FALSE
> $`Latin-1`
> [1] FALSE
> $codepage
> [1] 1251

Patrick Perry

2017-Aug-28 12:24 UTC

My understanding (which could be wrong) is that when you source a file, 
it first gets translated to your native locale and then parsed. When you 
parse a character vector, it does not get translated.

In your locale, every "я" character (U+044F) gets replaced by the byte
"\xFF":

> iconv("\u044f", "UTF-8", "Windows-1251")
[1] "\xff"

I suspect that particular value causes trouble for the R parser, which 
uses a stack of previously-seen characters (include/Defn.h):

LibExtern char    R_ParseContext[PARSE_CONTEXT_SIZE] INI_as("");

And at various places checks whether the context character is EOF. That 
character is defined as

#define R_EOF    -1

Which, when cast to a char, is 0xFF.
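
The collision described above can be checked in isolation. The following is a minimal sketch, not code from R: the function names are hypothetical, and only the value of R_EOF is taken from the definition quoted in this message. It assumes a platform where converting 0xFF to signed char yields -1, which is the usual two's-complement behaviour but is implementation-defined in C.

```c
#include <assert.h>

#define R_EOF -1   /* same value as the definition quoted above */

/* Value a signed char holds after being assigned the raw byte.
   Assumption: out-of-range conversion wraps, so 0xFF becomes -1. */
int byte_as_signed_char(unsigned char byte)
{
    signed char c = (signed char)byte;
    return c;
}

/* Value of the same byte when widened without sign extension. */
int byte_as_int(unsigned char byte)
{
    return byte;
}

/* Hypothetical self-check summarising the point of the sketch. */
void check_ff_collides_with_eof(void)
{
    /* Through a signed char, the data byte 0xFF looks exactly like R_EOF. */
    assert(byte_as_signed_char(0xFF) == R_EOF);
    /* Widened as unsigned, it stays a distinct value in 0..255. */
    assert(byte_as_int(0xFF) == 255);
    assert(byte_as_int(0xFF) != R_EOF);
    /* The neighbouring byte 0xFE does not collide either way. */
    assert(byte_as_signed_char(0xFE) != R_EOF);
}
```

This is why C's standard read functions return an int holding either an unsigned char value or a negative end-of-file marker: the wider type leaves room for all 256 byte values plus the marker.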

I suspect that your example is revealing two bugs:

1) The R parser seems to have trouble with native characters encoded as 
0xFF. It's possible that, since R strings can't contain 0x00, this can 
be fixed by changing the definition of R_EOF to

#define R_EOF     0

2) The other bug is that, as I understand the situation, "source" will
fail if the file contains a character that cannot be represented in your 
native locale. This is a harder bug to tackle because of the way file() 
and the other connection methods are designed, where they translate the 
input to your native locale. I don't know if it's possible to override 
this behavior, and have them translate input to UTF-8 instead.

Patrick

Tomas Kalibera

2018-Apr-09 08:00 UTC

Hi Vladimir,

thanks for your report - this was really a bug, now fixed in R-devel and 
to appear in 3.5.0.

Apart from the bug, having source files in UTF-8 and reading them into R 
on Windows is perfectly fine, you just need to specify that they are in 
UTF-8. You also need to make sure R is running in Russian locale 
(CP1251) if that is not the default. On my system, this works fine:

Sys.setlocale(locale="Russian")
source("russian_utf8.R", encoding="UTF-8")

Best
Tomas


Tomas Kalibera

2018-Apr-09 08:42 UTC

Hi Patrick,

thanks for your comments on the bug. Just to clarify: one could
reproduce the bug simply using file() and readLines(). The parser saw a
real end of file, as (incorrectly) communicated to it by the lower-level
connections code. There is no related design issue in the parser (nor
elsewhere); it was a bug in the connections code and is now fixed.

You can specify the source encoding in "file()" or "source()" to tell R that
the source file is in that given encoding. R will convert the file
contents to the current native encoding of the R session. If in doubt,
please check the documentation ?file, ?source, ?readLines, ?Encoding for
the details.

The observation that "я" is represented as 0xff (-1 as signed char) and
R_EOF/EOF is -1 (but integer) was related to the bug, well spotted.
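
To illustrate how a byte-level reader can report a false end of file, here is a minimal sketch of this general class of bug. It is not R's connections code: the struct, the function names, and the SKETCH_EOF sentinel are all hypothetical. The sample buffer is print("я") as it would look after conversion to CP1251, where "я" is the single byte 0xFF.

```c
#include <assert.h>
#include <stddef.h>

#define SKETCH_EOF (-1)  /* hypothetical sentinel with the same value as R_EOF */

/* Hypothetical in-memory "connection": a byte buffer plus a read position. */
struct byte_source {
    const unsigned char *data;
    size_t len;
    size_t pos;
};

/* Buggy variant: the byte passes through a signed char, so on the usual
   two's-complement platforms 0xFF is returned as -1, i.e. as SKETCH_EOF. */
int next_byte_signed(struct byte_source *src)
{
    assert(src->pos <= src->len);
    if (src->pos == src->len)
        return SKETCH_EOF;
    signed char c = (signed char)src->data[src->pos++];
    return c;
}

/* Safer variant: widen as unsigned char, so data bytes are always 0..255. */
int next_byte_unsigned(struct byte_source *src)
{
    assert(src->pos <= src->len);
    if (src->pos == src->len)
        return SKETCH_EOF;
    return src->data[src->pos++];
}

/* Count how many bytes a consumer sees before it is told the input ended. */
size_t bytes_before_eof(const unsigned char *data, size_t len,
                        int (*next)(struct byte_source *))
{
    struct byte_source src = { data, len, 0 };
    size_t seen = 0;
    while (next(&src) != SKETCH_EOF)
        seen++;
    return seen;
}

/* print("я") in CP1251: seven ASCII bytes, 0xFF for the letter, then ") */
const unsigned char sample_script[] = {
    'p', 'r', 'i', 'n', 't', '(', '"', 0xFF, '"', ')'
};
```

With this 10-byte sample, bytes_before_eof() returns 7 for next_byte_signed and 10 for next_byte_unsigned. In the first case the consumer is cut off right after the opening quote, which is the same shape of failure as the INCOMPLETE_STRING error in the original report.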

Best
Tomas
