thr3ads.net - R devel - [Rd] SUGGESTION: Force install.packages() to use ASCII encoding when parse():ing code? [Dec 2014]

Henrik Bengtsson

2014-Dec-11 17:59 UTC

[Rd] SUGGESTION: Force install.packages() to use ASCII encoding when parse():ing code?

SUGGESTION:
Would it make sense if install.packages() and friends always use an
"ascii"(*) encoding when parse():ing R package source code files?

I believe this should be safe, because R code files should be in ASCII
[http://en.wikipedia.org/wiki/ASCII] and only in source-code comments
you may use other characters.  This is from Section 'Package
subdirectories' in 'Writing R Extensions':

"Only ASCII characters (and the control characters tab, formfeed, LF
and CR) should be used in code files. Other characters are accepted in
comments, but then the comments may not be readable in e.g. a UTF-8
locale. Non-ASCII characters in object names will normally fail when
the package is installed. Any byte will be allowed in a quoted
character string but \uxxxx escapes should be used for non-ASCII
characters. However, non-ASCII character strings may not be usable in
some locales and may display incorrectly in others."

Since comments are dropped by parse(), their actual content does not
matter, and the rest of the code should be in ASCII.

(*) It could be that the specific encoding "ascii" is not
cross-platform. If so, is there another way to specify a pure ASCII
encoding?
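
The premise above (package code is ASCII apart from comments) can be checked for a given file without involving any encoding at all. The following is a minimal sketch in base R; the helper name non_ascii_lines() is hypothetical and not part of R or of this thread. It looks at raw bytes, so the result should not depend on options(encoding=...). The 'tools' package also provides showNonASCIIfile() for a similar purpose.

```r
# Hypothetical helper (not part of R): return the line numbers of a file
# that contain at least one byte outside the 7-bit ASCII range.
non_ascii_lines <- function(path) {
  bytes <- readBin(path, what = "raw", n = file.size(path))
  # Line number of each byte: 1 + number of LF (0x0a) bytes before it
  is_lf <- bytes == as.raw(0x0a)
  line_no <- cumsum(c(TRUE, is_lf[-length(is_lf)]))
  unique(line_no[as.integer(bytes) > 127L])
}

# Usage: the same bytes as the "foo.R" example later in this message
tmp <- tempfile(fileext = ".R")
writeBin(as.raw(c(0x23, 0x20, 0xe9, 0x74, 0x75, 0x64, 0x69, 0x61, 0x6e,
                  0x74, 0x0a)), tmp)
print(non_ascii_lines(tmp))
```

Run over the files in a package's R/ directory, a check like this would show whether the non-ASCII bytes really sit only in comment lines.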



BACKGROUND:
If a user/system sets the 'encoding' option at startup, it may break
package installations from source if the package has source code
comments with non-ASCII characters.  For example,

$ mkdir foo; cd foo
$ echo "options(encoding='UTF-8')" > .Rprofile
$ R --vanilla
> install.packages("R.oo", type="source")

> install.packages("R.oo", type="source")
Installing package into 'C:/Users/hb/R/win-library/3.2'
(as 'lib' is unspecified)
--- Please select a CRAN mirror for use in this session ---
trying URL 'http://cran.at.r-project.org/src/contrib/R.oo_1.18.0.tar.gz'
Content type 'application/x-gzip' length 394545 bytes (385 KB)
opened URL
downloaded 385 KB

* installing *source* package 'R.oo' ...
** package 'R.oo' successfully unpacked and MD5 sums checked
** R
Warning in parse(outFile) :
  invalid input found on input connection 'C:/Users/hb/R/win-library/3.2/R.oo/R/R.oo'
** inst
** preparing package for lazy loading
Warning in parse(n = -1, file = file, srcfile = NULL, keep.source = FALSE) :
  invalid input found on input connection 'C:/Users/hb/R/win-library/3.2/R.oo/R/R.oo'
** help
[...]

(This can be an extremely time-consuming problem to troubleshoot,
particularly when it is reported to a package maintainer who does not
have access to the original system.)

FYI, setting it only in the session is alright:
> options(encoding="UTF-8")
> install.packages("R.oo", type="source")
because install.packages() launches a separate R process for the
installation, and it is only in that process that the startup code
becomes an issue.


TROUBLESHOOTING:
My understanding of the

Warning in parse(n = -1, file = file, srcfile = NULL, keep.source = FALSE) :
  invalid input found on input connection
'C:/Users/hb/R/win-library/3.2/R.oo/R/

is that this happens when there is a non-ASCII character in one of the
source-code comments whose bit pattern looks like the start of a
multi-byte UTF-8 sequence
[http://en.wikipedia.org/wiki/UTF-8#Description].  For
instance, consider a source code comment with an acute accent:
> raw <- as.raw(c(0x23, 0x20, 0xe9, 0x74, 0x75, 0x64, 0x69, 0x61, 0x6e,
+ 0x74, 0x0a))
> writeBin(raw, con="foo.R")
> code <- readLines("foo.R")
> code
[1] "# étudiant"

> options(encoding="UTF-8")
> parse("foo.R")
Warning message:
In readLines(file, warn = FALSE) :
  invalid input found on input connection 'foo.R'

> options(encoding="ascii")
> parse("foo.R")
expression()

Reason for the "invalid input": The bit pattern for raw[3:5] is:
> R.utils::intToBin(raw[3:5])
[1] "11101001" "01110100" "01110101"

The first byte (raw[3]) matches the special UTF-8 byte pattern "1110xxxx",
which according to UTF-8 should be followed by two more bytes, each with
bit pattern "10xxxxxx"
[http://en.wikipedia.org/wiki/UTF-8#Description].  Since raw[4:5] does
not match that, it's an invalid UTF-8 byte sequence.  So, technically
this does not happen for all comments using acute accents, but it's
very likely.  More generally, a multi-byte UTF-8 sequence is expected
when byte pattern "11xxxxxx" (>= 192 in decimal values) is
encountered.
Looking at http://en.wikipedia.org/wiki/ISO/IEC_8859, there are several
characters with this bit pattern in many "Latin-N" encodings, which
I'd assume are still in dominant use by many developers.
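
The byte-pattern argument above can be written out as a small check. This is only a sketch of the rule as described in the Wikipedia article, not of how R or its connections validate input internally; the function names utf8_expected_continuations() and is_continuation() are made up for illustration.

```r
# Hypothetical helpers illustrating the UTF-8 lead-byte rule.

# Number of continuation bytes that a UTF-8 lead byte announces
utf8_expected_continuations <- function(byte) {
  b <- as.integer(byte)
  if (b < 0x80) {
    0L            # 0xxxxxxx: plain ASCII, no continuation bytes
  } else if (b >= 0xC0 && b < 0xE0) {
    1L            # 110xxxxx
  } else if (b >= 0xE0 && b < 0xF0) {
    2L            # 1110xxxx
  } else if (b >= 0xF0 && b < 0xF8) {
    3L            # 11110xxx
  } else {
    NA_integer_   # 10xxxxxx or 11111xxx: cannot start a sequence
  }
}

# TRUE for bytes of the form 10xxxxxx
is_continuation <- function(bytes) {
  bitwAnd(as.integer(bytes), 0xC0L) == 0x80L
}

# Usage: bytes 3 to 5 of the "foo.R" example (Latin-1 e-acute, "t", "u")
raw <- as.raw(c(0xe9, 0x74, 0x75))
print(utf8_expected_continuations(raw[1]))
print(is_continuation(raw[2:3]))
```

For comparison, the UTF-8 encoding of the same accented letter is the two bytes 0xc3 0xa9, where 0xc3 announces one continuation byte and 0xa9 has the "10xxxxxx" form.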

So, since options(encoding="UTF-8") was set at startup, that is also
the encoding that R tries to follow.  My suggestion is that R should
always be able to use a pure-ASCII encoding when parsing R code in
packages, because that is what 'Writing R Extensions' says we should
use in the first place.

/Henrik

Duncan Murdoch

2014-Dec-11 18:47 UTC

On 11/12/2014 12:59 PM, Henrik Bengtsson wrote:
> SUGGESTION:
> Would it make sense if install.packages() and friends always use an
> "ascii"(*) encoding when parse():ing R package source code files?

I think that would be a step backwards.  It would be better to accept 
other encodings.  As an English speaker this isn't a big deal to me, but 
users of other languages may want to have messages and variable names in 
their native language, and ASCII might not be enough for that.

On the other hand, I think it's quite reasonable to require a declared 
encoding if anything other than ASCII is used, and possibly to fail for 
some encodings.  It is probably also reasonable to at least warn when 
non-ASCII characters are used in strings in packages on CRAN, as many 
users can't display all characters.

Duncan Murdoch

Henrik Bengtsson

2014-Dec-11 20:28 UTC

On Thu, Dec 11, 2014 at 10:47 AM, Duncan Murdoch
<murdoch.duncan at gmail.com> wrote:
> On 11/12/2014 12:59 PM, Henrik Bengtsson wrote:
>>
>> SUGGESTION:
>> Would it make sense if install.packages() and friends always use an
>> "ascii"(*) encoding when parse():ing R package source code files?
>
> I think that would be a step backwards.  It would be better to accept other
> encodings.  As an English speaker this isn't a big deal to me, but users of
> other languages may want to have messages and variable names in their native
> language, and ASCII might not be enough for that.

Thanks for the feedback.  While I'll probably agree with you that R
packages should support other source code encodings than ASCII, that
would require a change in the specifications and design.  What I'm
proposing is (just) an adjustment to the implementation to meet the
current specs and design.

> On the other hand, I think it's quite reasonable to require a declared
> encoding if anything other than ASCII is used, and possibly to fail for some
> encodings.  It is probably also reasonable to at least warn when non-ASCII
> characters are used in strings in packages on CRAN, as many users can't
> display all characters.

That would be a reasonable extension of the design, which would be
backward compatible with the current design, i.e. if encoding for the
source code is not declared, then it is assumed to be ASCII.
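
The "declared encoding, otherwise ASCII" rule discussed here could be sketched as follows. This assumes the declaration lives in the 'Encoding' field of the package DESCRIPTION file, which 'Writing R Extensions' documents; the helper name declared_encoding() and the ASCII fallback are illustrative assumptions reflecting the design proposed in this thread, not a description of what R CMD INSTALL actually does.

```r
# Hypothetical helper: look up the encoding a package declares in its
# DESCRIPTION file, falling back to "ASCII" when none is declared.
# The fallback mirrors the design proposed in this thread, not R's behaviour.
declared_encoding <- function(pkg_dir) {
  desc <- read.dcf(file.path(pkg_dir, "DESCRIPTION"), fields = "Encoding")
  enc <- unname(desc[1, "Encoding"])
  if (is.na(enc)) "ASCII" else enc
}

# Usage with a throwaway package skeleton
pkg <- tempfile("demo")
dir.create(pkg)
writeLines(c("Package: demo", "Version: 0.1", "Encoding: UTF-8"),
           file.path(pkg, "DESCRIPTION"))
print(declared_encoding(pkg))
```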

Source code comments are special, because the current design
('Writing R Extensions') somehow leaves it open to use any type of
encoding.  If I read it freely, it could even be that you can use
different encodings for different comments in the same file (which is
not unlikely to occur considering cut'n'paste and open-source
licenses).  If other encodings are to be supported, then I see two
ways forward:

1. Have R completely ignore what's in the comments (what follows #
until the newline) such that encoding does not matter, or
2. require the same encoding for the source code comments as the rest
of the code.

As I see it, today's design falls (could fall?) under 1, but the
implementation does not go all the way to support it.
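
A crude approximation of option 1 is to make sure the parser never sees a non-ASCII byte in the first place. The sketch below replaces every byte above 127 with "?" before parsing; the function name parse_ascii_only() is hypothetical. Note the limitation: it does not distinguish comments from code, so non-ASCII bytes inside quoted strings would be replaced as well, which is only acceptable if such strings use \uxxxx escapes as 'Writing R Extensions' recommends.

```r
# Hypothetical helper: parse an R file after replacing all non-ASCII bytes
# with "?" so that the result does not depend on any encoding setting.
# Caveat: this also alters non-ASCII bytes inside string literals.
parse_ascii_only <- function(path) {
  bytes <- readBin(path, what = "raw", n = file.size(path))
  bytes[as.integer(bytes) > 127L] <- as.raw(0x3f)  # 0x3f is "?"
  parse(text = rawToChar(bytes), keep.source = FALSE)
}

# Usage: one assignment followed by a comment containing a Latin-1 byte
tmp <- tempfile(fileext = ".R")
writeBin(c(charToRaw("x <- 1 # "), as.raw(c(0xe9, 0x0a))), tmp)
exprs <- parse_ascii_only(tmp)
print(length(exprs))
```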

/Henrik

PS. It should be emphasized that this is about R packages. AFAIK, you
can already now source() code written in any encoding,
e.g.

> raw <- as.raw(c(
+  0xcf, 0x80, 0x20, 0x3c, 0x2d, 0x20, 0x70, 0x69, 0x0a,
+  0x70, 0x72, 0x69, 0x6e, 0x74, 0x28, 0xcf, 0x80, 0x29, 0x0a
+ ))
> writeBin(raw, con="pi.R")
> source("pi.R", encoding="UTF-8")
[1] 3.141593

Bjørn-Helge Mevik

2014-Dec-12 09:12 UTC

Duncan Murdoch <murdoch.duncan at gmail.com> writes:
> users of other languages may want to have messages and variable names
> in their native language, and ASCII might not be enough for that.
Allowing for messages in non-ASCII encodings would probably be a good
idea, but I think allowing non-ASCII variable names is dangerous.

-- 
Regards,
Bjørn-Helge Mevik
