Henrik Bengtsson
2014-Dec-11 17:59 UTC
[Rd] SUGGESTION: Force install.packages() to use ASCII encoding when parse():ing code?
SUGGESTION:
Would it make sense if install.packages() and friends always use an "ascii"(*) encoding when parse():ing R package source code files?

I believe this should be safe, because R code files should be in ASCII [http://en.wikipedia.org/wiki/ASCII] and only in source-code comments you may use other characters. This is from Section 'Package subdirectories' in 'Writing R Extensions':

"Only ASCII characters (and the control characters tab, formfeed, LF and CR) should be used in code files. Other characters are accepted in comments, but then the comments may not be readable in e.g. a UTF-8 locale. Non-ASCII characters in object names will normally fail when the package is installed. Any byte will be allowed in a quoted character string but \uxxxx escapes should be used for non-ASCII characters. However, non-ASCII character strings may not be usable in some locales and may display incorrectly in others."

Since comments are dropped by parse(), their actual content does not matter, and the rest of the code should be in ASCII.

(*) It could be that the specific encoding "ascii" is not cross-platform. If so, is there another way to specify a pure ASCII encoding?

BACKGROUND:
If a user/system sets the 'encoding' option at startup, it may break package installations from source if the package has source code comments with non-ASCII characters. For example,

$ mkdir foo; cd foo
$ echo "options(encoding='UTF-8')" > .Rprofile
$ R --vanilla
> install.packages("R.oo", type="source")
Installing package into 'C:/Users/hb/R/win-library/3.2'
(as 'lib' is unspecified)
--- Please select a CRAN mirror for use in this session ---
trying URL 'http://cran.at.r-project.org/src/contrib/R.oo_1.18.0.tar.gz'
Content type 'application/x-gzip' length 394545 bytes (385 KB)
opened URL
downloaded 385 KB

* installing *source* package 'R.oo' ...
** package 'R.oo' successfully unpacked and MD5 sums checked
** R
Warning in parse(outFile) :
  invalid input found on input connection 'C:/Users/hb/R/win-library/3.2/R.oo/R/R.oo'
** inst
** preparing package for lazy loading
Warning in parse(n = -1, file = file, srcfile = NULL, keep.source = FALSE) :
  invalid input found on input connection 'C:/Users/hb/R/win-library/3.2/R.oo/R/R.oo'
** help
[...]

(This can be an extremely time-consuming task to troubleshoot, particularly if it is reported to a package maintainer who does not have access to the original system.)

FYI, setting it only in the session is alright:

> options(encoding="UTF-8")
> install.packages("R.oo", type="source")

because install.packages() launches a separate R process for the installation, and it is only there that the startup code becomes an issue.

TROUBLESHOOTING:
My understanding of the

Warning in parse(n = -1, file = file, srcfile = NULL, keep.source = FALSE) :
  invalid input found on input connection 'C:/Users/hb/R/win-library/3.2/R.oo/R/

is that this happens when there is a non-ASCII character in one of the source-code comments (*) with a bit pattern matching a multi-byte UTF-8 sequence [http://en.wikipedia.org/wiki/UTF-8#Description]. For instance, consider a source code comment with an acute accent:

> raw <- as.raw(c(0x23, 0x20, 0xe9, 0x74, 0x75, 0x64, 0x69, 0x61, 0x6e, 0x74, 0x0a))
> writeBin(raw, con="foo.R")
> code <- readLines("foo.R")
> code
[1] "# étudiant"

> options(encoding="UTF-8")
> parse("foo.R")
Warning message:
In readLines(file, warn = FALSE) :
  invalid input found on input connection 'foo.R'

> options(encoding="ascii")
> parse("foo.R")
expression()

Reason for the "invalid input": The bit pattern for raw[3:5] is:

> R.utils::intToBin(raw[3:5])
[1] "11101001" "01110100" "01110101"

The first byte (raw[3]) matches the special UTF-8 byte pattern "1110xxxx", which according to UTF-8 should be followed by two more bytes, each with bit pattern "10xxxxxx" [http://en.wikipedia.org/wiki/UTF-8#Description]. Since raw[4:5] do not match that pattern, it's an invalid UTF-8 byte sequence. So, technically this does not happen for all comments using acute accents, but it's very likely. More generally, a multi-byte UTF-8 sequence is expected when byte pattern "11xxxxxx" (>= 192 in decimal values) is encountered. Looking at http://en.wikipedia.org/wiki/ISO/IEC_8859, there are several characters with this bit pattern in many "Latin-N" encodings, which I'd assume are still in dominant use by many developers.

So, since options(encoding="UTF-8") was set at startup, that is also the encoding that R tries to follow. My suggestion is that R should always use a pure-ASCII encoding when parsing R code in packages, because that is what 'Writing R Extensions' says we should use in the first place.

/Henrik
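The byte-pattern argument above can be checked mechanically. What follows is a minimal sketch, not code from R itself: utf8_sequence_length() and is_continuation_byte() are hypothetical helper names, and the check looks only at the leading bits, so it ignores the finer UTF-8 rules (overlong forms, surrogates). The last line cross-checks against base R's validUTF8(), which needs R >= 3.3.0 and so was not available when this thread was written.

```r
# Expected length of a UTF-8 sequence, judged from its first byte alone.
# Returns NA for a byte that cannot start a sequence.
utf8_sequence_length <- function(byte) {
  b <- as.integer(byte)
  if (b < 0x80) {
    1L                # 0xxxxxxx: plain ASCII
  } else if (b >= 0xC0 && b < 0xE0) {
    2L                # 110xxxxx
  } else if (b >= 0xE0 && b < 0xF0) {
    3L                # 1110xxxx
  } else if (b >= 0xF0 && b < 0xF8) {
    4L                # 11110xxx
  } else {
    NA_integer_       # 10xxxxxx (continuation) or not usable as a lead byte
  }
}

# TRUE for bytes of the form 10xxxxxx (vectorised).
is_continuation_byte <- function(bytes) {
  b <- as.integer(bytes)
  b >= 0x80 & b < 0xC0
}

# The Latin-1 encoded comment from the example above.
bytes <- as.raw(c(0x23, 0x20, 0xe9, 0x74, 0x75, 0x64, 0x69, 0x61, 0x6e, 0x74, 0x0a))

expected_length <- utf8_sequence_length(bytes[3])       # lead byte 0xe9
followers_ok <- all(is_continuation_byte(bytes[4:5]))   # bytes 0x74 and 0x75

print(expected_length)
print(followers_ok)

# Cross-check with base R (requires R >= 3.3.0).
print(validUTF8(rawToChar(bytes)))
```

The lead byte 0xe9 announces a three-byte sequence, but the two bytes after it are ordinary ASCII letters rather than continuation bytes, which is the mismatch described in the message.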
Duncan Murdoch
2014-Dec-11 18:47 UTC
[Rd] SUGGESTION: Force install.packages() to use ASCII encoding when parse():ing code?
On 11/12/2014 12:59 PM, Henrik Bengtsson wrote:
> SUGGESTION:
> Would it make sense if install.packages() and friends always use an
> "ascii"(*) encoding when parse():ing R package source code files?

I think that would be a step backwards. It would be better to accept other encodings. As an English speaker this isn't a big deal to me, but users of other languages may want to have messages and variable names in their native language, and ASCII might not be enough for that.

On the other hand, I think it's quite reasonable to require a declared encoding if anything other than ASCII is used, and possibly to fail for some encodings. It is probably also reasonable to at least warn when non-ASCII characters are used in strings in packages on CRAN, as many users can't display all characters.

Duncan Murdoch

> ______________________________________________
> R-devel at r-project.org mailing list
> https://stat.ethz.ch/mailman/listinfo/r-devel
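The idea of requiring a declared encoding whenever non-ASCII is present could be approximated with a check along the following lines. This is only a sketch under stated assumptions: check_declared_encoding() is a hypothetical helper, not an existing R or CRAN check; it looks at raw bytes in files under R/ (so it does not distinguish comments from strings), and it assumes the declaration lives in the 'Encoding' field of DESCRIPTION. The package skeleton is a throwaway built in tempdir() for illustration.

```r
# Hypothetical helper, not part of R: report which files under R/ contain
# non-ASCII bytes and which encoding (if any) DESCRIPTION declares.
check_declared_encoding <- function(pkg_dir) {
  files <- list.files(file.path(pkg_dir, "R"), pattern = "[.][Rr]$",
                      full.names = TRUE)
  non_ascii <- vapply(files, function(f) {
    bytes <- readBin(f, what = "raw", n = file.size(f))
    any(as.integer(bytes) > 127L)
  }, logical(1))
  declared <- read.dcf(file.path(pkg_dir, "DESCRIPTION"), fields = "Encoding")
  list(
    files    = basename(files[non_ascii]),
    encoding = unname(declared[1, "Encoding"]),   # NA when not declared
    ok       = !any(non_ascii) || !is.na(declared[1, "Encoding"])
  )
}

# Throwaway package skeleton, for illustration only.
pkg <- file.path(tempdir(), "demo")
dir.create(file.path(pkg, "R"), recursive = TRUE, showWarnings = FALSE)
writeLines(c("Package: demo", "Version: 0.1"), file.path(pkg, "DESCRIPTION"))
writeLines("f <- function() 1", file.path(pkg, "R", "ascii.R"))
writeBin(as.raw(c(0x23, 0x20, 0xe9, 0x74, 0x75, 0x64, 0x69, 0x61, 0x6e, 0x74, 0x0a)),
         file.path(pkg, "R", "accent.R"))

result <- check_declared_encoding(pkg)
print(result)
```

Working on raw bytes keeps the check independent of whatever encoding the files are actually in, at the cost of not being able to say which characters were found.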
Henrik Bengtsson
2014-Dec-11 20:28 UTC
[Rd] SUGGESTION: Force install.packages() to use ASCII encoding when parse():ing code?
On Thu, Dec 11, 2014 at 10:47 AM, Duncan Murdoch <murdoch.duncan at gmail.com> wrote:
> On 11/12/2014 12:59 PM, Henrik Bengtsson wrote:
>>
>> SUGGESTION:
>> Would it make sense if install.packages() and friends always use an
>> "ascii"(*) encoding when parse():ing R package source code files?
>
> I think that would be a step backwards. It would be better to accept other
> encodings. As an English speaker this isn't a big deal to me, but users of
> other languages may want to have messages and variable names in their native
> language, and ASCII might not be enough for that.

Thanks for the feedback. While I'll probably agree with you that R packages should support other source code encodings than ASCII, that would require a change in the specifications and design. What I'm proposing is (just) an adjustment to the implementation to meet the current specs and design.

> On the other hand, I think it's quite reasonable to require a declared
> encoding if anything other than ASCII is used, and possibly to fail for some
> encodings. It is probably also reasonable to at least warn when non-ASCII
> characters are used in strings in packages on CRAN, as many users can't
> display all characters.

That would be a reasonable extension of the design, which would be backward compatible with the current design, i.e. if the encoding for the source code is not declared, then it is assumed to be ASCII.

Source code comments are special, because the current design ('Writing R Extensions') somehow leaves it open to use any type of encoding in them. If I read it freely, it could even be that you can use different encodings for different comments in the same file (which is not unlikely to occur considering cut'n'paste and open-source licenses).

If other encodings are to be supported, then I see two ways forward:

1. Have R completely ignore what's in the comments (what follows # until the newline) such that encoding does not matter, or
2. require the same encoding for the source code comments as the rest of the code.

As I see it, today's design falls (could fall?) under 1, but the implementation does not go all the way to support it.

/Henrik

PS. It should be emphasized that this is about R packages. AFAIK, you can already now source() code written in any encoding, e.g.

> raw <- as.raw(c(
+   0xcf, 0x80, 0x20, 0x3c, 0x2d, 0x20, 0x70, 0x69, 0x0a,
+   0x70, 0x72, 0x69, 0x6e, 0x74, 0x28, 0xcf, 0x80, 0x29, 0x0a
+ ))
> writeBin(raw, con="pi.R")
> source("pi.R", encoding="UTF-8")
[1] 3.141593
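One crude way to approximate the "encoding of comments does not matter" behaviour is to make the input pure ASCII before the parser sees it. The sketch below is an assumption-laden illustration, not what install.packages() does and not the same as truly ignoring comments: parse_as_ascii() is a hypothetical helper that replaces every byte above 0x7f with "?" anywhere in the file, so it would also alter string literals that contain raw non-ASCII bytes (which 'Writing R Extensions' says should be written with \uxxxx escapes anyway). The file is read with readBin() as raw bytes rather than through a text-mode connection.

```r
# Hypothetical helper, not part of R: parse a file after replacing every
# byte above 0x7f with "?", so the parser only ever sees ASCII.
parse_as_ascii <- function(path) {
  bytes <- readBin(path, what = "raw", n = file.size(path))
  bytes[as.integer(bytes) > 127L] <- as.raw(0x3f)   # 0x3f is "?"
  parse(text = rawToChar(bytes), keep.source = FALSE)
}

# A Latin-1 encoded comment followed by one line of ASCII code.
path <- tempfile(fileext = ".R")
writeBin(c(as.raw(c(0x23, 0x20, 0xe9, 0x74, 0x75, 0x64, 0x69, 0x61, 0x6e, 0x74, 0x0a)),
           charToRaw("x <- 1 + 1\n")),
         path)

exprs <- parse_as_ascii(path)
print(exprs)
```

Because the comment is discarded by the parser, the substituted byte never shows up in the result; only the single assignment expression remains.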
Bjørn-Helge Mevik
2014-Dec-12 09:12 UTC
[Rd] SUGGESTION: Force install.packages() to use ASCII encoding when parse():ing code?
Duncan Murdoch <murdoch.duncan at gmail.com> writes:
> users of other languages may want to have messages and variable names
> in their native language, and ASCII might not be enough for that.

Allowing for messages in non-ASCII encodings would probably be a good idea, but I think allowing non-ASCII variable names is dangerous.

--
Regards,
Bjørn-Helge Mevik