An earlier thread (in 10/2006) discussed encoding issues in the context of R data and the desire to represent accented characters.

It matters in another setting: the output generated by R and the seemingly order character "'" (single quote). In particular, R CMD check runs test code and compares the generated output to a saved file of expected output. This does not work reliably across encoding schemes. This is unfortunate, since it seems the "expected output" files will necessarily be wrong for someone.

The problem for me was triggered by the single-quote character "'". On my older systems, this is encoded by 0x27, a perfectly fine ASCII character. That is on a Debian GNU/Linux system with LANG=en_US. On a newer system I have LANG=en_US.UTF-8. I don't recall whether this was a deliberate choice on my part, or simply reflects changing defaults for the installer. (Note the earlier thread referred to the Debian-derived Ubuntu systems as having switched to UTF-8.) Under UTF-8 the same character is encoded in the 3-byte sequence 0xE28098 (which seems odd; I thought the point of UTF-8 was that ASCII was a legitimate subset).

The coefficient printing methods in the stats package use the single-quote in the key explaining significance levels:

Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

I suppose one possible work-around for R CMD check would be to set the encoding to some standard value before it runs tests, but that has some drawbacks. It doesn't work for packages needing a different encoding (but perhaps the package could specify an encoding to use by default?)(*). It will leave the output files looking weird on systems with a different encoding. It will get messed up if one generates the files under the wrong encoding.

And none of this addresses stuff beyond the context of output file comparison in R CMD check.

Any thoughts?
Ross Boylan

* From the R Extensions document, discussing the DESCRIPTION file:

If the `DESCRIPTION' file is not entirely in ASCII it should contain an `Encoding' field specifying an encoding. This is currently used as the encoding of the `DESCRIPTION' file itself, and may in the future be taken as the encoding for other documentation in the package. Only encoding names `latin1', `latin2' and `UTF-8' are known to be portable.

I would not expect that the test output files be considered "documentation," but I suppose that's subject to interpretation.
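For illustration only, here is a minimal sketch (in Python rather than R, simply to make the bytes easy to inspect) of one way an output comparison could be made insensitive to the quote style. It assumes the only difference between the two files is that one uses the typographic quotes U+2018/U+2019 where the other uses the ASCII apostrophe; the 0xE28098 sequence mentioned above is the UTF-8 form of U+2018. The names `normalize_quotes` and `outputs_match` are hypothetical and are not part of R or R CMD check.

```python
# Hypothetical helpers; R CMD check does not work this way as far as
# this thread establishes. This only sketches the idea.

# Map the typographic single quotes onto the ASCII apostrophe (0x27).
QUOTE_MAP = str.maketrans({
    "\u2018": "'",  # LEFT SINGLE QUOTATION MARK
    "\u2019": "'",  # RIGHT SINGLE QUOTATION MARK
})


def normalize_quotes(text: str) -> str:
    """Replace typographic single quotes with the ASCII apostrophe."""
    return text.translate(QUOTE_MAP)


def outputs_match(actual: bytes, expected: bytes, encoding: str = "utf-8") -> bool:
    """Compare two outputs after decoding and normalizing quotes.

    Assumes both inputs are valid in `encoding`; pure ASCII input is
    also valid UTF-8, so an ASCII reference file decodes unchanged.
    """
    return (normalize_quotes(actual.decode(encoding))
            == normalize_quotes(expected.decode(encoding)))


# Output as it might be produced in a UTF-8 locale vs. an ASCII reference.
utf8_line = "Signif. codes: 0 \u2018***\u2019 0.001 \u2018**\u2019 0.01".encode("utf-8")
ascii_line = b"Signif. codes: 0 '***' 0.001 '**' 0.01"

print(utf8_line == ascii_line)               # False: the raw bytes differ
print(outputs_match(utf8_line, ascii_line))  # True: equal after normalization
```

Normalizing before comparing leaves the saved files untouched, but it would also hide a genuine change in quoting, so it is a trade-off rather than a fix.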
Ross Boylan
2007-Jan-19 19:39 UTC
[Rd] encoding issues even w/o accents (background on single quotes)
On Wed, Jan 17, 2007 at 11:56:15PM -0800, Ross Boylan wrote:
> An earlier thread (in 10/2006) discussed encoding issues in the
> context of R data and the desire to represent accented characters.
>
> It matters in another setting: the output generated by R and the
> seemingly order character "'" (single quote). In particular, R CMD
            ^^^^^ should be "ordinary"
> check runs test code and compares the generated output to a saved file
> of expected output. This does not work reliably across encoding
> schemes. This is unfortunate, since it seems the "expected output"
> files will necessarily be wrong for someone.
>
> The problem for me was triggered by the single-quote character "'".
> On my older systems, this is encoded by 0x27, a perfectly fine ASCII
> character. That is on a Debian GNU/Linux system with LANG=en_US. On
> a newer system I have LANG=en_US.UTF-8. I don't recall whether
> this was a deliberate choice on my part, or simply reflects changing
> defaults for the installer. (Note the earlier thread referred to the
> Debian-derived Ubuntu systems as having switched to UTF-8). Under
> UTF-8 the same character is encoded in the 3-byte sequence 0xE28098
> (which seems odd; I thought the point of UTF-8 was that ASCII was a
> legitimate subset).

Apparently quoting, particularly single quotes, is a can of worms:
http://www.cl.cam.ac.uk/~mgk25/ucs/quotes.html

When Unicode is available (which would be the case with UTF-8), particular non-ASCII characters are recommended for single quoting. The 3-byte sequence is the UTF-8 encoding of U+2018, the recommended left single quote mark. See http://en.wikipedia.org/wiki/UTF-8 on UTF-8 encoding.

This is more than I or, probably, you ever wanted to know about this issue!

Ross

> The coefficient printing methods in the stats package use the
> single-quote in the key explaining significance levels:
> Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
>
> I suppose one possible work-around for R CMD check would be to set the
> encoding to some standard value before it runs tests, but that has
> some drawbacks. It doesn't work for packages needing a different
> encoding (but perhaps the package could specify an encoding to use by
> default?)(*), It will leave the output files looking weird on systems
> with a different encoding. It will get messed up if one generates the
> files under the wrong encoding.
>
> And none of this addresses stuff beyond the context of output file
> comparison in R CMD check.
>
> Any thoughts?
>
> Ross Boylan
>
>
> * From the R Extensions document, discussing the DESCRIPTION file:
> If the `DESCRIPTION' file is not entirely in ASCII it should contain
> an `Encoding' field specifying an encoding. This is currently used as
> the encoding of the `DESCRIPTION' file itself, and may in the future be
> taken as the encoding for other documentation in the package. Only
> encoding names `latin1', `latin2' and `UTF-8' are known to be portable.
>
> I would not expect that the test output files be considered
> "documentation," but I suppose that's subject to interpretation.
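To make the byte-level point concrete, the following short Python sketch (Python is used only for convenience; nothing here is R-specific) shows that the ASCII apostrophe is still the single byte 0x27 under UTF-8, and that 0xE28098 is a different character, U+2018. The helper `utf8_three_byte` is a hypothetical name introduced for illustration; it spells out the standard 3-byte UTF-8 bit layout by hand.

```python
def utf8_three_byte(code_point: int) -> bytes:
    """Encode a code point in the range U+0800..U+FFFF by hand.

    Layout: 1110xxxx 10xxxxxx 10xxxxxx (surrogates are not checked
    here; this is a sketch, not a full encoder).
    """
    if not 0x0800 <= code_point <= 0xFFFF:
        raise ValueError("code point is not in the 3-byte UTF-8 range")
    return bytes([
        0xE0 | (code_point >> 12),          # top 4 bits
        0x80 | ((code_point >> 6) & 0x3F),  # middle 6 bits
        0x80 | (code_point & 0x3F),         # low 6 bits
    ])


apostrophe = "\u0027"  # ASCII single quote
left_quote = "\u2018"  # LEFT SINGLE QUOTATION MARK

print(apostrophe.encode("utf-8").hex())                       # 27
print(left_quote.encode("utf-8").hex())                       # e28098
print(utf8_three_byte(0x2018).hex())                          # e28098
print(bytes.fromhex("e28098").decode("utf-8") == left_quote)  # True
```

So ASCII does remain a subset of UTF-8; the output differs because a different character is being printed in the UTF-8 locale, not because the apostrophe is encoded differently.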