thr3ads.net - R devel - [Rd] encoding issues even w/o accents [Jan 2007]


Ross Boylan

2007-Jan-18 07:56 UTC

[Rd] encoding issues even w/o accents

An earlier thread (in 10/2006) discussed encoding issues in the
context of R data and the desire to represent accented characters.

It matters in another setting: the output generated by R and the
seemingly ordinary character "'" (single quote).  In particular, R CMD
check runs test code and compares the generated output to a saved file
of expected output.  This does not work reliably across encoding
schemes.  This is unfortunate, since it seems the "expected output"
files will necessarily be wrong for someone.

The problem for me was triggered by the single-quote character "'".
On my older systems, this is encoded by 0x27, a perfectly fine ASCII
character.  That is on a Debian GNU/Linux system with LANG=en_US.  On
a newer system I have LANG=en_US.UTF-8.  I don't recall whether
this was a deliberate choice on my part, or simply reflects changing
defaults for the installer.  (Note the earlier thread referred to the
Debian-derived Ubuntu systems as having switched to UTF-8).  Under
UTF-8 the same character is encoded in the 3-byte sequence 0xE28098
(which seems odd; I thought the point of UTF-8 was that ASCII was a
legitimate subset).

The coefficient printing methods in the stats package use the
single-quote in the key explaining significance levels:
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

I suppose one possible work-around for R CMD check would be to set the
encoding to some standard value before it runs tests, but that has
some drawbacks.  It doesn't work for packages needing a different
encoding (but perhaps the package could specify an encoding to use by
default?)(*).  It will leave the output files looking weird on systems
with a different encoding.  It will get messed up if one generates the
files under the wrong encoding.
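
A different approach from forcing a single encoding would be to make the comparison itself tolerant of quote style. The following is only a rough sketch of that idea in Python, not a description of what R CMD check does; the helper names normalize_quotes and outputs_match are hypothetical, and the "actual" line assumes a UTF-8 locale that prints directional quotes in the significance key.

```python
# Hypothetical helpers for illustration; not part of R or R CMD check.
# Map common typographic quotes to their ASCII counterparts.
QUOTE_MAP = str.maketrans({
    "\u2018": "'",   # left single quotation mark
    "\u2019": "'",   # right single quotation mark
    "\u201c": '"',   # left double quotation mark
    "\u201d": '"',   # right double quotation mark
})


def normalize_quotes(text: str) -> str:
    """Return text with typographic quotes replaced by ASCII quotes."""
    return text.translate(QUOTE_MAP)


def outputs_match(actual: str, expected: str) -> bool:
    """Compare two outputs while ignoring differences in quote style."""
    return normalize_quotes(actual) == normalize_quotes(expected)


# Expected output saved under an ASCII/Latin-1 style locale.
expected = "Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1"

# The same line as it might be produced under a UTF-8 locale (assumed).
actual = (
    "Signif. codes:  0 \u2018***\u2019 0.001 \u2018**\u2019 0.01 "
    "\u2018*\u2019 0.05 \u2018.\u2019 0.1 \u2018 \u2019 1"
)

print(actual == expected)               # plain comparison
print(outputs_match(actual, expected))  # quote-insensitive comparison
```

This sidesteps the question of which encoding the saved files use, at the cost of hiding any difference in quoting that a test actually cares about.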

And none of this addresses stuff beyond the context of output file
comparison in R CMD check.

Any thoughts?

Ross Boylan


* From the R Extensions document, discussing the DESCRIPTION file:
   If the `DESCRIPTION' file is not entirely in ASCII it should contain
an `Encoding' field specifying an encoding.  This is currently used as
the encoding of the `DESCRIPTION' file itself, and may in the future be
taken as the encoding for other documentation in the package.  Only
encoding names `latin1', `latin2' and `UTF-8' are known to be
portable.

I would not expect that the test output files be considered
"documentation," but I suppose that's subject to interpretation.

Ross Boylan

2007-Jan-19 19:39 UTC


[Rd] encoding issues even w/o accents (background on single quotes)

On Wed, Jan 17, 2007 at 11:56:15PM -0800, Ross Boylan wrote:
> An earlier thread (in 10/2006) discussed encoding issues in the
> context of R data and the desire to represent accented characters.
> 
> It matters in another setting: the output generated by R and the
> seemingly order character "'" (single quote).  In particular, R CMD
            ^^^ should be "ordinary"
> check runs test code and compares the generated output to a saved file
> of expected output.  This does not work reliably across encoding
> schemes.  This is unfortunate, since it seems the "expected output"
> files will necessarily be wrong for someone.
> 
> The problem for me was triggered by the single-quote character "'".
> On my older systems, this is encoded by 0x27, a perfectly fine ASCII
> character.  That is on a Debian GNU/Linux system with LANG=en_US.  On
> a newer system I have LANG=en_US.UTF-8.  I don't recall whether
> this was a deliberate choice on my part, or simply reflects changing
> defaults for the installer.  (Note the earlier thread referred to the
> Debian-derived Ubuntu systems as having switched to UTF-8).  Under
> UTF-8 the same character is encoded in the 3-byte sequence 0xE28098
> (which seems odd; I thought the point of UTF-8 was that ASCII was a
> legitimate subset).
Apparently quoting, particularly with single quotes, is a can of worms:
http://www.cl.cam.ac.uk/~mgk25/ucs/quotes.html
When Unicode is available (which would be the case with UTF-8),
particular non-ASCII characters are recommended for single quoting.
The 3-byte sequence is the UTF-8 encoding of U+2018, the recommended
left single quote mark.

See http://en.wikipedia.org/wiki/UTF-8 on UTF-8 encoding.
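
To make the byte values concrete, here is a small Python sketch (Python is used only for illustration; the variable names are arbitrary). It shows that the ASCII apostrophe is still the single byte 0x27 under UTF-8, so ASCII does remain a subset, while U+2018 is a different character that takes three bytes.

```python
# ASCII apostrophe vs. the Unicode left single quotation mark.
apostrophe = "'"        # U+0027
left_quote = "\u2018"   # U+2018

# Encode both characters as UTF-8 and inspect the raw bytes.
apostrophe_bytes = apostrophe.encode("utf-8")
left_quote_bytes = left_quote.encode("utf-8")

print(apostrophe_bytes.hex())   # 27
print(left_quote_bytes.hex())   # e28098

# An ASCII character has the same byte representation in ASCII and UTF-8.
ascii_is_subset = apostrophe.encode("ascii") == apostrophe_bytes
print(ascii_is_subset)          # True
```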

This is more than I or, probably, you ever wanted to know about this
issue!

Ross
> 
> The coefficient printing methods in the stats package use the
> single-quote in the key explaining significance levels:
> Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
> 
> I suppose one possible work-around for R CMD check would be to set the
> encoding to  some standard value before it runs tests, but that has
> some drawbacks.  It doesn't work for packages needing a different
> encoding (but perhaps the package could specify an encoding to use by
> default?)(*),  It will leave the output files looking weird on systems
> with a different encoding.  It will get messed up if one generates the
> files under the wrong encoding.
> 
> And none of this addresses stuff beyond the context of output file
> comparison in R CMD check.
> 
> Any thoughts?
> 
> Ross Boylan
> 
> 
> * From the R Extensions document, discussing the DESCRIPTION file:
>    If the `DESCRIPTION' file is not entirely in ASCII it should contain
> an `Encoding' field specifying an encoding.  This is currently used as
> the encoding of the `DESCRIPTION' file itself, and may in the future be
> taken as the encoding for other documentation in the package.  Only
> encoding names `latin1', `latin2' and `UTF-8' are known to be
> portable.
> 
> I would not expect that the test output files be considered
> "documentation," but I suppose that's subject to interpretation.
