thr3ads.net - R devel - [Rd] Non-ASCII chars in R code [May 2006]

Prof Brian Ripley

2006-May-17 18:40 UTC

[Rd] Non-ASCII chars in R code

The report on R-help about problems loading package irr (apparently in a 
UTF-8 locale) prompted me to look a little deeper.  There are quite a few 
packages with Latin-1 chars in their .R files, and a couple in UTF-8.

Except where the non-ASCII chars occur only in comments, this is a problem 
as the code concerned cannot be represented in some locales R runs in (for 
example Japanese on Windows).  It happens that irr is so small that 
lazy-loading is not used, but when lazy-loading or a saved image is used, 
the locale in use when the package is installed determines how the code is 
parsed.  That locale may not be the same as the one in use when the package 
is loaded: indeed it is not uncommon on Linux/Unix systems for different 
users to use different locales.
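
To make the parsing problem concrete, here is a minimal sketch (not taken 
from any of the packages mentioned) of how the same bytes are read under two 
encodings.  The word and variable names are purely illustrative, and 
validUTF8() assumes a reasonably recent version of R.

```r
# "caf" followed by byte 0xE9, which is e-acute in Latin-1.
# Built from raw bytes so that this example itself stays pure ASCII.
latin1_bytes <- as.raw(c(0x63, 0x61, 0x66, 0xe9))
s <- rawToChar(latin1_bytes)

# Declared as Latin-1, the string converts cleanly to UTF-8:
# the single byte 0xE9 becomes the two bytes 0xC3 0xA9.
as_latin1 <- s
Encoding(as_latin1) <- "latin1"
utf8_bytes <- charToRaw(enc2utf8(as_latin1))

# Read as UTF-8, the very same bytes are not a valid string at all.
valid_as_utf8 <- validUTF8(s)
```

So a .R file written in Latin-1 contains byte sequences that are simply 
invalid when the parser assumes UTF-8, and vice versa.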

This means that using non-ASCII chars is not portable, and I've added code 
to R CMD check in R-devel to warn about such usage.  In the examples I 
have investigated the usages have been

- messages in a non-English language, typically French.
- startup messages with people's names.
- use of characters that I can only guess are intended to be in the
   WinAnsi encoding, e.g. a copyright symbol.
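
For package authors who want to find such usages themselves, the following 
is a rough sketch of a byte-level scan.  It is not the code added to 
R CMD check; non_ascii_lines() is a hypothetical helper written for this 
illustration, and the sample source lines are made up.

```r
# Hypothetical helper: return the indices of lines that contain any byte
# outside the ASCII range 0x00-0x7F.
non_ascii_lines <- function(lines) {
  has_high_byte <- vapply(
    lines,
    function(line) any(as.integer(charToRaw(line)) > 127L),
    logical(1),
    USE.NAMES = FALSE
  )
  which(has_high_byte)
}

# Made-up source lines; the second holds a string with the Latin-1 byte 0xE9.
src <- c(
  "f <- function(x) x + 1",
  rawToChar(as.raw(c(0x6d, 0x73, 0x67, 0x20, 0x3c, 0x2d, 0x20,
                     0x22, 0xe9, 0x22))),
  "# plain ASCII comment"
)
hits <- non_ascii_lines(src)
```

In practice the lines would come from readLines() on each file in the 
package's R directory.  The tools package also offers showNonASCII() for 
displaying the offending lines.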

The only reason I have not made this an error is that people might want to 
produce packages for a known locale, e.g. a student class, but perhaps it 
should be an error for packages submitted to CRAN.

I do not believe there is much we can do about this: messages which are 
not entirely in ASCII cannot be displayed on many R platforms and it seems 
incorrect to allow French messages and not Japanese ones.

The packages currently throwing warnings are

FactoMineR FunCluster JointGLM LoopAnalyst Sciviews ade4 adehabitat ape 
climatol crossdes deal grasper irr lsa mvrpart pastecs sn surveillance 
truncgof


-- 
Brian D. Ripley,                  ripley at stats.ox.ac.uk
Professor of Applied Statistics,  http://www.stats.ox.ac.uk/~ripley/
University of Oxford,             Tel:  +44 1865 272861 (self)
1 South Parks Road,                     +44 1865 272866 (PA)
Oxford OX1 3TG, UK                Fax:  +44 1865 272595

Prof Brian Ripley

2006-May-19 10:04 UTC

[Rd] Non-ASCII chars in R code

A little more digging revealed a Unix/Windows discrepancy here.

On Unix, saving images and preparing for lazyloading/lazydata is done with 
LC_ALL=C; on Windows it is done with LC_COLLATE=C.  I will change Windows 
to match.

Unfortunately how the C locale is implemented is OS-dependent.  Strictly 
it should not allow bytes 0x80 to 0xff but it does on some OSes (including 
Windows).  So the strict consequences of this should be that, when using 
lazy-loading or a saved image:

- all names have to be ASCII alphanumeric
- \uxxxx sequences are not allowed except \u007f and lower (they are not
   valid at all in a C locale prior to 2.3.1 so I would not expect to see
   them in a package).
- bytes in character strings are copied byte for byte.
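
As an illustration of the first point, here is a minimal sketch of a check 
that an object name is safe under the strict reading above.  
is_ascii_name() is a hypothetical helper, and allowing "." and "_" in 
addition to letters and digits is my assumption about what counts as an 
acceptable name.

```r
# Bytes permitted in a name under this sketch: ASCII letters and digits,
# plus "." and "_" (an assumption, since R names may contain them).
allowed <- as.integer(charToRaw(paste0(
  "abcdefghijklmnopqrstuvwxyz",
  "ABCDEFGHIJKLMNOPQRSTUVWXYZ",
  "0123456789._"
)))

# Hypothetical helper: TRUE for each name made up only of allowed bytes.
is_ascii_name <- function(names) {
  vapply(
    names,
    function(nm) all(as.integer(charToRaw(nm)) %in% allowed),
    logical(1),
    USE.NAMES = FALSE
  )
}

# The last name contains the Latin-1 byte 0xE9, built from raw bytes.
candidates <- c("irr", "my_fun.2", rawToChar(as.raw(c(0x72, 0xe9, 0x73))))
ok <- is_ascii_name(candidates)
```

This checks only the bytes used, not whether the name is otherwise 
syntactically valid (for example, it does not reject a leading digit).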

This leaves an inconsistency between packages which use lazy-loading / 
save image and those which do not.  We could resolve that by switching to 
the C locale when loading R code in packages (or, better, R code that was 
not a loader stub): I didn't think that would be worthwhile but in fact 5 
of the packages listed are small enough not to be lazy-loaded.

The other consequence is that the only way we allow packages to have 
object names which are not ASCII alphanumeric is to disable lazy loading.
One possibility is to allow a package to specify its required locale for 
loading in the DESCRIPTION file, and make use of that.

I am inclined to do nothing about these issues unless people have an 
actual need for packages tailored to a non-English locale.

-- 
Brian D. Ripley,                  ripley at stats.ox.ac.uk
Professor of Applied Statistics,  http://www.stats.ox.ac.uk/~ripley/
University of Oxford,             Tel:  +44 1865 272861 (self)
1 South Parks Road,                     +44 1865 272866 (PA)
Oxford OX1 3TG, UK                Fax:  +44 1865 272595
