Since R 2.5.0 it has been possible to declare the encodings of character strings (at the level of individual elements of a character vector). As a reminder, here is the announcement in NEWS:

    o   R now attempts to keep track of character strings which are
        known to be in Latin-1 or UTF-8 and print or plot them
        appropriately in other locales.  This is primarily intended to
        make it possible to use data in Western European languages in
        both Latin-1 and UTF-8 locales.  Currently scan(), read.table(),
        readLines(), parse() and source() allow encodings to be
        declared, and console input in suitable locales is also
        recognized.

        New function Encoding() can read or set the declared encodings
        for a character vector.

Whereas R itself is careful to make use of this, I see very little recognition of it in packages -- which need to be making use of translateChar() rather than CHAR(): see the 'Writing R Extensions' manual. (I see it used in only one package, and that mainly in a copy of base R code.)

This will become more important as time goes by and more ways are introduced to generate marked data. In particular, in R 2.7.0 under Windows 'Unicode' data (as used by NT-based versions of Windows, usually UCS-2 but possibly UTF-16) is translated to UTF-8 and marked as such.

In essence, every time you use CHAR() in a .Call/.External call in a package you should consider whether the data can be non-ASCII and, if so, how you want to handle it. The choices are

- to replace CHAR() by translateChar() and handle the string in the native encoding of the current locale. This needs the package to depend on 'R (>= 2.5.0)'.

- to note the declared encoding and handle the string in that encoding.

- to translate the string to UTF-8 and handle it in UTF-8. This will be easiest to do in R >= 2.7.0 using the function translateCharUTF8().
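
To make the third choice concrete, here is a minimal standalone sketch of what "translate to UTF-8" amounts to for a string marked as Latin-1. It is not R's implementation and does not use the R API: in a package, translateCharUTF8() would do this work. The names declared_enc, marked_string, latin1_to_utf8() and to_utf8() are hypothetical, and the sketch assumes that unmarked (native) strings are ASCII or already UTF-8, which is not true in general.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical stand-in for a string with a declared encoding.
   These types are illustrative only and are not part of the R API. */
typedef enum { ENC_NATIVE, ENC_LATIN1, ENC_UTF8 } declared_enc;

typedef struct {
    const char *bytes;   /* NUL-terminated bytes */
    declared_enc enc;    /* the declared encoding ("mark") */
} marked_string;

/* Re-encode NUL-terminated Latin-1 bytes as UTF-8.
   Returns a malloc'd string the caller must free, or NULL if
   allocation fails. */
char *latin1_to_utf8(const char *in)
{
    assert(in != NULL);
    size_t n = strlen(in);
    /* Each Latin-1 byte becomes at most two UTF-8 bytes. */
    char *out = malloc(2 * n + 1);
    if (out == NULL) return NULL;

    char *p = out;
    for (size_t i = 0; i < n; i++) {
        unsigned char c = (unsigned char) in[i];
        if (c < 0x80) {
            *p++ = (char) c;                    /* ASCII: unchanged */
        } else {
            *p++ = (char) (0xC0 | (c >> 6));    /* lead byte */
            *p++ = (char) (0x80 | (c & 0x3F));  /* continuation byte */
        }
    }
    *p = '\0';
    return out;
}

/* Return a UTF-8 copy of s, consulting the declared encoding.
   ASSUMPTION: ENC_NATIVE strings are ASCII or already UTF-8, so they
   are copied unchanged; real code would convert from the locale's
   encoding instead. */
char *to_utf8(marked_string s)
{
    assert(s.bytes != NULL);
    if (s.enc == ENC_LATIN1) return latin1_to_utf8(s.bytes);

    size_t n = strlen(s.bytes);
    char *out = malloc(n + 1);
    if (out != NULL) memcpy(out, s.bytes, n + 1);
    return out;
}
```

The point of the sketch is that the conversion depends on the declared encoding rather than on the bytes alone: the same bytes would be copied unchanged if they were marked as UTF-8. This is why taking the result of CHAR() without looking at the mark can give the wrong answer for non-ASCII data.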

For writers of graphics devices there is a further twist in R >= 2.7.0: currently text is passed to the graphics device in the native encoding, but by setting the DevDesc variable hasTextUTF8 to TRUE you can indicate to the graphics engine the ability to accept text in UTF-8. This is done in several of the standard devices: for example windows() was already re-encoding to UCS-2 for plotting, and postscript()/pdf() also re-encode to the selected single-byte encoding.

Character data passed to .C or .Fortran is automatically re-encoded to the current locale (for .C, from the encoding specified by ENCODING=, otherwise from the declared encoding if any).

-- 
Brian D. Ripley,                  ripley at stats.ox.ac.uk
Professor of Applied Statistics,  http://www.stats.ox.ac.uk/~ripley/
University of Oxford,             Tel:  +44 1865 272861 (self)
1 South Parks Road,                     +44 1865 272866 (PA)
Oxford OX1 3TG, UK                Fax:  +44 1865 272595