Daniel Bastos
2016-Jan-29 15:35 UTC
[R] on specifying an encoding for plot's main-argument
Here's how I plot a graph.

    plot(c(1,2,3), main = "graph ?")

The main-string has a UTF-8 character "?".  I believe I'm using the
windows device.  It opens up on my screen.  (The window says ``R
Graphics: Device 2 (ACTIVE)''.)  How can I tell it to use my encoding of
choice?

I looked around the web for explanations on how to properly tell the
relevant mechanisms that I'm using strings with a particular encoding
when plotting.  I saw many people with my difficulty, but no one seemed
to explain the whole issue.

At first I thought I should tell the device.  So I looked at the
documentation for various devices.  I realized only devices such as
postscript and pdf had an encoding parameter.  ``My assumptions must be
wrong'', I thought.  ``Perhaps it's not the device I must tell my
encoding.''

So I come to you.  Can you point me towards understanding the issue?
You can tell me to read an entire book on encoding, charset and fonts.
I'd like to free myself from such difficulties.

I use R and ESS (GNU EMACS).  (My ESS console says 'U' in the EMACS
modeline.  It means I'm encoding in UTF-8.  I tried '1', ISO-8859-1,
also called Latin-1.)

Thank you.

(*) The software

R version 3.2.3 (2015-12-10) -- "Wooden Christmas-Tree"
Copyright (C) 2015 The R Foundation for Statistical Computing
Platform: i386-w64-mingw32/i386 (32-bit)

ess-version: 15.09-2 [Released git: 01328e83039f]

GNU Emacs 24.3.1 (i386-mingw-nt6.1.7601) of 2013-03-17 on MARVIN
Duncan Murdoch
2016-Jan-30 01:57 UTC
[R] on specifying an encoding for plot's main-argument
On 29/01/2016 10:35 AM, Daniel Bastos wrote:
> Here's how I plot a graph.
>
>     plot(c(1,2,3), main = "graph ?")
>
> The main-string has a UTF-8 character "?".  I believe I'm using the
> windows device.  It opens up on my screen.  (The window says ``R
> Graphics: Device 2 (ACTIVE)''.)  How can I tell it to use my encoding of
> choice?

As far as I know that's impossible.  R uses the system encoding, and I
don't think any Windows versions use UTF-8 code pages.  They use UTF-16
for wide characters, and some 8 bit encoding for byte-sized characters.
R will use whatever 8 bit code page Windows chooses.

> I looked around the web for explanations on how to properly tell the
> relevant mechanisms that I'm using strings with a particular encoding when
> plotting.  I saw many with my difficulty, but no one seemed to explain
> the whole issue.

If you enter the string as a literal, it is not using UTF-8 encoding,
it's using the system's 8 bit encoding.

> At first I thought I should tell the device.  So I looked at the
> documentation for various devices.  I realized only devices such as
> postscript, pdf had an encoding parameter.  ``My assumptions must be
> wrong'', I thought.  ``Perhaps it's not the device I must tell my
> encoding.''
>
> Then I come to you.  Can you point me towards understanding the issue?
> You can tell me to read an entire book on encoding, charset and fonts.
> I'd like to free myself from such difficulties.
>
> I use R and ESS (GNU EMACS).  (My ESS console says 'U' in the EMACS
> modeline.  It means I'm encoding in UTF-8.  I tried '1', ISO-8859-1,
> also called Latin-1.)

Duncan Murdoch
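
For anyone who wants to check which encoding a session is really working with, something along these lines could be run at the R console. It is only a sketch using base R functions; the variable names and the choice of U+00E9 as a test character are illustrative and do not come from the messages above. The character is written as an escape so that the encoding of the script file itself does not matter.

```r
# What does the session consider its native encoding?
info <- l10n_info()                 # flags such as MBCS, UTF-8, Latin-1
print(info)
print(Sys.getlocale("LC_CTYPE"))    # locale name; on Windows it ends in a code page

# A title with one non-ASCII character (e with acute accent, U+00E9).
title <- "graph \u00e9"
print(Encoding(title))              # the encoding mark R attached to the string

# The same character is a different byte sequence in each encoding.
utf8_bytes   <- charToRaw(enc2utf8(title))                 # ends in c3 a9
latin1_bytes <- charToRaw(iconv(title, "UTF-8", "latin1")) # ends in e9
print(utf8_bytes)
print(latin1_bytes)
```

If the bytes of a title typed as a literal look like the UTF-8 sequence while the locale reports an 8 bit code page, the editor and R disagree about the encoding, which would explain a garbled title.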
Daniel Bastos
2016-Feb-01 19:56 UTC
[R] on specifying an encoding for plot's main-argument
Duncan Murdoch <murdoch.duncan at gmail.com> writes:

> On 29/01/2016 10:35 AM, Daniel Bastos wrote:
>> Here's how I plot a graph.
>>
>>     plot(c(1,2,3), main = "graph ?")
>>
>> The main-string has a UTF-8 character "?".  I believe I'm using the
>> windows device.  It opens up on my screen.  (The window says ``R
>> Graphics: Device 2 (ACTIVE)''.)  How can I tell it to use my encoding of
>> choice?
>
> As far as I know that's impossible.  R uses the system encoding, and I
> don't think any Windows versions use UTF-8 code pages.  They use
> UTF-16 for wide characters, and some 8 bit encoding for byte-sized
> characters.  R will use whatever 8 bit code page Windows chooses.

You seem to be correct.  Here's what Microsoft has to say.

  ``[...] UTF-16 [...] is the most common encoding of Unicode and the
  one used for native Unicode encoding on Windows operating
  systems.''[1]

They also claim that

  ``[w]hile Unicode-enabled functions in Windows use UTF-16, it is also
  possible to work with data encoded in UTF-8 or UTF-7, which are
  supported in Windows as multibyte character set code pages.''[1]

But I couldn't verify the claim.  The documentation of setlocale[2]
says the

  ``set of available locale names, languages, country/region codes, and
  code pages includes all those supported by the Windows NLS API except
  code pages that require more than two bytes per character, such as
  UTF-7 and UTF-8.  If you provide a code page value of UTF-7 or UTF-8,
  setlocale will fail, returning NULL.''[2]

That seems to be correct as per the following C code.

    const char *loc = setlocale(LC_ALL, "UTF-8");
    printf("locale: %s\n", loc ? loc : "(null)");

And [3] makes me think that _wsetlocale behaves the same way:

  ``_wsetlocale [...] is a wide-character version of setlocale; the
  arguments and return values of _wsetlocale are wide-character
  strings.''

The following program seems to confirm it.

    #include <locale.h>
    #include <stdio.h>
    #include <wchar.h>
    int main(void) {
      /* L"UTF-8": a wide literal, not a cast narrow string */
      const wchar_t *loc = _wsetlocale(LC_ALL, L"UTF-8");
      printf("locale: %ls\n", loc ? loc : L"(null)");
      return 0;
    }

[...]
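
A similar probe can be run from R itself, without compiling any C. This is a minimal sketch assuming only base R: Sys.setlocale() returns an empty string (and issues a warning) when the operating system refuses the request. The locale name "UTF-8" is simply the one tried above; which names a platform accepts is system-dependent, and the variable names are illustrative.

```r
# Remember the current setting so it can be restored afterwards.
old <- Sys.getlocale("LC_CTYPE")

# Ask for the locale; the warning is suppressed because the return
# value already tells us whether the request was honoured.
res <- suppressWarnings(Sys.setlocale("LC_CTYPE", "UTF-8"))

if (identical(res, "")) {
  message("request refused; LC_CTYPE is still: ", Sys.getlocale("LC_CTYPE"))
} else {
  message("LC_CTYPE is now: ", res)
  Sys.setlocale("LC_CTYPE", old)    # put the original setting back
}
```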
(*) A workaround

Since R comes with iconv(), the following might be a safe way to
translate UTF-8 into the current system locale, so that plot titles
display correctly on Windows systems.

    iconv("utf8-string", from="UTF-8",
          to=localeToCharset(Sys.getlocale("LC_CTYPE")))

(*) References

[1] MSDN Unicode
https://msdn.microsoft.com/en-us/library/windows/desktop/dd374081(v=vs.85).aspx

[2] MSDN setlocale
https://msdn.microsoft.com/en-us/library/x99tb11d.aspx

[3] MSDN Locales and Code Pages
https://msdn.microsoft.com/en-us/library/8w60z792.aspx
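
The iconv() workaround could be wrapped in a small helper. This is a sketch only: utf8_to_native() is a hypothetical name, not an existing R function. It also makes two assumptions that differ from the call shown in the workaround. First, it passes to = "", which iconv() treats as the encoding of the current locale; this avoids relying on localeToCharset() returning exactly one usable name. Second, it sets sub = "?" so that characters the native code page cannot represent are replaced rather than turning the whole title into NA.

```r
# Hypothetical helper: treat the bytes of x as UTF-8 and convert them
# to the session's native encoding.
utf8_to_native <- function(x) {
  iconv(x, from = "UTF-8", to = "", sub = "?")
}

# Title with U+00E9, written as an escape so it is stored as UTF-8.
title <- utf8_to_native("graph \u00e9")

# Only draw when a screen device is likely to be available.
if (interactive()) {
  plot(c(1, 2, 3), main = title)
}
```

Whether the title then displays correctly still depends on the native code page containing the character at all; a character outside it would show up as the substitution mark.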