Jason Wood
2013-Mar-20 19:32 UTC
[Rd] Character Encoding: Why are valid Windows-1252 characters encoded as invalid ISO-8859-1 characters?
Something that looks like a bug to me, but as there may be a documented reason I have missed, I wanted to ask about it here first. Please let me know if this looks like something I should submit as a bug or, if not, why this behavior is intended.

Using RGui v2.15.3, 64-bit, on a Windows 7 machine with US English locale.

You can see the behavior I describe in the following:

--------------------
> Sys.getlocale("LC_CTYPE") # my default encoding is Windows code page 1252
[1] "English_United States.1252"
> localeToCharset() # R thinks the best character set to use is ISO8859-1, a subset of windows-1252
[1] "ISO8859-1"
> x<-"\x92" # I create a 'right quote' character, using a value valid in windows-1252 but NOT VALID in ISO8859-1
> Encoding(x) # R has chosen to encode it as 'latin1', which seems to be a synonym for ISO8859-1
[1] "latin1"
> x # Even though the character is invalid in latin1, it renders as if it were the valid windows-1252 character
[1] "’"
> enc2utf8(x) # Encoding as UTF-8 gives us, not a valid UTF-8 'right quote' (\u2019), but the Unicode control character 'PRIVATE USE TWO'
[1] "\u0092"
> enc2native(enc2utf8(x)) # Moving the UTF-8 back to the native encoding correctly shows that it can't render the 'PRIVATE USE TWO' character in windows-1252
[1] "<U+0092>"
--------------------

I think the problem occurs when R decides that the valid 1252 character should be represented by default in a 'latin1' (ISO8859-1) encoded string rather than the native 'windows-1252'.

Note that if we force the encoding to stay native, everything works fine:

--------------------
> Encoding(x)<-"unknown" # Force the encoding to the native 1252
> enc2utf8(x) # Encoding as UTF-8 now gives us the valid UTF-8 'right quote' character
[1] "’"
> enc2native(enc2utf8(x)) # and going back to the native encoding works exactly as it should
[1] "’"
--------------------
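For anyone wanting to reproduce the distinction without depending on the session locale, the following is a minimal sketch that names the source encoding explicitly with iconv(). It is not part of the original report: the variable names are illustrative, and it assumes the platform's iconv recognises the names "CP1252" and "ISO-8859-1" (check iconvlist() if unsure).

```r
# Build the single byte 0x92 without relying on how the parser marks "\x92".
x <- rawToChar(as.raw(0x92))

# Interpret the byte as Windows-1252: 0x92 is the right single quotation mark.
as_cp1252 <- iconv(x, from = "CP1252", to = "UTF-8")

# Interpret the same byte as ISO-8859-1: 0x92 falls in the C1 control range.
as_latin1 <- iconv(x, from = "ISO-8859-1", to = "UTF-8")

# Compare code points rather than printed glyphs, which depend on the console.
cp_cp1252 <- utf8ToInt(as_cp1252)   # 8217, i.e. U+2019
cp_latin1 <- utf8ToInt(as_latin1)

print(sprintf("CP1252:     U+%04X", cp_cp1252))
print(sprintf("ISO-8859-1: U+%04X", cp_latin1))
```

Comparing code points sidesteps the rendering question raised above, where the console shows a right quote even though the string is declared as latin1.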