thr3ads.net - R devel - [Rd] Native characterset is wrong for unicode builds for Windows [Feb 2015]

maillist at tlink.de

2015-Feb-26 20:09 UTC

[Rd] Native characterset is wrong for unicode builds for Windows

When I send some outlandish characters through enc2native (or format) in 
R 3.1.2 on Ubuntu trusty it works quite well:

 > "?????"
[1] "?????"
 > enc2native("?????")
[1] "?????"
 > Encoding(enc2native("?????"))
[1] "UTF-8"

In Windows the result is different:

 > "?????"
[1] "?????"
 > enc2native("?????")
[1] "??<U+0394><U+040A><U+05EA>"
 > Encoding(enc2native("?????"))
[1] "latin1"

And this is wrong. The native character set of a unicode application 
under Windows is *Unicode*. enc2native should do the same under Windows 
as it does on Ubuntu. Also the "unknown" encoding should be changed to
mean the same as "UTF-8" exactly as it is on Linux.

Duncan Murdoch

2015-Feb-26 22:22 UTC

On 26/02/2015 3:09 PM, maillist at tlink.de wrote:
> When I send some outlandish characters through enc2native (or format) in 
> R 3.1.2 on Ubuntu trusty it works quite well:
> 
>  > "?????"
> [1] "?????"
>  > enc2native("?????")
> [1] "?????"
>  > Encoding(enc2native("?????"))
> [1] "UTF-8"
> 
> In Windows the result is different:
> 
>  > "?????"
> [1] "?????"
>  > enc2native("?????")
> [1] "??<U+0394><U+040A><U+05EA>"
>  > Encoding(enc2native("?????"))
> [1] "latin1"
> 
> And this is wrong. The native character set of a unicode application 
> under Windows is *Unicode*. enc2native should do the same under Windows 
> as it does on Ubuntu. Also the "unknown" encoding should be changed to
> mean the same as "UTF-8" exactly as it is on Linux.

What is a "unicode application", and what makes you think R is one?  R
is being told by Windows that your native encoding is latin1.  Perhaps
Windows 8 supports UTF-8 as a native encoding (I've never used it), but
previous versions of Windows didn't.

Duncan Murdoch
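
A minimal sketch of how one might check what R itself considers the native encoding in a given session. It uses only base R functions (`l10n_info()`, `Sys.getlocale()`, `enc2native()`, `Encoding()`). The test string is an assumption: it contains only the three code points that are visible in the Windows output quoted above, written as `\u` escapes so the script does not depend on the encoding of the file it is saved in. No output is shown, because it depends on the platform, the locale and the R version.

```r
# Three of the characters from the report, given by code point
# (U+0394, U+040A, U+05EA); the first two characters of the original
# string are not recoverable from the archive and are left out.
x <- "\u0394\u040A\u05EA"

# What R believes about the current locale.
info <- l10n_info()
info[["UTF-8"]]             # TRUE in a UTF-8 locale
info[["Latin-1"]]           # TRUE in a Latin-1 locale
Sys.getlocale("LC_CTYPE")   # the locale string reported by the OS

# Declared encoding before and after conversion to the native encoding.
Encoding(x)
y <- enc2native(x)
Encoding(y)
print(y)
```

Running the same script on both systems gives a side-by-side view of the locale R was handed and of what `enc2native()` does with it.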

Winston Chang

2015-Feb-26 22:44 UTC

On Thu, Feb 26, 2015 at 2:09 PM, maillist at tlink.de <maillist at tlink.de> wrote:
>
> When I send some outlandish characters through enc2native (or format) in R
> 3.1.2 on Ubuntu trusty it works quite well:
>
> > "?????"
> [1] "?????"
> > enc2native("?????")
> [1] "?????"
> > Encoding(enc2native("?????"))
> [1] "UTF-8"
>
> In Windows the result is different:
>
> > "?????"
> [1] "?????"
> > enc2native("?????")
> [1] "??<U+0394><U+040A><U+05EA>"
> > Encoding(enc2native("?????"))
> [1] "latin1"
>
> And this is wrong. The native character set of a unicode application under
> Windows is *Unicode*. enc2native should do the same under Windows as it
> does on Ubuntu. Also the "unknown" encoding should be changed to mean the
> same as "UTF-8" exactly as it is on Linux.
>
I think you're mixing up the term "character set" with the encoding for a
character set. Unicode is a character set. UTF-8 is one of many encodings
of Unicode.

UTF-8 may be the native character encoding in Ubuntu, but it's not the
native encoding in Windows. According to this, what counts as the native
encoding in Windows depends on the code page:
  http://stackoverflow.com/a/4649507

So you shouldn't expect enc2native to do the same thing on Linux and
Windows. If you really want UTF-8, you can use enc2utf8.

-Winston
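
The distinction between a character set and its encodings can be illustrated with a short base R sketch. It takes one character (U+0394, chosen here only because it appears in the output quoted above) and shows its code point next to two different byte encodings of that same code point. The byte values in the comments follow directly from the UTF-8 and UTF-16 definitions.

```r
x <- "\u0394"   # GREEK CAPITAL LETTER DELTA

# The code point: a number in the Unicode character set.
utf8ToInt(x)    # 916, i.e. 0x0394

# The same code point as UTF-8 bytes.
charToRaw(enc2utf8(x))                              # ce 94

# The same code point as UTF-16 (little endian) bytes.
iconv(x, "UTF-8", "UTF-16LE", toRaw = TRUE)[[1]]    # 94 03
```

The code point is the same in both cases; only the byte sequence that represents it differs.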

maillist at tlink.de

2015-Feb-26 23:34 UTC

> On 26/02/2015 3:09 PM, maillist at tlink.de wrote:
>> When I send some outlandish characters through enc2native (or format) in
>> R 3.1.2 on Ubuntu trusty it works quite well:
>>
>>   > "?????"
>> [1] "?????"
>>   > enc2native("?????")
>> [1] "?????"
>>   > Encoding(enc2native("?????"))
>> [1] "UTF-8"
>>
>> In Windows the result is different:
>>
>>   > "?????"
>> [1] "?????"
>>   > enc2native("?????")
>> [1] "??<U+0394><U+040A><U+05EA>"
>>   > Encoding(enc2native("?????"))
>> [1] "latin1"
>>
>> And this is wrong. The native character set of a unicode application
>> under Windows is *Unicode*. enc2native should do the same under Windows
>> as it does on Ubuntu. Also the "unknown" encoding should be changed to
>> mean the same as "UTF-8" exactly as it is on Linux.
> What is a "unicode application", and what makes you think R is one?  R
> is being told by Windows that your native encoding is latin1.  Perhaps
> Windows 8 supports UTF-8 as a native encoding (I've never used it), but
> previous versions of Windows didn't.
>
> Duncan Murdoch
>

A Unicode application is a program that uses the Unicode API of Windows,
that is, the functions whose names end in W. For such an application the
system code page (native encoding) is completely irrelevant. The system
code page is just a compatibility feature that enables Windows
NT/Vista/7/8 to run applications that were developed for Windows 95,
which didn't have Unicode support. But that line of operating systems has
been dead for 10 years now. R obviously is a Unicode application because
it can print, or read from the clipboard, characters like "???" that are
not in my system code page, which is not possible through the legacy API.

Neither the Unicode API nor the legacy API accepts UTF-8. The legacy API
needs strings encoded according to the active code page, and the Unicode
API wants UTF-16. If you have UTF-8, you need to convert it either to the
active code page, which will lose all characters that are not covered by
it, or to UTF-16 and then use the Unicode functions. But this is not the
problem; the Windows interface functions of R are working quite nicely
with Unicode already.
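
The two conversion routes described above can be sketched in base R with `iconv()`. This only illustrates the encodings involved and is not a description of how R's Windows interface code is implemented. The test string (three code points taken from the quoted output) and the use of "latin1" as a stand-in for the active code page are assumptions for the example.

```r
x <- "\u0394\u040A\u05EA"

# Legacy route: convert to a single-byte code page. Characters that the
# code page cannot represent have to be substituted, so information is lost.
lossy <- iconv(x, from = "UTF-8", to = "latin1", sub = "?")
print(lossy)

# Unicode route: convert to UTF-16LE, which can represent every character.
utf16 <- iconv(x, from = "UTF-8", to = "UTF-16LE", toRaw = TRUE)[[1]]
print(utf16)

# Converting the UTF-16 bytes back lets one check that nothing was lost.
back <- iconv(list(utf16), from = "UTF-16LE", to = "UTF-8")
back == x
```

Comparing `lossy` with `back` shows the practical difference between the two routes for characters outside the code page.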

maillist at tlink.de

2015-Feb-26 23:55 UTC

On 26.02.2015 at 23:44, Winston Chang wrote:
> On Thu, Feb 26, 2015 at 2:09 PM, maillist at tlink.de <maillist at tlink.de> wrote:
>
>
>     When I send some outlandish characters through enc2native (or
>     format) in R 3.1.2 on Ubuntu trusty it works quite well:
>
>     > "?????"
>     [1] "?????"
>     > enc2native("?????")
>     [1] "?????"
>     > Encoding(enc2native("?????"))
>     [1] "UTF-8"
>
>     In Windows the result is different:
>
>     > "?????"
>     [1] "?????"
>     > enc2native("?????")
>     [1] "??<U+0394><U+040A><U+05EA>"
>     > Encoding(enc2native("?????"))
>     [1] "latin1"
>
>     And this is wrong. The native character set of a unicode
>     application under Windows is *Unicode*. enc2native should do the
>     same under Windows as it does on Ubuntu. Also the "unknown"
>     encoding should be changed to mean the same as "UTF-8" exactly as
>     it is on Linux.
>
>
> I think you're mixing up the term "character set" with the encoding
> for a character set. Unicode is a character set. UTF-8 is one of many 
> encodings of Unicode.
>
> UTF-8 may be the native character encoding in Ubuntu, but it's not the 
> native encoding in Windows. According to this, what counts as the 
> native encoding in Windows depends on the code page:
> http://stackoverflow.com/a/4649507
>
> So you shouldn't expect enc2native to do the same thing on Linux and 
> Windows. If you really want UTF-8, you can use enc2utf8.
>
> -Winston

Maybe I'm expecting too much, but I would rather R did not produce garbage
like "??<U+0394><U+040A><U+05EA>". And while I can use enc2utf8 to
convert from UTF-8 to UTF-8, this does not fix the many places, like
"format", where enc2native is used and that are broken because of this.


