thr3ads.net - R devel - [Rd] Native characterset is wrong for unicode builds for Windows [Feb 2015]

maillist at tlink.de

2015-Feb-27 07:31 UTC

[Rd] Native characterset is wrong for unicode builds for Windows

On 27.02.2015 at 03:13, Duncan Murdoch wrote:
> On 26/02/2015 6:34 PM, maillist at tlink.de wrote:
>>> On 26/02/2015 3:09 PM, maillist at tlink.de wrote:
>>>> When I send some outlandish characters through enc2native (or format) in
>>>> R 3.1.2 on Ubuntu trusty it works quite well:
>>>>
>>>>    > "?????"
>>>> [1] "?????"
>>>>    > enc2native("?????")
>>>> [1] "?????"
>>>>    > Encoding(enc2native("?????"))
>>>> [1] "UTF-8"
>>>>
>>>> In Windows the result is different:
>>>>
>>>>    > "?????"
>>>> [1] "?????"
>>>>    > enc2native("?????")
>>>> [1] "??<U+0394><U+040A><U+05EA>"
>>>>    > Encoding(enc2native("?????"))
>>>> [1] "latin1"
>>>>
>>>> And this is wrong. The native character set of a unicode application
>>>> under Windows is *Unicode*. enc2native should do the same under Windows
>>>> as it does on Ubuntu. Also the "unknown" encoding should be changed to
>>>> mean the same as "UTF-8" exactly as it is on Linux.
>>> What is a "unicode application", and what makes you think R is one?  R
>>> is being told by Windows that your native encoding is latin1.  Perhaps
>>> Windows 8 supports UTF-8 as a native encoding (I've never used it), but
>>> previous versions of Windows didn't.
>>>
>>> Duncan Murdoch
>>>
>> A unicode application is a program that uses the unicode API of Windows
> R uses those functions, so I guess it is a "unicode application".  But
> internally it uses an 8 bit encoding (normally the native one for the
> platform it is running on, which in your case is apparently latin1).
>
>> - the functions with the ending W. For such an application the system
>> code page (native encoding) is completely irrelevant. The system code
>> page is just a compatibility feature that enables Windows NT/Vista/7/8
>> to run applications that were developed for Windows 95 which didn't have
>> unicode support.
> Windows 95 had UCS-2 support, which was pretty close to UTF-16.
>
>> But this line of operating systems is dead for 10 years
>> now. R obviously is a unicode application because it can print - or read
>> from the clipboard - characters like "???" that are not in my system
>> code page which is not possible over the legacy API.
> So "unicode application" is something you just made up.
>
> If you use Windows development tools, they have macros to convert
> generic functions to either A or W versions.  R doesn't use those.  It
> calls the W functions when it has UTF-16 characters, and A functions
> when it has native characters.  I would love it if R was a UTF-8
> application, because it would make life so much simpler, but Windows
> doesn't support that.  So R needs to do tons of conversions.  If you
> don't like that, you probably need to stick with Ubuntu.
>
> Duncan Murdoch
>
I am not complaining about those conversions. They work just fine
already. I am complaining about enc2native breaking things in the
Windows builds. An assignment like

s <- format("?????")

has no interaction with Windows at all, yet "s" contains garbage like
"??<U+0394><U+040A><U+05EA>" after that. And if a native encoding of
UTF-8 - as defined by enc2native - works in Ubuntu, why shouldn't it
work in Windows?

Duncan Murdoch

2015-Feb-27 10:49 UTC


[Rd] Native characterset is wrong for unicode builds for Windows

On 27/02/2015 2:31 AM, maillist at tlink.de wrote:
> I am not complaining about those conversions. They work just fine 
> already. I am complaining about
> enc2native breaking things in the windows builds. An assignment like
> 
> s <- format("?????")
> 
> has no interaction with windows at all yet "s" contains garbage like
> "??<U+0394><U+040A><U+05EA>"
> after that. And if a native encoding of UTF-8 - as defined by enc2native 
> - works in Ubuntu why shouldn't it work
> in Windows?
Because in Ubuntu, UTF-8 is the native encoding, and in your Windows
system, latin1 is the native encoding.

But I do agree that the format() issue is a problem.  I haven't traced
through the code, but I think the string "?????" is read using Windows
API functions that return a UTF-16 result, then converted by R to UTF-8.
So format() should see that it is a UTF-8 string and not convert it to
the native encoding.  There is nothing wrong with enc2native(); it's
doing what you asked for.  The problem is that format() is using it.

Duncan Murdoch
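
The distinction drawn above (a string that is already marked as UTF-8 versus one that has been converted to the native encoding) can be inspected from R itself. The following is a minimal sketch, not a reproduction of the original session: the characters typed in the thread were lost in the archive, so the three code points are taken from the `<U+0394><U+040A><U+05EA>` output quoted earlier and written as `\u` escapes. What `enc2native()` returns depends on the session's locale, so no output is shown for it.

```r
# Greek Delta, Cyrillic Nje and Hebrew Tav, written as \u escapes so the
# snippet itself is plain ASCII (an assumption: the original input is lost).
x <- "\u0394\u040A\u05EA"

# Strings built from \u escapes are marked as UTF-8.
Encoding(x)

# l10n_info() reports whether the session's native encoding is UTF-8.
native_is_utf8 <- isTRUE(l10n_info()[["UTF-8"]])

# enc2native() converts to the native encoding, whatever that is.
# In a non-UTF-8 session, characters outside the code page cannot survive.
y <- enc2native(x)
Encoding(y)

# enc2utf8() leaves a string that is already UTF-8 untouched.
z <- enc2utf8(x)
identical(z, x)
```

Comparing `Encoding(y)` with `native_is_utf8` on an Ubuntu and a Windows session should show the difference the thread is arguing about.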

maillist at tlink.de

2015-Feb-27 20:01 UTC


[Rd] Native characterset is wrong for unicode builds for Windows

On 27.02.2015 at 11:49, Duncan Murdoch wrote:
> On 27/02/2015 2:31 AM, maillist at tlink.de wrote:
>> I am not complaining about those conversions. They work just fine
>> already. I am complaining about
>> enc2native breaking things in the windows builds. An assignment like
>>
>> s <- format("?????")
>>
>> has no interaction with windows at all yet "s" contains garbage like
>> "??<U+0394><U+040A><U+05EA>"
>> after that. And if a native encoding of UTF-8 - as defined by enc2native
>> - works in Ubuntu why shouldn't it work
>> in Windows?
> Because in Ubuntu, UTF-8 is the native encoding, and in your Windows
> system, latin1 is the native encoding.
>
> But I do agree that the format() issue is a problem.  I haven't traced
> through the code, but I think the string "?????" is read using Windows
> API functions that return a UTF-16 result, then converted by R to UTF-8.
> So format() should see that it is a UTF-8 string and not convert it to
> the native encoding.  There is nothing wrong with enc2native(), it's
> doing what you asked for.  The problem is that format() is using it.
>
> Duncan Murdoch
I would expect every function that uses enc2native to be broken in this
respect, because it will invariably scramble most Unicode characters in
the process, and I can't think of a case where that would actually be
wanted.
Functions that really need something other than UTF-8 are probably using
iconv and getOption("encoding") anyway, as this allows the desired
encoding to be specified much more flexibly.
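
As a rough illustration of the explicit-conversion approach mentioned above, the sketch below uses `iconv()` with named source and target encodings instead of relying on the native one. The input string is again an assumption (the three code points visible in the quoted `<U+....>` output), and the variable names are made up for the example. The UTF-16LE round trip mirrors the UTF-16 to UTF-8 path described earlier in the thread; the latin1 conversion shows where information is lost.

```r
# Same three code points as in the thread's <U+....> output (assumed input).
x <- "\u0394\u040A\u05EA"

# UTF-8 -> UTF-16LE as raw bytes, the representation used by the Windows
# "W" functions. toRaw = TRUE returns a list with one raw vector per string.
utf16 <- iconv(x, from = "UTF-8", to = "UTF-16LE", toRaw = TRUE)[[1]]
length(utf16)

# UTF-16LE -> UTF-8: iconv() also accepts a list of raw vectors as input.
back <- iconv(list(utf16), from = "UTF-16LE", to = "UTF-8")
identical(back, x)

# UTF-8 -> latin1: none of these characters exist in latin1. With
# sub = "byte" the bytes that cannot be converted are shown as hex escapes
# rather than the whole result becoming NA.
lat <- iconv(x, from = "UTF-8", to = "latin1", sub = "byte")
lat
```

Naming both encodings makes the lossy step visible at the call site, which is the flexibility the paragraph above refers to.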
