thr3ads.net - R devel - [Rd] \U with more than 4 digits returns the wrong character [Dec 2014]


Richard Cotton

2014-Dec-04 19:00 UTC

[Rd] \U with more than 4 digits returns the wrong character

If I type a character using \U syntax that has more than 4 digits, I
get the wrong character.  For example,

"\U1d4d0"

should print a mathematical bold script capital A.  See
http://www.fileformat.info/info/unicode/char/1d4d0/index.htm

On my machine, it prints the Hangul character corresponding to

"\Ud4d0"
http://www.fileformat.info/info/unicode/char/d4d0/index.htm

It seems that the hex-digit part is overflowing at 16^4.

I tested this on R3.1.2 and devel (2014-12-03 r67101) x64 under
Windows.  I played around with Sys.setlocale and options("encoding"),
but couldn't get the expected value.

Can others reproduce this?  It feels like a bug, but experience tells
me I probably have something silly going on with my setup.

-- 
Regards,
Richie
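
[Editor's note] The "overflowing at 16^4" observation can be checked with a little arithmetic: 16^4 is 0x10000, so keeping only the low 16 bits of 0x1d4d0 leaves 0xd4d0, the Hangul code point reported above. The C sketch below only illustrates that arithmetic; truncate_to_16_bits is a hypothetical helper and is not taken from R's source.

```c
#include <stdint.h>

/* Hypothetical helper (not from R's source): keep only the low 16 bits
   of a code point, which is what happens when a value above U+FFFF is
   stored in a 16-bit integer type. */
uint16_t truncate_to_16_bits(uint32_t code_point)
{
    return (uint16_t)(code_point & 0xFFFFu);
}
```

Under this assumption, truncate_to_16_bits(0x1D4D0) yields 0xD4D0, and any code point up to U+FFFF is returned unchanged.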

Mark van der Loo

2014-Dec-04 19:24 UTC


Richie,

The R language definition [1] says (10.3.1):

\Unnnnnnnn \U{nnnnnnnn}
(where multibyte locales are supported and not on Windows, otherwise
an error). Unicode character with given hex code: sequences of up to
eight hex digits.


Best,
Mark

[1] http://cran.r-project.org/doc/manuals/r-release/R-lang.html
http://www.markvanderloo.eu
-------------------------------------------------------------------
If you cannot quantify it,
you don't know what you're talking about



Richard Cotton

2014-Dec-04 19:37 UTC


Great spot, thanks Mark.

This really ought to appear somewhere in the ?Quotes help page.

Having a warning under Windows might be nicer behaviour than silently
returning the wrong value too.



-- 
Regards,
Richie

Learning R
4dpiecharts.com

Duncan Murdoch

2014-Dec-04 20:34 UTC


I see this on Windows, but not on OSX.  On Windows:

> as.hexmode(utf8ToInt("\U1d4d0"))
[1] "d4d0"

On OSX:

> as.hexmode(utf8ToInt("\U1d4d0"))
[1] "1d4d0"

I'll see if I can find where the truncation is happening on Windows.

Duncan Murdoch

Duncan Murdoch

2014-Dec-04 21:21 UTC


The issue is that on Windows, the wchar_t in our C code is a 16-bit
value.  In the old days, Windows only supported 16-bit characters.
Since Windows 2000 they've supported the full Unicode range (which I
think is currently 20 or 21 bits) using the UTF-16 encoding, but our
internal code is still assuming a 16-bit limit.

I'll submit the bug report on this and hopefully will get to it before
the next release, but it's tricky to catch all the possible places where
the upper bits get lost, given the dumb Windows convention that wchar_t
is 16 bits.

Duncan Murdoch
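
[Editor's note] For readers unfamiliar with UTF-16: code points above U+FFFF do not fit in one 16-bit unit and are instead stored as a surrogate pair of two units. The C sketch below shows the standard UTF-16 encoding rule under that assumption; utf16_encode is a hypothetical name chosen for illustration and is not the code R uses or the fix that was eventually applied.

```c
#include <stdint.h>

/* Hypothetical helper: encode one Unicode code point as UTF-16.
   Writes one or two 16-bit units to out and returns how many were
   written, or 0 if cp is not a valid Unicode scalar value. */
int utf16_encode(uint32_t cp, uint16_t out[2])
{
    /* Reject values beyond U+10FFFF and the surrogate range itself. */
    if (cp > 0x10FFFFu || (cp >= 0xD800u && cp <= 0xDFFFu))
        return 0;

    /* Basic Multilingual Plane: a single unit holds the code point. */
    if (cp < 0x10000u) {
        out[0] = (uint16_t)cp;
        return 1;
    }

    /* Supplementary planes: split the 20-bit offset into two halves. */
    cp -= 0x10000u;
    out[0] = (uint16_t)(0xD800u + (cp >> 10));    /* high surrogate */
    out[1] = (uint16_t)(0xDC00u + (cp & 0x3FFu)); /* low surrogate  */
    return 2;
}
```

For U+1D4D0 this produces the pair 0xD835 0xDCD0, whereas U+D4D0 fits in a single unit. Code that keeps only one 16-bit unit per character therefore cannot represent the first value.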
