Richard Cotton
2014-Dec-04 19:00 UTC
[Rd] \U with more than 4 digits returns the wrong character
If I type a character using \U syntax that has more than 4 digits, I get the wrong character. For example,

    "\U1d4d0"

should print a mathematical bold script capital A. See http://www.fileformat.info/info/unicode/char/1d4d0/index.htm

On my machine, it prints the Hangul character corresponding to

    "\Ud4d0"

http://www.fileformat.info/info/unicode/char/d4d0/index.htm

It seems that the hex-digit part is overflowing at 16^4.

I tested this on R 3.1.2 and devel (2014-12-03 r67101) x64 under Windows. I played around with Sys.setlocale and options("encoding"), but couldn't get the expected value.

Can others reproduce this? It feels like a bug, but experience tells me I probably have something silly going on with my setup.

--
Regards,
Richie
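The overflow hypothesis can be checked with a little arithmetic. The following is only a sketch of the suspected effect, not R's actual parsing code: it assumes the code point is cut down to its low 16 bits, and the variable names are illustrative.

```r
# Assumption: the parser keeps only the low 16 bits of the code point.
code_point <- 0x1d4d0L                      # the character that was requested
truncated  <- bitwAnd(code_point, 0xFFFFL)  # drop everything above 16 bits

# Hex form of what survives the truncation.
truncated_hex <- format(as.hexmode(truncated))
print(truncated_hex)

# The digit that was lost is the multiple of 16^4.
lost_digit <- code_point %/% 16^4
print(lost_digit)
```

Under that assumption the surviving value is d4d0, which matches the Hangul character reported above.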
Mark van der Loo
2014-Dec-04 19:24 UTC
Richie,

The R language definition [1] says (10.3.1):

    \Unnnnnnnn \U{nnnnnnnn}
    (where multibyte locales are supported and not on Windows, otherwise an error). Unicode character with given hex code - sequences of up to eight hex digits.

Best,
Mark

[1] http://cran.r-project.org/doc/manuals/r-release/R-lang.html

http://www.markvanderloo.eu
If you cannot quantify it, you don't know what you're talking about
Richard Cotton
2014-Dec-04 19:37 UTC
Great spot, thanks Mark. This really ought to appear somewhere in the ?Quotes help page. A warning under Windows would also be nicer behaviour than silently returning the wrong value.

--
Regards,
Richie

Learning R
4dpiecharts.com
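As a rough illustration of the warning suggested here, the check could look something like the sketch below. The function warn_if_truncated and its os_type argument are hypothetical and not part of R; the real fix would have to live in the parser.

```r
# Hypothetical helper (not part of R): warn when a code point needs more than
# 16 bits on a platform assumed to truncate it.
warn_if_truncated <- function(code_point, os_type = .Platform$OS.type) {
  if (os_type == "windows" && code_point > 0xFFFF) {
    warning(sprintf(
      "code point U+%X needs more than 16 bits and may be truncated",
      as.integer(code_point)
    ))
  }
  invisible(code_point)
}

# Simulate the Windows case regardless of the current platform.
msg <- tryCatch(
  warn_if_truncated(0x1d4d0, os_type = "windows"),
  warning = function(w) conditionMessage(w)
)
print(msg)
```

Passing os_type explicitly is only there so the behaviour can be tried on any platform.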
Duncan Murdoch
2014-Dec-04 20:34 UTC
I see this on Windows, but not on OSX.

On Windows:

    > as.hexmode(utf8ToInt("\U1d4d0"))
    [1] "d4d0"

On OSX:

    > as.hexmode(utf8ToInt("\U1d4d0"))
    [1] "1d4d0"

I'll see if I can find where the truncation is happening on Windows.

Duncan Murdoch
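The same comparison can be written as a single check that runs unchanged on either platform. This is a minimal sketch; the variable names are illustrative, and the result depends on the R build it is run on.

```r
# Code point the escape is supposed to produce.
expected <- 0x1d4d0L

# What the parser actually produced for the five-digit escape.
parsed <- utf8ToInt("\U1d4d0")

# TRUE when the escape survived intact, FALSE when the upper bits were lost.
escape_ok <- identical(parsed, expected)
print(escape_ok)
print(format(as.hexmode(parsed)))
```

On a build showing the behaviour reported above, the second line would print the truncated hex value rather than 1d4d0.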
Duncan Murdoch
2014-Dec-04 21:21 UTC
The issue is that on Windows, the wchar_t in our C code is a 16-bit value. In the old days, Windows only supported 16-bit characters. Since Windows 2000 they've supported the full Unicode range (which I think is currently 20 or 21 bits) using the UTF-16 encoding, but our internal code is still assuming a 16-bit limit.

I'll submit the bug report on this and hopefully will get to it before the next release, but it's tricky to catch all the possible places where the upper bits get lost, given the dumb Windows convention that wchar_t is 16 bits.

Duncan Murdoch
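For context, UTF-16 represents a code point above 0xFFFF as a pair of 16-bit units (a surrogate pair), so code that handles one 16-bit unit per character cannot hold such a code point. The sketch below applies the standard UTF-16 arithmetic to the character from this thread. The function name utf16_surrogates is illustrative and is not something R provides.

```r
# Illustrative helper (not part of R): split a code point above 0xFFFF into
# its UTF-16 high and low surrogates.
utf16_surrogates <- function(code_point) {
  code_point <- as.integer(code_point)
  stopifnot(code_point > 0xFFFFL, code_point <= 0x10FFFFL)
  offset <- code_point - 0x10000L      # 20-bit offset into the upper planes
  high   <- 0xD800L + offset %/% 1024L # top 10 bits of the offset
  low    <- 0xDC00L + offset %% 1024L  # bottom 10 bits of the offset
  c(high = high, low = low)
}

pair <- utf16_surrogates(0x1d4d0L)
pair_hex <- format(as.hexmode(pair))
print(pair_hex)

# Bits needed for the largest Unicode code point, U+10FFFF.
bits_needed <- ceiling(log2(0x10FFFF + 1))
print(bits_needed)
```

For U+1D4D0 this gives the pair d835, dcd0. The largest code point, U+10FFFF, needs 21 bits.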