thr3ads.net - R devel - [Rd] Parsing and deparsing of escaped unicode characters [Jul 2014]


Jeroen Ooms

2014-Jul-28 08:47 UTC

[Rd] Parsing and deparsing of escaped unicode characters

In both R and JSON (and many other languages), Unicode characters can
be escaped using a backslash followed by a lowercase "u" and a 4-digit
hex code. However, when deparsing a character vector in R on Windows,
non-Latin characters get escaped as "<U+" followed by their 4-digit
hex code and ">":
> x <- "I like \u5BFF\u53F8"
> cat(x)
I like 寿司
> src <- deparse(x)
> cat(src)
"I like <U+5BFF><U+53F8>"

The same thing happens on Linux when we disable UTF-8:

Sys.setlocale("LC_ALL", "C")
x <- "I like \u5BFF\u53F8"
nchar(x) #9, seems OK
cat(deparse(x))
"I like <U+5BFF><U+53F8>"

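To reproduce the comparison without leaving the session in the C locale, the switch can be wrapped in a small helper that restores the previous setting. This is only a sketch: `with_ctype` is a hypothetical name, and it changes `LC_CTYPE` alone (the category that governs character encoding) rather than `LC_ALL` as in the snippet above.

```r
# Hypothetical helper (not part of base R): evaluate `expr` with LC_CTYPE
# temporarily set to `locale`, then restore the previous value on exit.
with_ctype <- function(locale, expr) {
  old <- Sys.getlocale("LC_CTYPE")
  on.exit(Sys.setlocale("LC_CTYPE", old), add = TRUE)
  Sys.setlocale("LC_CTYPE", locale)
  force(expr)  # arguments are lazy, so evaluation happens under the new locale
}

x <- "I like \u5BFF\u53F8"
src_c <- with_ctype("C", deparse(x))  # deparse() runs in the "C" locale
cat(src_c, "\n")
```

Afterwards `Sys.getlocale("LC_CTYPE")` should report the same value as before the call, so the rest of the session is unaffected.
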
As a result, the code does not parse() back into the proper unicode
characters. I am currently using a regular expression to convert the
output of deparse into something that parse() (and json) supports:

utf8conv <- function(x) {
  gsub("<U\\+([0-9A-F]{4})>","\\\\u\\1",x)
}

> src <- utf8conv(src)
> y <- parse(text=src)[[1]]
> identical(x, y)
[1] TRUE
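
The regular expression above matches exactly four hex digits. A slightly extended sketch follows, under the assumption that code points above U+FFFF are written with eight hex digits (e.g. `<U+0001F600>`); `utf8conv2` is a hypothetical name. The 8-digit form is mapped to `\U`, which parse() accepts with up to eight hex digits.

```r
# Sketch: rewrite deparse()'s <U+XXXX> escapes as \uXXXX or \UXXXXXXXX.
# Assumption: code points above U+FFFF appear with 8 hex digits.
utf8conv2 <- function(x) {
  x <- gsub("<U\\+([0-9A-F]{8})>", "\\\\U\\1", x)  # 8-digit form first
  gsub("<U\\+([0-9A-F]{4})>", "\\\\u\\1", x)       # then the 4-digit form
}

src <- "\"I like <U+5BFF><U+53F8>\""
y <- parse(text = utf8conv2(src))[[1]]
identical(y, "I like \u5BFF\u53F8")
```

Like the original, this cannot tell an escape produced by deparse() from a string that literally contains the text `<U+5BFF>`, so such input would be converted by mistake.
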

However this is suboptimal because it introduces a big performance
overhead for large text. Several things are unclear to me:

 - Why does deparse() use a different escape notation than parse()? Is
there a way to make deparse() output \uXXXX for Unicode instead?
 - Why does deparse() on Windows escape this in the first place, and not
keep the actual character when the locale supports it?
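
For illustration, one way to sidestep the post-processing step is to build the escaped literal directly from the code points instead of calling deparse(). This is a minimal base-R sketch, not a deparse() feature: `deparse_unicode` is a hypothetical name, it handles a single string only, and it assumes the input is valid UTF-8 (or convertible to it).

```r
# Hypothetical helper: return a double-quoted R string literal for `s` in which
# every non-ASCII code point is written as \uXXXX (or \UXXXXXXXX above U+FFFF).
deparse_unicode <- function(s) {
  stopifnot(is.character(s), length(s) == 1L, !is.na(s))
  cp <- utf8ToInt(enc2utf8(s))        # integer Unicode code points
  ascii <- cp < 128L
  out <- character(length(cp))
  # ASCII: encodeString() escapes backslashes and control characters
  out[ascii] <- encodeString(intToUtf8(cp[ascii], multiple = TRUE))
  out[cp == 34L] <- "\\\""            # escape embedded double quotes
  bmp <- !ascii & cp <= 0xFFFF
  out[bmp] <- sprintf("\\u%04X", cp[bmp])
  astral <- cp > 0xFFFF
  out[astral] <- sprintf("\\U%08X", cp[astral])
  paste0("\"", paste(out, collapse = ""), "\"")
}

x <- "I like \u5BFF\u53F8"
src <- deparse_unicode(x)
cat(src, "\n")
identical(parse(text = src)[[1]], x)
```

Because the output is pure ASCII, it should not depend on the locale in which it is written out or parsed again.
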

 > sessionInfo()
R version 3.1.1 (2014-07-10)
Platform: x86_64-w64-mingw32/x64 (64-bit)

locale:
[1] LC_COLLATE=English_United States.1252
[2] LC_CTYPE=English_United States.1252
[3] LC_MONETARY=English_United States.1252
[4] LC_NUMERIC=C
[5] LC_TIME=English_United States.1252

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base

Yihui Xie

2014-Aug-03 19:30 UTC


The behavior depends on the specific locale. When these characters are
deparsed in a Chinese locale, they work fine, but in an English
locale, they will get escaped:

> x <- "I like \u5BFF\u53F8"
> x
[1] "I like 寿司"
> deparse(x)
[1] "\"I like 寿司\""
> sessionInfo()
R version 3.1.1 (2014-07-10)
Platform: x86_64-w64-mingw32/x64 (64-bit)

locale:
[1] LC_COLLATE=Chinese (Simplified)_People's Republic of China.936
[2] LC_CTYPE=Chinese (Simplified)_People's Republic of China.936
[3] LC_MONETARY=Chinese (Simplified)_People's Republic of China.936
[4] LC_NUMERIC=C
[5] LC_TIME=Chinese (Simplified)_People's Republic of China.936

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base

> Sys.setlocale(,'English')
[1] "LC_COLLATE=English_United States.1252;LC_CTYPE=English_United States.1252;LC_MONETARY=English_United States.1252;LC_NUMERIC=C;LC_TIME=English_United States.1252"
> x
[1] "I like 寿司"
> deparse(x)
[1] "\"I like <U+5BFF><U+53F8>\""

Regards,
Yihui
--
Yihui Xie <xieyihui at gmail.com>
Web: http://yihui.name


