Jeroen Ooms
2014-Jul-28 08:47 UTC
[Rd] Parsing and deparsing of escaped unicode characters
In both R and JSON (and many other languages), unicode characters can be escaped using a backslash followed by a lowercase "u" and a 4 digit hex code. However, when deparsing a character vector in R on Windows, the non-latin characters get escaped as "<U+" followed by their 4 digit hex code and ">":

> x <- "I like \u5BFF\u53F8"
> cat(x)
I like ??
> src <- deparse(x)
> cat(src)
"I like <U+5BFF><U+53F8>"

Same thing happens on linux when we disable UTF8:

Sys.setlocale("LC_ALL", "C")
x <- "I like \u5BFF\u53F8"
nchar(x) #9, seems OK
cat(deparse(x))
"I like <U+5BFF><U+53F8>"

As a result, the code does not parse() back into the proper unicode characters. I am currently using a regular expression to convert the output of deparse into something that parse() (and json) supports:

utf8conv <- function(x) {
  gsub("<U\\+([0-9A-F]{4})>","\\\\u\\1",x)
}

> src <- utf8conv(src)
> y <- parse(text=src)[[1]]
> identical(x, y)
[1] TRUE

However, this is suboptimal because it introduces a big performance overhead for large text. Several things are unclear to me:

- Why does deparse() use a different escape notation than parse? Is there a way to make deparse output \uXXXX for unicode instead?
- Why does deparse on windows escape this in the first place, and not keep the actual character when the locale supports it?

> sessionInfo()
R version 3.1.1 (2014-07-10)
Platform: x86_64-w64-mingw32/x64 (64-bit)

locale:
[1] LC_COLLATE=English_United States.1252
[2] LC_CTYPE=English_United States.1252
[3] LC_MONETARY=English_United States.1252
[4] LC_NUMERIC=C
[5] LC_TIME=English_United States.1252

attached base packages:
[1] stats graphics grDevices utils datasets methods base
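For comparison, here is a rough sketch of a different workaround that skips deparse() altogether and builds the escaped source string from the code points. The function name escape_unicode is made up for illustration and is not part of base R. It assumes a single non-NA string with valid UTF-8 content, and it is not claimed to be faster than the gsub() approach above.

```r
# Hypothetical helper (not in base R): return R source for a single string,
# with every non-ASCII character written as a \uXXXX or \UXXXXXXXX escape.
escape_unicode <- function(x) {
  stopifnot(is.character(x), length(x) == 1L, !is.na(x))
  cp <- utf8ToInt(enc2utf8(x))              # integer code points
  chars <- intToUtf8(cp, multiple = TRUE)   # one element per character
  esc <- character(length(cp))

  ascii <- cp < 128L
  # encodeString() escapes backslashes and control characters in ASCII input
  esc[ascii] <- encodeString(chars[ascii])
  # double quotes are left alone by encodeString() by default, so escape them here
  esc[chars == "\""] <- "\\\""

  bmp <- !ascii & cp <= 0xFFFF
  esc[bmp] <- sprintf("\\u%04X", cp[bmp])

  astral <- cp > 0xFFFF                     # needs the 8-digit \U form
  esc[astral] <- sprintf("\\U%08X", cp[astral])

  paste0("\"", paste(esc, collapse = ""), "\"")
}

x <- "I like \u5BFF\u53F8"
src <- escape_unicode(x)
cat(src, "\n")
y <- parse(text = src)[[1]]
identical(x, y)
```

Because the result contains only ASCII, it should round-trip through parse() regardless of the current locale.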
The behavior depends on the specific locale. When these characters are deparsed in a Chinese locale, they work fine, but in an English locale, they will get escaped:

> x <- "I like \u5BFF\u53F8"
> x
[1] "I like ??"
> deparse(x)
[1] "\"I like ??\""
> sessionInfo()
R version 3.1.1 (2014-07-10)
Platform: x86_64-w64-mingw32/x64 (64-bit)

locale:
[1] LC_COLLATE=Chinese (Simplified)_People's Republic of China.936
[2] LC_CTYPE=Chinese (Simplified)_People's Republic of China.936
[3] LC_MONETARY=Chinese (Simplified)_People's Republic of China.936
[4] LC_NUMERIC=C
[5] LC_TIME=Chinese (Simplified)_People's Republic of China.936

attached base packages:
[1] stats graphics grDevices utils datasets methods base

> Sys.setlocale(,'English')
[1] "LC_COLLATE=English_United States.1252;LC_CTYPE=English_United States.1252;LC_MONETARY=English_United States.1252;LC_NUMERIC=C;LC_TIME=English_United States.1252"
> x
[1] "I like ??"
> deparse(x)
[1] "\"I like <U+5BFF><U+53F8>\""

Regards,
Yihui
--
Yihui Xie <xieyihui at gmail.com>
Web: http://yihui.name
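One way to see the locale dependence described above is to ask whether a string can be converted to a given encoding at all. The sketch below is only an illustration under assumptions: representable_in is a hypothetical helper, "latin1" stands in for the Windows 1252 code page, and the NA result for unconvertible input depends on the platform's iconv implementation.

```r
# Hypothetical helper: TRUE when x converts to `encoding` without loss.
# iconv() with the default sub = NA returns NA for an element it cannot convert.
representable_in <- function(x, encoding) {
  !is.na(iconv(enc2utf8(x), from = "UTF-8", to = encoding))
}

x <- "I like \u5BFF\u53F8"
representable_in(x, "latin1")            # the CJK characters have no latin1 form
representable_in("caf\u00E9", "latin1")  # e-acute exists in latin1

# Reports whether the current session runs in a multibyte or UTF-8 locale
l10n_info()
```

Characters that have no form in the session's native encoding are the ones shown as <U+XXXX> in the transcripts above.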