I'm not R-core, but I happen to have run into this issue.
I think this makes sense conceptually, and I have had the same thought
myself. One implementation challenge is that the parser has a special
branch for strings containing Unicode escapes (e.g. "G\u00e1bor") that
limits such input to 10K wide characters, so the parser would need to be
modified for this to work as a general solution:
> parse(text=sprintf('"%s"', strrep("G\\u00e1bor", 2000)))
Error in parse(text = sprintf("\"%s\"", strrep("G\\u00e1bor", 2000))) :
string at line 1 containing Unicode escapes not in this locale
is too long (max 10000 chars)
Such strings are rare, so an interim solution might be to emit \u
escapes only when deparsing shorter strings. The parser modification
itself would also have the benefit of speeding up parsing of strings
without Unicode escapes.
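
To make the interim idea concrete, here is a rough sketch at the R level.
It is only an illustration, not a proposal for how deparse() itself
should be changed: deparse_escaped() and its max_chars argument are
hypothetical names, and the 5000 default is an arbitrary value chosen to
stay well under the parser limit shown above.

```r
# Hypothetical helper (not part of base R): deparse a single string,
# writing every non-ASCII character as a \uXXXX or \UXXXXXXXX escape.
deparse_escaped <- function(x, max_chars = 5000L) {
  stopifnot(is.character(x), length(x) == 1L, !is.na(x))
  x <- enc2utf8(x)

  # Assumption: long strings fall back to the current deparse() behaviour,
  # so the output never runs into the parser's limit on escaped strings.
  if (nchar(x, type = "chars") > max_chars) return(deparse(x))

  cp <- utf8ToInt(x)  # one integer code point per character
  pieces <- vapply(cp, function(p) {
    if (p < 128L) {
      # ASCII: reuse deparse() so quotes, backslashes and control
      # characters are escaped as usual, then drop the outer quotes.
      s <- deparse(intToUtf8(p))
      substr(s, 2L, nchar(s) - 1L)
    } else if (p <= 0xFFFF) {
      sprintf("\\u%04x", p)
    } else {
      sprintf("\\U%08x", p)
    }
  }, character(1))

  paste0("\"", paste(pieces, collapse = ""), "\"")
}

x <- "G\u00e1bor"
out <- deparse_escaped(x)
charToRaw(out)
identical(eval(parse(text = out)), x)
```

The result is pure ASCII, so it should not depend on the locale, and
parsing it gives back the original string.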
Best,
B.
On 2/21/22 5:33 AM, Gábor Csárdi wrote:
> I am wondering if it would make sense to produce \u escaped strings in
> deparse() for UTF-8 input. Currently we have (in R-devel):
>
> x <- "G\u00e1bor"
> Sys.setlocale("LC_ALL", "C")
> #> [1] "C/C/C/C/C/en_US.UTF-8"
>
> deparse(x)
> #> [1] "\"G<U+00E1>bor\""
>
> charToRaw(deparse(x))
> #> [1] 22 47 3c 55 2b 30 30 45 31 3e 62 6f 72 22
>
> Is there a reason why this is preferable instead of returning
>
> "\"G\\u00e1bor\""
>
> i.e.
>
> charToRaw("\"G\\u00e1bor\"")
> #> [1] 22 47 5c 75 30 30 65 31 62 6f 72 22
>
> Returning the \u escaped form would make deparse() the inverse of
> parse(), at least in this respect.
>
> Thank you,
> Gabor
>
> ______________________________________________
> R-devel at r-project.org mailing list
> https://stat.ethz.ch/mailman/listinfo/r-devel