thr3ads.net - R devel - [Rd] deparse() and UTF-8 strings [Feb 2022]

Brodie Gaslam

2022-Feb-21 13:17 UTC

[Rd] deparse() and UTF-8 strings

I'm not R-core, but happen to have run into this issue.

I think this makes sense conceptually, and have had the same thought 
myself.  One implementation challenge is that the parser has a special 
branch for Unicode escape strings (e.g. "G\u00e1bor") that limits such
input to 10K wide characters, so the parser would need to be modified in 
order to make this a general solution:

 > parse(text=sprintf('"%s"', strrep("G\\u00e1bor", 2000)))
Error in parse(text = sprintf("\"%s\"", strrep("G\\u00e1bor", 2000))) :
   string at line 1 containing Unicode escapes not in this locale
is too long (max 10000 chars)

Such strings are rare so maybe an interim solution is just to allow it 
for deparsing of shorter strings.  The parser modification itself would 
also have the benefit of speeding up parsing of strings without Unicode 
escapes.
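
As a rough way to probe the limit described above, the parse() call can be wrapped in tryCatch(). This is only a sketch: whether the long input fails is likely to depend on the R version and locale, so no output is shown. The names `short`, `long` and `res` are made up for illustration.

```r
# A short string with \u escapes is expected to round-trip through parse().
short <- sprintf('"%s"', strrep("G\\u00e1bor", 2))
short_ok <- identical(eval(parse(text = short)), strrep("G\u00e1bor", 2))

# The long string from the example above: capture the error message
# instead of stopping, so the check can run on any R version.
long <- sprintf('"%s"', strrep("G\\u00e1bor", 2000))
res <- tryCatch(
  {
    parse(text = long)
    "parsed without error"
  },
  error = function(e) conditionMessage(e)
)

print(short_ok)
print(res)
```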

Best,

B.

On 2/21/22 5:33 AM, Gábor Csárdi wrote:
> I am wondering if it would make sense to produce \u escaped strings in
> deparse() for UTF-8 input. Currently we have (in R-devel):
> 
> x <- "G\u00e1bor"
> Sys.setlocale("LC_ALL", "C")
> #> [1] "C/C/C/C/C/en_US.UTF-8"
> 
> deparse(x)
> #> [1] "\"G<U+00E1>bor\""
> 
> charToRaw(deparse(x))
> #> [1] 22 47 3c 55 2b 30 30 45 31 3e 62 6f 72 22
> 
> Is there a reason why this is preferable instead of returning
> 
> "\"G\\u00e1bor\""
> 
> i.e.
> 
> charToRaw("\"G\\u00e1bor\"")
> #>  [1] 22 47 5c 75 30 30 65 31 62 6f 72 22
> 
> Returning the \u escaped form would make deparse() the inverse of
> parse(), at least in this respect.
> 
> Thank you,
> Gabor
> 
> ______________________________________________
> R-devel at r-project.org mailing list
> https://stat.ethz.ch/mailman/listinfo/r-devel
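
A minimal R-level sketch of the output proposed in the quoted message, assuming a single non-NA string that is valid UTF-8. `deparse_escaped()` is a hypothetical helper, not part of base R and not how deparse() is implemented internally. It leaves ASCII escaping to deparse() itself and writes every other code point as a fixed-width \u or \U escape.

```r
# Hypothetical helper: deparse one string with non-ASCII code points
# written as \uxxxx (4 hex digits) or \Uxxxxxxxx (8 hex digits).
deparse_escaped <- function(x) {
  stopifnot(is.character(x), length(x) == 1L, !is.na(x))
  cps <- utf8ToInt(enc2utf8(x))
  pieces <- vapply(cps, function(cp) {
    if (cp < 128L) {
      # Let deparse() handle quotes, backslashes and control characters,
      # then drop the surrounding double quotes it adds.
      d <- deparse(intToUtf8(cp))
      substr(d, 2L, nchar(d) - 1L)
    } else if (cp <= 0xFFFF) {
      sprintf("\\u%04x", cp)
    } else {
      sprintf("\\U%08x", cp)
    }
  }, character(1))
  paste0("\"", paste(pieces, collapse = ""), "\"")
}

x <- "G\u00e1bor"
out <- deparse_escaped(x)
print(charToRaw(out))

# The escaped form parses back to the original string.
print(identical(eval(parse(text = out)), x))
```

Always padding to 4 or 8 hex digits is a deliberate choice here: it keeps a following character such as "b" from being read as part of the escape.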

Gábor Csárdi

2022-Feb-22 09:53 UTC

[Rd] deparse() and UTF-8 strings

I just happened to see a commit that adds iconv() support for the C99
\u escapes, which might or might not be a coincidence:
https://github.com/wch/r-source/commit/f19b4ae7715eea1b18ef8368b4c2849a578ade07

In any case, this is great, and it is very useful to have this available
cross-platform. Thank you!

Would it make sense to generate braced 4-digit \uxxxx sequences, to
make sure that they don't mix with the surrounding text?
I.e. \u{xxxx}? (Plus update the 6 to 8 twice.)
https://github.com/wch/r-source/commit/f19b4ae7715eea1b18ef8368b4c2849a578ade07#diff-9a906ea3803721bf2aa8b802e98786c3b096727d87f1c423826e3bba4c112d76R746-R747

Also, it seems that we need a capital \U for the 8-digit sequences here:
https://github.com/wch/r-source/commit/f19b4ae7715eea1b18ef8368b4c2849a578ade07#diff-9a906ea3803721bf2aa8b802e98786c3b096727d87f1c423826e3bba4c112d76R753
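
To illustrate the two points above, here is a small sketch using R's own string literal syntax (see ?Quotes). It does not exercise the iconv() code in the commit; it only shows why an unbraced, unpadded escape can absorb the text that follows it, and why code points above U+FFFF need the capital \U form. The variable names are made up for illustration.

```r
# \u reads up to 4 hex digits, so an unpadded escape can swallow
# hex-looking characters that follow it.
braced   <- "\u{e1}bor"   # braces delimit the code point explicitly
padded   <- "\u00e1bor"   # 4 digits, so "b" is not consumed
unpadded <- "\ue1bor"     # read as U+0E1B followed by "or"

print(identical(braced, padded))
print(utf8ToInt(unpadded))

# Code points above U+FFFF need \U (up to 8 hex digits).
astral    <- "\U0001F600"
truncated <- "\u1F600"    # read as U+1F60 followed by the character "0"

print(utf8ToInt(astral))
print(nchar(truncated, type = "chars"))
```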

Thank you again,
Gabor

