Ivan Krylov
2023-Sep-18 21:33 UTC
[Rd] On PRINTNAME() encoding, EncodeChar(), and being painted into a corner
Hello R-devel, I have originally learned about this from the following GitHub issue: <https://github.com/r-devel/r-project-sprint-2023/issues/65>. In short, in various places of the R source code, symbol names are accessed using translateChar(), EncodeChar(), and CHAR(), and it might help to unify their use. Currently, R is very careful to only create symbols with names in the native encoding. I have verified this by tracing the ways a symbol can be created (allocSExp) or have a name assigned (SET_PRINTNAME) using static analysis (Coccinelle). While it's possible to create a symbol with a name in an encoding different from the native encoding using SET_PRINTNAME(symbol, mkCharCE(...)), neither R nor CRAN packages invoke code like this for an arbitrary encoding; symbols are always created using either install() or installTrChar(). (install("invalid byte sequence") is, of course, still possible, but is a different problem.) This means that translateChar(PRINTNAME(...)) is currently unnecessary, but it may be worth adding a check (opt-in, applicable only during R CMD check, to avoid a performance hit?) to SET_PRINTNAME() to ensure that only native-encoding (or ASCII) symbol names are used. I could also suggest a patch for Writing R Extensions or R Internals to document this assumption. The following translateChar() doesn't hurt (it returns CHAR(x) right away without allocating any memory), but it stands out against most uses of CHAR(PRINTNAME(.)) and EncodeChar(PRINTNAME(.)): --- src/main/subscript.c (revision 85160) +++ src/main/subscript.c (working copy) @@ -186,7 +186,7 @@ PROTECT(names); for (i = 0; i < nx; i++) if (streql(translateChar(STRING_ELT(names, i)), - translateChar(PRINTNAME(s)))) { + CHAR(PRINTNAME(s)))) { indx = i; break; } The following translateChar() can be safely replaced with EncodeChar(), correctly printing funnily-named functions in tracemem() reports: --- src/main/debug.c (revision 85160) +++ src/main/debug.c (working copy) @@ -203,7 +203,7 @@ && TYPEOF(cptr->call) == LANGSXP) { SEXP fun = CAR(cptr->call); Rprintf("%s ", - TYPEOF(fun) == SYMSXP ? translateChar(PRINTNAME(fun)) : + TYPEOF(fun) == SYMSXP ? EncodeChar(PRINTNAME(fun)) : "<Anonymous>"); } } tracemem(a <- 1:10) `\r\v\t\n` <- function(x) x[1] <- 0 `\r\v\t\n`(a) # Now correctly prints: # tracemem[0x55fd11e61e00 -> 0x55fd1081d2a8]: \r\v\t\n # tracemem[0x55fd1081d2a8 -> 0x55fd113277e8]: \r\v\t\n What about EncodeChar(PRINTNAME(.))? This is the intended way to report symbols in error messages. Without EncodeChar(), .Internal(`\r\v\t\n`()) actually prints the newlines to standard output as part of the error message instead of escaping them. Unfortunately, EncodeChar() uses a statically-allocated buffer for its return value, *and* the comments say that it's unsafe to use together with errorcall(): errorcall_cpy() must be used instead. I think that's because EncodeChar() may be called while deparsing the call, overwriting the statically-allocated buffer before the format arguments (which also contain the return value of EncodeChar()) are processed. In particular, this means that EncodeChar() is unsafe to use with any kind of warnings. The following Coccinelle script locates uses of CHAR(PRINTNAME(.)) inside errors and warnings: @@ expression x; expression list arg1, arg2; identifier fun =~ "(Rf_)?(error|warning)(call)?(_cpy)?"; @@ fun( arg1, * CHAR(PRINTNAME(x)), arg2 ) Some of these, which already use errorcall(), are trivial to fix by replacing CHAR() with EncodeChar() and upgrading errorcall() to errorcall_cpy(): --- src/main/names.c +++ src/main/names.c @@ -1367,7 +1367,7 @@ attribute_hidden SEXP do_internal(SEXP c errorcall(call, _("invalid .Internal() argument")); if (INTERNAL(fun) == R_NilValue) - errorcall(call, _("there is no .Internal function '%s'"), + errorcall_cpy(call, _("there is no .Internal function '%s'"), - CHAR(PRINTNAME(fun))); + EncodeChar(PRINTNAME(fun))); #ifdef CHECK_INTERNALS if(R_Is_Running > 1 && getenv("_R_CHECK_INTERNALS2_")) { --- src/main/eval.c +++ src/main/eval.c @@ -1161,7 +1161,7 @@ SEXP eval(SEXP e, SEXP rho) const char *n = CHAR(PRINTNAME(e)); - if(*n) errorcall(getLexicalCall(rho), + if(*n) errorcall_cpy(getLexicalCall(rho), _("argument \"%s\" is missing, with no default"), - CHAR(PRINTNAME(e))); + EncodeChar(PRINTNAME(e))); else errorcall(getLexicalCall(rho), _("argument is missing, with no default")); } --- src/main/match.c +++ src/main/match.c @@ -229,7 +229,7 @@ attribute_hidden SEXP matchArgs_NR(SEXP if (fargused[arg_i] == 2) - errorcall(call, + errorcall_cpy(call, _("formal argument \"%s\" matched by multiple actual arguments"), - CHAR(PRINTNAME(TAG(f)))); + EncodeChar(PRINTNAME(TAG(f)))); if (ARGUSED(b) == 2) errorcall(call, _("argument %d matches multiple formal arguments"), @@ -272,12 +271,12 @@ attribute_hidden SEXP matchArgs_NR(SEXP if (fargused[arg_i] == 1) - errorcall(call, + errorcall_cpy(call, _("formal argument \"%s\" matched by multiple actual arguments"), - CHAR(PRINTNAME(TAG(f)))); + EncodeChar(PRINTNAME(TAG(f)))); if (R_warn_partial_match_args) { warningcall(call, _("partial argument match of '%s' to '%s'"), CHAR(PRINTNAME(TAG(b))), CHAR(PRINTNAME(TAG(f))) ); } SETCAR(a, CAR(b)); if (CAR(b) != R_MissingArg) SET_MISSING(a, 0); The changes become more complicated with a plain error() (have to figure out the current call and provide it to errorcall_cpy), still more complicated with warnings (there's currently no warningcall_cpy(), though one can be implemented) and even more complicated when multiple symbols are used in the same warning or error, like in the last warningcall() above (EncodeChar() can only be called once at a time). The only solution to the latter problem is an EncodeChar() variant that allocates its memory dynamically. Would R_alloc() be acceptable in this context? With errors, the allocation stack would be quickly reset (except when withCallingHandlers() is in effect?), but with warnings, the code would have to restore it manually every time. Is it even worth the effort to try to handle the (pretty rare) non-syntactic symbol names while constructing error messages? Other languages (like Lua or SQLite) provide a special printf specifier (typically %q) to create quoted/escaped string representations, but we're not yet at the point of providing a C-level printf implementation. -- Best regards, Ivan
iuke-tier@ey m@iii@g oii uiow@@edu
2023-Sep-22 21:14 UTC
[Rd] [External] On PRINTNAME() encoding, EncodeChar(), and being painted into a corner
Thanks for looking into this! On Mon, 18 Sep 2023, Ivan Krylov wrote:> Hello R-devel, > > I have originally learned about this from the following GitHub issue: > <https://github.com/r-devel/r-project-sprint-2023/issues/65>. In short, > in various places of the R source code, symbol names are accessed using > translateChar(), EncodeChar(), and CHAR(), and it might help to unify > their use. > > Currently, R is very careful to only create symbols with names in the > native encoding. I have verified this by tracing the ways a symbol can > be created (allocSExp) or have a name assigned (SET_PRINTNAME) using > static analysis (Coccinelle). While it's possible to create a symbol > with a name in an encoding different from the native encoding using > SET_PRINTNAME(symbol, mkCharCE(...)), neither R nor CRAN packages > invoke code like this for an arbitrary encoding; symbols are always > created using either install() or installTrChar(). (install("invalid > byte sequence") is, of course, still possible, but is a different > problem.)SET_PRINTNAME is not in the API and not in the public header files so this should not be an issue. It would probably be best to refactor things so SET_PRINTNAME only exists in memory.c> This means that translateChar(PRINTNAME(...)) is currently unnecessary, > but it may be worth adding a check (opt-in, applicable only during R > CMD check, to avoid a performance hit?) to SET_PRINTNAME() to ensure > that only native-encoding (or ASCII) symbol names are used. I could also > suggest a patch for Writing R Extensions or R Internals to document this > assumption. > > The following translateChar() doesn't hurt (it returns CHAR(x) right > away without allocating any memory), but it stands out against most > uses of CHAR(PRINTNAME(.)) and EncodeChar(PRINTNAME(.)): > > --- src/main/subscript.c (revision 85160) > +++ src/main/subscript.c (working copy) > @@ -186,7 +186,7 @@ > PROTECT(names); > for (i = 0; i < nx; i++) > if (streql(translateChar(STRING_ELT(names, i)), > - translateChar(PRINTNAME(s)))) { > + CHAR(PRINTNAME(s)))) { > indx = i; > break; > } > > The following translateChar() can be safely replaced with EncodeChar(), > correctly printing funnily-named functions in tracemem() reports: > > --- src/main/debug.c (revision 85160) > +++ src/main/debug.c (working copy) > @@ -203,7 +203,7 @@ > && TYPEOF(cptr->call) == LANGSXP) { > SEXP fun = CAR(cptr->call); > Rprintf("%s ", > - TYPEOF(fun) == SYMSXP ? translateChar(PRINTNAME(fun)) : > + TYPEOF(fun) == SYMSXP ? EncodeChar(PRINTNAME(fun)) : "<Anonymous>"); > } > } > > tracemem(a <- 1:10) > `\r\v\t\n` <- function(x) x[1] <- 0 > `\r\v\t\n`(a) > # Now correctly prints: > # tracemem[0x55fd11e61e00 -> 0x55fd1081d2a8]: \r\v\t\n > # tracemem[0x55fd1081d2a8 -> 0x55fd113277e8]: \r\v\t\nSounds good. I've made those two changes in the trunk in r85209.> What about EncodeChar(PRINTNAME(.))? This is the intended way to report > symbols in error messages. Without EncodeChar(), > .Internal(`\r\v\t\n`()) actually prints the newlines to standard output > as part of the error message instead of escaping them. Unfortunately, > EncodeChar() uses a statically-allocated buffer for its return value, > *and* the comments say that it's unsafe to use together with > errorcall(): errorcall_cpy() must be used instead. I think that's > overwriting the statically-allocated buffer before the format arguments > (which also contain the return value of EncodeChar()) are processed. In > particular, this means that EncodeChar() is unsafe to use with any kind > of warnings. The following Coccinelle script locates uses of > CHAR(PRINTNAME(.)) inside errors and warnings: > @@ > expression x; > expression list arg1, arg2; > identifier fun =~ "(Rf_)?(error|warning)(call)?(_cpy)?"; > @@ > fun( > arg1, > * CHAR(PRINTNAME(x)), > arg2 > ) > > Some of these, which already use errorcall(), are trivial to fix by > replacing CHAR() with EncodeChar() and upgrading errorcall() to > errorcall_cpy():I think it would be best to modify errorcall so errorcall_cpy is not necessary. As things are now it is just too easy to forget that sometimes errorcall_cpy should be used (and this has lead to some bugs recently).> --- src/main/names.c > +++ src/main/names.c > @@ -1367,7 +1367,7 @@ attribute_hidden SEXP do_internal(SEXP c > errorcall(call, _("invalid .Internal() argument")); > if (INTERNAL(fun) == R_NilValue) > - errorcall(call, _("there is no .Internal function '%s'"), > + errorcall_cpy(call, _("there is no .Internal function '%s'"), > - CHAR(PRINTNAME(fun))); > + EncodeChar(PRINTNAME(fun))); > > #ifdef CHECK_INTERNALS > if(R_Is_Running > 1 && getenv("_R_CHECK_INTERNALS2_")) { > > --- src/main/eval.c > +++ src/main/eval.c > @@ -1161,7 +1161,7 @@ SEXP eval(SEXP e, SEXP rho) > const char *n = CHAR(PRINTNAME(e)); > - if(*n) errorcall(getLexicalCall(rho), > + if(*n) errorcall_cpy(getLexicalCall(rho), > _("argument \"%s\" is missing, with no default"), > - CHAR(PRINTNAME(e))); > + EncodeChar(PRINTNAME(e))); > else errorcall(getLexicalCall(rho), > _("argument is missing, with no default")); > } > > --- src/main/match.c > +++ src/main/match.c > @@ -229,7 +229,7 @@ attribute_hidden SEXP matchArgs_NR(SEXP > if (fargused[arg_i] == 2) > - errorcall(call, > + errorcall_cpy(call, > _("formal argument \"%s\" matched by multiple actual arguments"), > - CHAR(PRINTNAME(TAG(f)))); > + EncodeChar(PRINTNAME(TAG(f)))); > if (ARGUSED(b) == 2) > errorcall(call, > _("argument %d matches multiple formal arguments"), > @@ -272,12 +271,12 @@ attribute_hidden SEXP matchArgs_NR(SEXP > if (fargused[arg_i] == 1) > - errorcall(call, > + errorcall_cpy(call, > _("formal argument \"%s\" matched by multiple actual arguments"), > - CHAR(PRINTNAME(TAG(f)))); > + EncodeChar(PRINTNAME(TAG(f)))); > if (R_warn_partial_match_args) { > warningcall(call, > _("partial argument match of '%s' to '%s'"), CHAR(PRINTNAME(TAG(b))), > CHAR(PRINTNAME(TAG(f))) ); > } > SETCAR(a, CAR(b)); > if (CAR(b) != R_MissingArg) SET_MISSING(a, 0); > > The changes become more complicated with a plain error() (have to > figure out the current call and provide it to errorcall_cpy), still > more complicated with warnings (there's currently no warningcall_cpy(), > though one can be implemented) and even more complicated when multiple > symbols are used in the same warning or error, like in the last > warningcall() above (EncodeChar() can only be called once at a time). > > The only solution to the latter problem is an EncodeChar() variant that > allocates its memory dynamically. Would R_alloc() be acceptable in this > context? With errors, the allocation stack would be quickly reset > (except when withCallingHandlers() is in effect?), but with warnings, > the code would have to restore it manually every time.Or allow/require a buffer to be provided. So replacing the calls like CHAR(PRINTNAME(sym)) with EncodeSymbol(sym, buf, buf_size)> Is it even worth > the effort to try to handle the (pretty rare) non-syntactic symbol names > while constructing error messages? Other languages (like Lua or SQLite) > provide a special printf specifier (typically %q) to create > quoted/escaped string representations, but we're not yet at the point > of providing a C-level printf implementation.Not clear it is worth it. But the situation now is not good, because sometimes we encode and sometimes we don't. It would be better to be consistent, both for the end user and for maintainers who now have to spend time figuring out which way to go. Best, luke -- Luke Tierney Ralph E. Wareham Professor of Mathematical Sciences University of Iowa Phone: 319-335-3386 Department of Statistics and Fax: 319-335-3017 Actuarial Science 241 Schaeffer Hall email: luke-tierney at uiowa.edu Iowa City, IA 52242 WWW: http://www.stat.uiowa.edu