Hugh Parsonage
2021-Mar-30 06:02 UTC
[Rd] nchar(x, type = "bytes") seems slower than it could be
While profiling some C code, I rolled my own nchar function which appears to be much faster than base R's (25 times faster for a 10M length vector). Obviously base::nchar provides significantly more features than my barebones function (C snippet below); however, for argument type = "bytes" it seems that the R_nchar and do_nchar functions do not actually do anything more than this function.

My suspicion is that I have overlooked some subtlety in the base R code, or that my benchmarks are not representative. Alternatively, the action in `do_nchar` of preparing the potential error message before being passed to `R_nchar` may be quite costly indeed. Or the function cannot be unswitched from the more complex width and chars arguments by the compiler.

If I haven't missed something, would a patch be warranted?

    SEXP Cnchar(SEXP x) {
      R_xlen_t N = xlength(x);
      SEXP ans = PROTECT(allocVector(INTSXP, N));
      int * restrict ansp = INTEGER(ans);

      // Ignoring NA to avoid the branch has a very small
      // impact on performance.
      for (R_xlen_t i = 0; i < N; ++i) {
        SEXP sxi = STRING_ELT(x, i);
        if (sxi == NA_STRING) {
          ansp[i] = NA_INTEGER;
          continue;
        }
        ansp[i] = length(sxi);
      }
      UNPROTECT(1);
      return ans;
    }

    x <- rep_len(c(as.character(c(5L, 1:1e6)), NA_character_, 1e6:15e5), 1e7)
    Cnchar(x)                  # 90 ms
    nchar(x, type = "bytes")   # 2500 ms
Tomas Kalibera
2021-Mar-30 08:20 UTC
[Rd] nchar(x, type = "bytes") seems slower than it could be
Thanks for the report, you are probably running into the overhead of the eager creation of the error message. On my system, with your micro-benchmark, it is about 10x. I've tested simply by uncommenting it and re-running the benchmark.

I'll fix (this is not a good task for a contributed patch).

Best,
Tomas
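To make the "eager creation of the error message" point concrete, here is a minimal sketch in plain C, with no R headers. It is not the code in base R and not the eventual fix; the names count_bytes_eager, count_bytes_lazy, len_with_label and len_with_index are hypothetical, and NULL stands in for an invalid element. The eager variant formats an "element N" label for every element before calling the helper, even though the label is only read on failure. The lazy variant passes the index and formats the text only when an error actually occurs.

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical helper: receives a label that was already formatted by the
   caller. The label is only read on the failure path. */
static int len_with_label(const char *s, const char *label,
                          char *err, size_t errsz)
{
    if (s == NULL) {
        snprintf(err, errsz, "invalid input in %s", label);
        return -1;
    }
    return (int) strlen(s);
}

/* Hypothetical helper: receives the index and formats the message only
   when it is actually needed. */
static int len_with_index(const char *s, size_t i,
                          char *err, size_t errsz)
{
    if (s == NULL) {
        snprintf(err, errsz, "invalid input in element %zu", i + 1);
        return -1;
    }
    return (int) strlen(s);
}

/* Eager: one snprintf call per element, whether or not an error occurs. */
void count_bytes_eager(const char **x, size_t n, int *out,
                       char *err, size_t errsz)
{
    assert(x != NULL && out != NULL && err != NULL);
    for (size_t i = 0; i < n; ++i) {
        char label[32];
        snprintf(label, sizeof label, "element %zu", i + 1);
        out[i] = len_with_label(x[i], label, err, errsz);
    }
}

/* Lazy: the common path is just a NULL check and a strlen. */
void count_bytes_lazy(const char **x, size_t n, int *out,
                      char *err, size_t errsz)
{
    assert(x != NULL && out != NULL && err != NULL);
    for (size_t i = 0; i < n; ++i) {
        out[i] = len_with_index(x[i], i, err, errsz);
    }
}
```

Both variants produce the same lengths and the same error text; they differ only in how much work is done on the path where nothing goes wrong. With very short strings, as in the benchmark above, the per-element formatting can plausibly dominate the loop, which would be consistent with the slowdown reported in this thread.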