thr3ads.net - R devel - [Rd] nchar(x, type = "bytes") seems slower than it could be [Mar 2021]

Hugh Parsonage

2021-Mar-30 06:02 UTC

[Rd] nchar(x, type = "bytes") seems slower than it could be

While profiling some C code, I rolled my own nchar function which
appears to be much faster than base R's (25 times faster for a 10M
length vector).  Obviously base::nchar provides significantly more
features than my barebones function (C snippet below); however, for
argument type = "bytes" it seems that the R_nchar and do_nchar
functions do not actually do anything more than this function.

My suspicion is that I have overlooked some subtlety in the base R
code, or that my benchmarks are not representative.  Alternatively,
`do_nchar` prepares the potential error message before calling
`R_nchar`, and that preparation may be quite costly.  Or perhaps the
compiler cannot unswitch the loop from the more complex handling of
type = "width" and type = "chars".

If I haven't missed something, would a patch be warranted?

#include <R.h>
#include <Rinternals.h>

// Byte length of each element of a character vector; NA stays NA.
SEXP Cnchar(SEXP x) {
  R_xlen_t N = xlength(x);
  SEXP ans = PROTECT(allocVector(INTSXP, N));
  int * restrict ansp = INTEGER(ans);

  // Dropping the NA check to avoid the branch makes very little
  // difference to performance.
  for (R_xlen_t i = 0; i < N; ++i) {
    SEXP sxi = STRING_ELT(x, i);
    if (sxi == NA_STRING) {
      ansp[i] = NA_INTEGER;
      continue;
    }
    ansp[i] = length(sxi);  // for a CHARSXP, the length in bytes
  }
  UNPROTECT(1);
  return ans;
}

x <- rep_len(c(as.character(c(5L, 1:1e6)), NA_character_, 1e6:15e5), 1e7)
Cnchar(x)                  # 90 ms
nchar(x, type = "bytes")   # 2500 ms
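
For readers unfamiliar with the term, "unswitching" means hoisting a loop-invariant branch out of a loop so that each case gets its own tight loop. The following is a minimal standalone sketch in plain C and is not taken from R's source: the names `len_type`, `count_generic`, and `count_unswitched` are hypothetical, and the UTF-8 character count is only a simplified stand-in for what a "chars" mode might do.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical stand-in for nchar's `type` argument. */
typedef enum { LEN_BYTES, LEN_CHARS } len_type;

/* Count UTF-8 code points: every byte that is not a continuation
   byte (10xxxxxx) starts a new character. Assumes valid UTF-8. */
static int utf8_chars(const char *s) {
    int n = 0;
    for (; *s; ++s)
        if (((unsigned char)*s & 0xC0) != 0x80) ++n;
    return n;
}

/* Generic version: the loop-invariant `type` is tested on every
   iteration unless the compiler manages to hoist the test itself. */
void count_generic(const char **x, size_t n, len_type type, int *out) {
    for (size_t i = 0; i < n; ++i) {
        if (type == LEN_BYTES)
            out[i] = (int)strlen(x[i]);
        else
            out[i] = utf8_chars(x[i]);
    }
}

/* Manually unswitched version: the test is made once, so the bytes
   case runs a loop with no per-element dispatch. */
void count_unswitched(const char **x, size_t n, len_type type, int *out) {
    if (type == LEN_BYTES) {
        for (size_t i = 0; i < n; ++i) out[i] = (int)strlen(x[i]);
    } else {
        for (size_t i = 0; i < n; ++i) out[i] = utf8_chars(x[i]);
    }
}
```

Both functions return the same results; whether the second is measurably faster depends on the compiler and on how expensive the per-element work is, so this illustrates the hypothesis rather than confirming it.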

Tomas Kalibera

2021-Mar-30 08:20 UTC

[Rd] nchar(x, type = "bytes") seems slower than it could be

Thanks for the report. You are probably running into the overhead of the
eager creation of the error message. On my system, with your
micro-benchmark, that overhead is about 10x. I tested this simply by
commenting it out and re-running the benchmark. I'll fix it (this is not
a good task for a contributed patch).

Best,
Tomas
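
The difference between eager and lazy creation of an error message can be sketched as follows. This is a standalone illustration in plain C under stated assumptions, not the code in `do_nchar` or `R_nchar`: the function names, the message text, and the use of a NULL element as the "error" case are all hypothetical. The counter `n_formatted` exists only to make the amount of formatting work visible.

```c
#include <stdio.h>
#include <string.h>

#define MSG_LEN 64

/* Counts how often a message is formatted (illustration only). */
static long n_formatted = 0;

static void format_msg(char *buf, size_t i) {
    snprintf(buf, MSG_LEN, "invalid input at element %zu", i + 1);
    ++n_formatted;
}

/* Byte length of s, or -1 for NULL (the hypothetical error case). */
static int byte_len(const char *s) { return s ? (int)strlen(s) : -1; }

/* Eager: a message is formatted for every element, although it is
   only used for the rare element that fails. */
int lengths_eager(const char **x, size_t n, int *out, char *err) {
    int n_bad = 0;
    for (size_t i = 0; i < n; ++i) {
        char msg[MSG_LEN];
        format_msg(msg, i);
        out[i] = byte_len(x[i]);
        if (out[i] < 0) { strcpy(err, msg); ++n_bad; }
    }
    return n_bad;
}

/* Lazy: the message is formatted only once a failure is detected. */
int lengths_lazy(const char **x, size_t n, int *out, char *err) {
    int n_bad = 0;
    for (size_t i = 0; i < n; ++i) {
        out[i] = byte_len(x[i]);
        if (out[i] < 0) { format_msg(err, i); ++n_bad; }
    }
    return n_bad;
}
```

For a vector of 10M valid strings, the eager version would perform 10M formatting calls and the lazy version none, which is the kind of per-element overhead described above.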

