Suharto Anggono Suharto Anggono
2025-Apr-10 07:53 UTC
[Rd] table() and as.character() performance for logical values
Chain of calls of C functions in coerce.c for as.character(<logical>) in R:

do_asatomic
ascommon
coerceVector
coerceToString
StringFromLogical (for each element)

The definition of 'StringFromLogical' in coerce.c:

attribute_hidden SEXP StringFromLogical(int x, int *warn)
{
    int w;
    formatLogical(&x, 1, &w);
    if (x == NA_LOGICAL) return NA_STRING;
    else return mkChar(EncodeLogical(x, w));
}

The definition of 'EncodeLogical' in printutils.c:

const char *EncodeLogical(int x, int w)
{
    static char buff[NB];
    if(x == NA_LOGICAL) snprintf(buff, NB, "%*s", min(w, (NB-1)), CHAR(R_print.na_string));
    else if(x) snprintf(buff, NB, "%*s", min(w, (NB-1)), "TRUE");
    else snprintf(buff, NB, "%*s", min(w, (NB-1)), "FALSE");
    buff[NB-1] = '\0';
    return buff;
}

> L <- sample(c(TRUE, FALSE), 10^7, replace = TRUE)
> system.time(as.character(L))
   user  system elapsed
   2.69    0.02    2.73
> system.time(c("FALSE", "TRUE")[L+1])
   user  system elapsed
   0.15    0.04    0.20
> system.time(c("FALSE", "TRUE")[L+1L])
   user  system elapsed
   0.08    0.05    0.13
> L <- rep(NA, 10^7)
> system.time(as.character(L))
   user  system elapsed
   0.11    0.00    0.11
> system.time(c("FALSE", "TRUE")[L+1])
   user  system elapsed
   0.16    0.06    0.22
> system.time(c("FALSE", "TRUE")[L+1L])
   user  system elapsed
   0.09    0.03    0.12

`as.character` of a logical vector that is all NA is fast enough. It appears that the call to 'formatLogical' inside the C function 'StringFromLogical' does not introduce much slowdown.

I found that using a string literal inside the C function 'StringFromLogical', by replacing
    EncodeLogical(x, w)
with
    x ? "TRUE" : "FALSE"
(so that the call to 'formatLogical' is not needed anymore), makes it faster.

Alternatively, a "fast path" could be introduced in 'EncodeLogical', which would potentially also benefit format() in R. For example, without replacing existing code, the following fragment could be inserted.
    if(x == NA_LOGICAL) {if(w == R_print.na_width) return CHAR(R_print.na_string);}
    else if(x) {if(w == 4) return "TRUE";}
    else {if(w == 5) return "FALSE";}

However, with either of them, c("FALSE", "TRUE")[L+1L] is still faster than as.character(L).

Precomputing or caching the possible results of the C function 'StringFromLogical' allows as.character(L) to be as fast as c("FALSE", "TRUE")[L+1L] in R. For example, 'StringFromLogical' could be changed to

attribute_hidden SEXP StringFromLogical(int x, int *warn)
{
    static SEXP TrueCh, FalseCh;
    if (x == NA_LOGICAL) return NA_STRING;
    else if (x) return TrueCh ? TrueCh : (TrueCh = mkChar("TRUE"));
    else return FalseCh ? FalseCh : (FalseCh = mkChar("FALSE"));
}

----------------
On 21 Mar 2025, at 8:26, Karolis Koncevičius wrote:
>
> I was calling table() on some long logical vectors and noticed that it took a long time.
>
> Out of curiosity I checked the performance of table() on different types, and had some unexpected results:
>
>    C <- sample(c("yes", "no"), 10^7, replace = TRUE)
>    F <- factor(sample(c("yes", "no"), 10^7, replace = TRUE))
>    N <- sample(c(1,0), 10^7, replace = TRUE)
>    I <- sample(c(1L,0L), 10^7, replace = TRUE)
>    L <- sample(c(TRUE, FALSE), 10^7, replace = TRUE)
>
>                            # ordered by execution time
>                            #  user  system elapsed
>    system.time(table(F))  #  0.088  0.006  0.093
>    system.time(table(C))  #  0.208  0.017  0.224
>    system.time(table(I))  #  0.242  0.019  0.261
>    system.time(table(L))  #  0.665  0.015  0.680
>    system.time(table(N))  #  1.771  0.019  1.791
>
>
> The performance for Integers and especially booleans is quite surprising.
> After investigating the source of table, I ended up on the reason being as.character():
>
>    system.time(as.character(L))
>      user  system elapsed
>    0.461   0.002   0.462
>
> Even a manual conversion can achieve a speed-up by a factor of ~7:
>
>    system.time(c("FALSE", "TRUE")[L+1])
>      user  system elapsed
>    0.061   0.006   0.067
>
>
> Tested on 4.4.3 as well as devel trunk.
>
> Just reporting for comments and attention.
> Karolis K.
> ______________________________________________
> R-devel using r-project.org mailing list
> https://stat.ethz.ch/mailman/listinfo/r-devel
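
[Editorial sketch, not part of the original message.] The "fast path" idea for 'EncodeLogical' described above can be tried outside of R with a small standalone C mock. Everything suffixed _demo, the helper min_int, the function names, and the value used for NB are assumptions made for illustration; they stand in for R's NA_LOGICAL, R_print.na_string and R_print.na_width and are not R's actual definitions. The sketch only checks that returning the literal when the width equals the label length gives the same text as the snprintf() path.

```c
#include <assert.h>
#include <limits.h>
#include <stdio.h>
#include <string.h>

#define NB 1000                  /* stand-in buffer size (assumption) */
#define NA_LOGICAL_DEMO INT_MIN  /* stand-in for R's NA_LOGICAL */

static const char *na_string_demo = "NA"; /* stand-in for CHAR(R_print.na_string) */
static const int na_width_demo = 2;       /* stand-in for R_print.na_width */

static int min_int(int a, int b) { return a < b ? a : b; }

/* General path: right-justify the label in a field of width w via snprintf(). */
static const char *encode_logical_slow(int x, int w)
{
    static char buff[NB];
    const char *s = (x == NA_LOGICAL_DEMO) ? na_string_demo
                                           : (x ? "TRUE" : "FALSE");
    snprintf(buff, NB, "%*s", min_int(w, NB - 1), s);
    buff[NB - 1] = '\0';
    return buff;
}

/* Fast path first: when w equals the label length no padding is needed,
 * so the constant string is returned without touching the buffer. */
static const char *encode_logical_fast(int x, int w)
{
    if (x == NA_LOGICAL_DEMO) { if (w == na_width_demo) return na_string_demo; }
    else if (x) { if (w == 4) return "TRUE"; }
    else { if (w == 5) return "FALSE"; }
    return encode_logical_slow(x, w);
}

/* Asserts that both variants produce identical text for NA, FALSE, TRUE
 * and field widths 0..8. */
static void check_encode_variants(void)
{
    const int values[3] = { NA_LOGICAL_DEMO, 0, 1 };
    for (int i = 0; i < 3; i++) {
        for (int w = 0; w <= 8; w++) {
            char expected[NB];
            /* Copy first: the slow result lives in a shared static buffer. */
            strcpy(expected, encode_logical_slow(values[i], w));
            assert(strcmp(expected, encode_logical_fast(values[i], w)) == 0);
        }
    }
}
```

Calling check_encode_variants() from a test driver exercises both paths. The sketch says nothing about timing inside R, where mkChar() still has to look up the resulting string for every element.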
Martin Maechler
2025-Apr-10 15:53 UTC
[Rd] table() and as.character() performance for logical values
>>>>> Suharto Anggono Suharto Anggono via R-devel
>>>>>     on Thu, 10 Apr 2025 07:53:04 +0000 (UTC) writes:

> Chain of calls of C functions in coerce.c for as.character(<logical>) in R:
>
> do_asatomic
> ascommon
> coerceVector
> coerceToString
> StringFromLogical (for each element)
>
> The definition of 'StringFromLogical' in coerce.c:
>
> attribute_hidden SEXP StringFromLogical(int x, int *warn)
> {
>     int w;
>     formatLogical(&x, 1, &w);
>     if (x == NA_LOGICAL) return NA_STRING;
>     else return mkChar(EncodeLogical(x, w));
> }
>
> The definition of 'EncodeLogical' in printutils.c:
>
> const char *EncodeLogical(int x, int w)
> {
>     static char buff[NB];
>     if(x == NA_LOGICAL) snprintf(buff, NB, "%*s", min(w, (NB-1)), CHAR(R_print.na_string));
>     else if(x) snprintf(buff, NB, "%*s", min(w, (NB-1)), "TRUE");
>     else snprintf(buff, NB, "%*s", min(w, (NB-1)), "FALSE");
>     buff[NB-1] = '\0';
>     return buff;
> }
>
> > L <- sample(c(TRUE, FALSE), 10^7, replace = TRUE)
> > system.time(as.character(L))
>    user  system elapsed
>    2.69    0.02    2.73
> > system.time(c("FALSE", "TRUE")[L+1])
>    user  system elapsed
>    0.15    0.04    0.20
> > system.time(c("FALSE", "TRUE")[L+1L])
>    user  system elapsed
>    0.08    0.05    0.13
> > L <- rep(NA, 10^7)
> > system.time(as.character(L))
>    user  system elapsed
>    0.11    0.00    0.11
> > system.time(c("FALSE", "TRUE")[L+1])
>    user  system elapsed
>    0.16    0.06    0.22
> > system.time(c("FALSE", "TRUE")[L+1L])
>    user  system elapsed
>    0.09    0.03    0.12
>
> `as.character` of a logical vector that is all NA is fast enough.
> It appears that the call to 'formatLogical' inside the C function
> 'StringFromLogical' does not introduce much slowdown.
> I found that using a string literal inside the C function 'StringFromLogical', by replacing
>     EncodeLogical(x, w)
> with
>     x ? "TRUE" : "FALSE"
> (so that the call to 'formatLogical' is not needed anymore), makes it faster.

indeed! ... and we also notice that the 'w' argument is not needed anymore either, and that makes sense: at this point, when you know you have an R logical value, there are only three possibilities and no reason ever to warn about the conversion.

> Alternatively,

or in addition !

> a "fast path" could be introduced in 'EncodeLogical', which would potentially also benefit format() in R.
> For example, without replacing existing code, the following fragment could be inserted.
>
>     if(x == NA_LOGICAL) {if(w == R_print.na_width) return CHAR(R_print.na_string);}
>     else if(x) {if(w == 4) return "TRUE";}
>     else {if(w == 5) return "FALSE";}
>
> However, with either of them, c("FALSE", "TRUE")[L+1L] is still faster than as.character(L).
>
> Precomputing or caching the possible results of the C function 'StringFromLogical' allows as.character(L) to be as fast as c("FALSE", "TRUE")[L+1L] in R. For example, 'StringFromLogical' could be changed to
>
> attribute_hidden SEXP StringFromLogical(int x, int *warn)
> {
>     static SEXP TrueCh, FalseCh;
>     if (x == NA_LOGICAL) return NA_STRING;
>     else if (x) return TrueCh ? TrueCh : (TrueCh = mkChar("TRUE"));
>     else return FalseCh ? FalseCh : (FalseCh = mkChar("FALSE"));
> }

Indeed, and something along this line (storing the other two constant strings) was also my thought when seeing the
    mkChar(x ? "TRUE" : "FALSE")
you implicitly proposed above.

I'm looking into applying both speedups; thank you very much, Suharto!

Martin

--
Martin Maechler
ETH Zurich and R Core team
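
[Editorial sketch, not part of the original thread.] The caching idea discussed above (create the two constant strings once, then hand out the same pointers) can be illustrated with a standalone C mock. The names make_string_demo, string_from_logical_cached, convert_all_demo, alloc_count and NA_LOGICAL_DEMO are hypothetical; make_string_demo merely stands in for mkChar() and NULL stands in for NA_STRING. In R itself a cached CHARSXP would presumably also have to be protected from garbage collection (for example with R_PreserveObject()); that aspect is not modeled here.

```c
#include <assert.h>
#include <limits.h>
#include <stdlib.h>
#include <string.h>

#define NA_LOGICAL_DEMO INT_MIN  /* stand-in for R's NA_LOGICAL */

static int alloc_count = 0;      /* number of "expensive" string creations */

/* Hypothetical stand-in for mkChar(): allocates a copy and counts the call. */
static const char *make_string_demo(const char *s)
{
    char *copy = malloc(strlen(s) + 1);
    assert(copy != NULL);
    strcpy(copy, s);
    alloc_count++;
    return copy;
}

/* Lazily create each constant once and reuse the pointer afterwards.
 * The cached strings are intentionally kept for the life of the process. */
static const char *string_from_logical_cached(int x)
{
    static const char *true_ch = NULL, *false_ch = NULL;
    if (x == NA_LOGICAL_DEMO) return NULL; /* stand-in for NA_STRING */
    else if (x) return true_ch ? true_ch : (true_ch = make_string_demo("TRUE"));
    else return false_ch ? false_ch : (false_ch = make_string_demo("FALSE"));
}

/* Convert n logical values into out[]; returns how many string creations
 * this particular call triggered. */
static int convert_all_demo(const int *x, size_t n, const char **out)
{
    int before = alloc_count;
    for (size_t i = 0; i < n; i++)
        out[i] = string_from_logical_cached(x[i]);
    return alloc_count - before;
}
```

With this pattern the per-element cost reduces to a comparison and a pointer load, which is roughly what the R-level idiom c("FALSE", "TRUE")[L+1L] does when it copies pointers to two existing strings.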