thr3ads.net - R devel - [Rd] table() and as.character() performance for logical values [Apr 2025]

Martin Maechler

2025-Apr-10 15:53 UTC

[Rd] table() and as.character() performance for logical values

>>>>> Suharto Anggono Suharto Anggono via R-devel 
>>>>>     on Thu, 10 Apr 2025 07:53:04 +0000 (UTC) writes:
    > Chain of calls of C functions in coerce.c for as.character(<logical>) in R:
    > 
    > do_asatomic
    > ascommon
    > coerceVector
    > coerceToString
    > StringFromLogical (for each element)
    > 
    > The definition of 'StringFromLogical' in coerce.c :
    > 
    > attribute_hidden SEXP StringFromLogical(int x, int *warn)
    > {
    >     int w;
    >     formatLogical(&x, 1, &w);
    >     if (x == NA_LOGICAL) return NA_STRING;
    >     else return mkChar(EncodeLogical(x, w));
    > }
    > 
    > The definition of 'EncodeLogical' in printutils.c :
    > 
    > const char *EncodeLogical(int x, int w)
    > {
    >     static char buff[NB];
    >     if(x == NA_LOGICAL) snprintf(buff, NB, "%*s", min(w, (NB-1)), CHAR(R_print.na_string));
    >     else if(x) snprintf(buff, NB, "%*s", min(w, (NB-1)), "TRUE");
    >     else snprintf(buff, NB, "%*s", min(w, (NB-1)), "FALSE");
    >     buff[NB-1] = '\0';
    >     return buff;
    > }
    > 
    > > L <- sample(c(TRUE, FALSE), 10^7, replace = TRUE)
    > > system.time(as.character(L))
    >    user  system elapsed
    >    2.69    0.02    2.73
    > > system.time(c("FALSE", "TRUE")[L+1])
    >    user  system elapsed
    >    0.15    0.04    0.20
    > > system.time(c("FALSE", "TRUE")[L+1L])
    >    user  system elapsed
    >    0.08    0.05    0.13
    > > L <- rep(NA, 10^7)
    > > system.time(as.character(L))
    >    user  system elapsed
    >    0.11    0.00    0.11
    > > system.time(c("FALSE", "TRUE")[L+1])
    >    user  system elapsed
    >    0.16    0.06    0.22
    > > system.time(c("FALSE", "TRUE")[L+1L])
    >    user  system elapsed
    >    0.09    0.03    0.12
    > 
    > `as.character` of a logical vector that is all NA is fast enough.
    > It appears that the call to 'formatLogical' inside the C function
    > 'StringFromLogical' does not introduce much slowdown.


    > I found that using a string literal inside the C function 'StringFromLogical', by replacing
    > EncodeLogical(x, w)
    > with
    > x ? "TRUE" : "FALSE"
    > (so the call to 'formatLogical' is not needed anymore), makes it faster.

indeed! ... and we also notice that the 'w' argument is not needed
anymore either, and that makes sense: At this point, when you
know you have an R logical value, there are only three
possibilities and no reason ever to warn about the conversion.
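
A minimal standalone sketch of the two routes compared above, for readers who want to try the difference outside R. It does not use R's headers: the value of NB and the names encode_logical_fmt, encode_logical_lit and logical_width are assumptions made for illustration, and NA handling is left out.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

#define NB 1000  /* assumed buffer size for this sketch */

/* Formatting route: right-justify "TRUE"/"FALSE" in a field of width w,
   the way the quoted EncodeLogical does for non-NA values. */
static const char *encode_logical_fmt(int x, int w)
{
    static char buff[NB];
    assert(w >= 0);
    snprintf(buff, NB, "%*s", w < NB - 1 ? w : NB - 1, x ? "TRUE" : "FALSE");
    buff[NB - 1] = '\0';
    return buff;
}

/* Literal route: no width computation and no formatting call. */
static const char *encode_logical_lit(int x)
{
    return x ? "TRUE" : "FALSE";
}

/* Width that a single non-NA logical needs: 4 for TRUE, 5 for FALSE. */
static int logical_width(int x)
{
    return x ? 4 : 5;
}
```

When the width is exactly the natural width of the value, both routes yield the same characters, which is why the formatting step can be dropped for a single element.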

    > Alternatively,
or in addition !

    > a "fast path" could be introduced in 'EncodeLogical', which potentially also benefits format() in R.
    > For example, without replacing existing code, the following fragment could be inserted.
    > 
    >     if(x == NA_LOGICAL) {if(w == R_print.na_width) return CHAR(R_print.na_string);}
    >     else if(x) {if(w == 4) return "TRUE";}
    >     else {if(w == 5) return "FALSE";}
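
A standalone sketch of how such a fast path sits in front of the general formatting code. It is not R's source: NA_LGL, na_string and na_width are stand-ins for NA_LOGICAL, R_print.na_string and R_print.na_width, NB is an assumed buffer size, and encode_general / encode_fast are hypothetical names.

```c
#include <assert.h>
#include <limits.h>
#include <stdio.h>
#include <string.h>

#define NB 1000            /* assumed buffer size for this sketch */
#define NA_LGL INT_MIN     /* stand-in for R's NA_LOGICAL */

/* Stand-ins for R_print.na_string and R_print.na_width. */
static const char *na_string = "NA";
static int na_width = 2;

static int min_int(int a, int b) { return a < b ? a : b; }

/* General route, as in the quoted EncodeLogical: always format into buff. */
static const char *encode_general(int x, int w)
{
    static char buff[NB];
    const char *s = (x == NA_LGL) ? na_string : (x ? "TRUE" : "FALSE");
    assert(w >= 0);
    snprintf(buff, NB, "%*s", min_int(w, NB - 1), s);
    buff[NB - 1] = '\0';
    return buff;
}

/* Same function with the proposed fast path in front: when the requested
   width equals the natural width, no padding is needed and the constant
   string can be returned directly. */
static const char *encode_fast(int x, int w)
{
    if (x == NA_LGL) { if (w == na_width) return na_string; }
    else if (x)      { if (w == 4) return "TRUE"; }
    else             { if (w == 5) return "FALSE"; }
    return encode_general(x, w);   /* padded cases still go through snprintf */
}
```

Calls that need padding, such as encode_fast(1, 5), still take the snprintf route, so only the exact-width case is short-circuited.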
    > 
    > However, with either of them, c("FALSE", "TRUE")[L+1L] is still faster than as.character(L).
    >
    > Precomputing or caching possible results of the C function 'StringFromLogical' allows as.character(L) to be as fast as c("FALSE", "TRUE")[L+1L] in R. For example, 'StringFromLogical' could be changed to
    > 
    > attribute_hidden SEXP StringFromLogical(int x, int *warn)
    > {
    >     static SEXP TrueCh, FalseCh;
    >     if (x == NA_LOGICAL) return NA_STRING;
    >     else if (x) return TrueCh ? TrueCh : (TrueCh = mkChar("TRUE"));
    >     else return FalseCh ? FalseCh : (FalseCh = mkChar("FALSE"));
    > }

Indeed, and something along this line (storing the other two constant strings)
was also my thought when seeing the
   mkChar(x ? "TRUE" : "FALSE")
you implicitly proposed above.
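
A standalone sketch of the effect of caching the two constant results. It does not use R's headers: mk_char is a hypothetical stand-in for mkChar() that only counts calls (the real mkChar() looks the string up in R's global CHARSXP cache, which is presumably the per-element cost being avoided), NULL plays the role of NA_STRING, and NA_LGL stands in for NA_LOGICAL.

```c
#include <assert.h>
#include <limits.h>
#include <stddef.h>
#include <string.h>

#define NA_LGL INT_MIN   /* stand-in for R's NA_LOGICAL */

/* Hypothetical stand-in for mkChar(): counts how often it is called. */
static long mk_char_calls = 0;
static const char *mk_char(const char *s)
{
    mk_char_calls++;
    return s;
}

/* Uncached: one mk_char() call per non-NA element. */
static const char *string_from_logical(int x)
{
    if (x == NA_LGL) return NULL;          /* NULL plays the role of NA_STRING */
    return mk_char(x ? "TRUE" : "FALSE");
}

/* Cached, following the quoted proposal: each constant is created once. */
static const char *string_from_logical_cached(int x)
{
    static const char *true_ch, *false_ch;
    if (x == NA_LGL) return NULL;
    else if (x) return true_ch ? true_ch : (true_ch = mk_char("TRUE"));
    else return false_ch ? false_ch : (false_ch = mk_char("FALSE"));
}

/* Convert n alternating values and report how many mk_char() calls it took. */
static long count_calls(const char *(*conv)(int), int n)
{
    long before = mk_char_calls;
    for (int i = 0; i < n; i++) {
        const char *s = conv(i % 2);
        assert(strcmp(s, i % 2 ? "TRUE" : "FALSE") == 0);
    }
    return mk_char_calls - before;
}
```

In this model, converting 1000 values costs 1000 mk_char() calls without the cache and 2 with it.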

I'm looking into applying both speedups;
thank you very much, Suharto!

Martin


--
Martin Maechler
ETH Zurich  and  R Core team

Suharto Anggono Suharto Anggono

2025-Apr-11 10:05 UTC

On second thought, I wonder whether the caching in my changed
'StringFromLogical' in my previous message is safe. While 'ans'
in the C function 'coerceToString' is protected, its elements are
protected too. If the object corresponding to 'ans' is then no longer
protected, is it possible for the cached object 'TrueCh' or
'FalseCh' in 'StringFromLogical' to be garbage collected? If it
is, I think of clearing the cache at the start of each fill, that is,
when the first element of a vector is converted. For example, by
abusing the 'warn' argument, the following is added to my changed
'StringFromLogical'.

 if (*warn) TrueCh = FalseCh = NULL;

Correspondingly, in 'coerceToString',

 warn = i == 0;

is inserted before

 SET_STRING_ELT(ans, i, StringFromLogical(LOGICAL_ELT(v, i), &warn));

for LGLSXP case.
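
A standalone model of this reset protocol, to make the control flow concrete. It says nothing about whether the reset is sufficient for garbage-collection safety in R itself. mk_char is a hypothetical counting stand-in for mkChar(), NULL plays the role of NA_STRING, NA_LGL stands in for NA_LOGICAL, and coerce_to_string models only the LGLSXP loop.

```c
#include <assert.h>
#include <limits.h>
#include <stddef.h>

#define NA_LGL INT_MIN   /* stand-in for R's NA_LOGICAL */

static long mk_char_calls = 0;           /* hypothetical call counter */
static const char *mk_char(const char *s) { mk_char_calls++; return s; }

/* Cached conversion that drops its cache whenever *warn is nonzero. */
static const char *string_from_logical(int x, int *warn)
{
    static const char *true_ch, *false_ch;
    if (*warn) true_ch = false_ch = NULL;   /* start of a new vector */
    if (x == NA_LGL) return NULL;           /* NULL plays the role of NA_STRING */
    else if (x) return true_ch ? true_ch : (true_ch = mk_char("TRUE"));
    else return false_ch ? false_ch : (false_ch = mk_char("FALSE"));
}

/* Model of the LGLSXP loop in coerceToString: warn is set only for i == 0. */
static void coerce_to_string(const int *v, size_t n, const char **ans)
{
    for (size_t i = 0; i < n; i++) {
        int warn = (i == 0);
        ans[i] = string_from_logical(v[i], &warn);
    }
}

/* Number of mk_char() calls needed to convert one vector. */
static long calls_for(const int *v, size_t n, const char **ans)
{
    long before = mk_char_calls;
    coerce_to_string(v, n, ans);
    return mk_char_calls - before;
}
```

Each vector then costs at most two mk_char() calls, and no cached pointer is reused across vectors. R's C API also offers R_PreserveObject for keeping a long-lived object alive; whether that would fit here is not discussed in the thread.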

Suharto Anggono Suharto Anggono

2025-Apr-12 08:27 UTC

For the NA case (x == NA_LOGICAL), if R_print.na_width > NB-1, the "fast
path" for 'EncodeLogical' that I proposed previously behaves
differently from the general case, which truncates at (NB-1).

To be consistent with the general case,
if(w == R_print.na_width)
can be replaced with
if(w == R_print.na_width && w <= NB-1)
or
if(min(w, (NB-1)) == R_print.na_width)
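
A standalone sketch of the inconsistency and of the first guard (w <= NB-1). NB is set to a deliberately tiny value so that truncation is visible; na_string and na_width are stand-ins for R_print.na_string and R_print.na_width, and the function names are hypothetical.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

#define NB 8   /* deliberately tiny so that truncation is easy to see */

/* Stand-ins for R_print.na_string and R_print.na_width (9 > NB-1). */
static const char *na_string = "<missing>";
static int na_width = 9;

static int min_int(int a, int b) { return a < b ? a : b; }

/* General NA route: snprintf keeps at most NB-1 characters. */
static const char *encode_na_general(int w)
{
    static char buff[NB];
    assert(w >= 0);
    snprintf(buff, NB, "%*s", min_int(w, NB - 1), na_string);
    buff[NB - 1] = '\0';
    return buff;
}

/* Fast path as first proposed: returns the untruncated string. */
static const char *encode_na_fast(int w)
{
    if (w == na_width) return na_string;
    return encode_na_general(w);
}

/* Fast path with the extra guard: falls back when w exceeds NB-1. */
static const char *encode_na_fast_guarded(int w)
{
    if (w == na_width && w <= NB - 1) return na_string;
    return encode_na_general(w);
}
```

With these values the general route and the guarded fast path both give the 7-character prefix of the NA string, while the unguarded fast path gives all 9 characters.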

Or, just remove the "fast path" for NA case. For example, replace

    if(x == NA_LOGICAL) {if(w == R_print.na_width) return CHAR(R_print.na_string);}

with

    if(x == NA_LOGICAL) ;


By the way, the comment in 'formatLogical' implies that 5 "is the
widest it can be, so stop". It is not true if R_print.na_width > 5 .

The output of
print(c(FALSE, NA), na.print = "******")
is not as it should be.
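
A toy model of a field-width scan with such an early exit, to show how a wide NA string can be missed. This is not the source of 'formatLogical': it only assumes, based on the comment quoted above, that the scan stops at the first FALSE. NA_LGL stands in for NA_LOGICAL and field_width is a hypothetical name.

```c
#include <assert.h>
#include <limits.h>
#include <stddef.h>

#define NA_LGL INT_MIN   /* stand-in for R's NA_LOGICAL */

/* Toy width scan. If stop_at_false is nonzero, the loop exits at the first
   FALSE on the assumption that 5 is the widest possible field. */
static int field_width(const int *x, size_t n, int na_width, int stop_at_false)
{
    int w = 1;
    for (size_t i = 0; i < n; i++) {
        if (x[i] == NA_LGL) {
            if (w < na_width) w = na_width;
        } else if (x[i] != 0) {
            if (w < 4) w = 4;
        } else {
            if (w < 5) w = 5;
            if (stop_at_false) break;
        }
    }
    return w;
}
```

For the vector (FALSE, NA) with an NA string of width 6, the early exit reports width 5, whereas scanning the whole vector reports 6.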



