Suharto Anggono
2025-Apr-09 06:26 UTC
[Rd] table() and as.character() performance for logical values
With the change to 'factor',
factor(1L, levels = TRUE)
doesn't give NA, different from
factor(1, levels = TRUE)

With the change to 'factor',
factor(TRUE, levels = 1L)
and
factor(TRUE, levels = 1)
don't give NA.

With the change to 'factor',
factor(2L, levels = sqrt(2)^2)
gives NA, different from
factor(2, levels = sqrt(2)^2)

With the change to 'factor',
factor(2L, exclude = sqrt(2)^2)
has 1 level (nothing is excluded), different from
factor(2, exclude = sqrt(2)^2)
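These differences boil down to how match() coerces mixed types once
as.character() is skipped for integer/logical 'x'. A minimal sketch
(illustrative only, relying on base R's documented coercion rules):

match(1L, TRUE)       # 1  -- TRUE is coerced to integer, so it matches
match("1", "TRUE")    # NA -- the old character-based comparison
match(2L, sqrt(2)^2)  # NA -- as doubles, 2 != 2.0000000000000004
match(as.character(2L), as.character(sqrt(2)^2))  # 1 -- both become "2"

The 'exclude' case is analogous, since levels are filtered via
match(levels, exclude) in the patch below.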
------------
Am 21.03.25 um 15:42 schrieb Aidan Lakshman via R-devel:
>> After investigating the source of table, I ended up on the reason being 'as.character()':
> 
> This is specifically happening within the conversion of the input to type factor, which is where the as.character conversion happens.

Yes, I also think 'factor' could do a bit better for unclassed integers
(such as when called from 'cut') as well as for logical input (such as
from 'summary' -> 'table').

Note that 'as.factor' already has a "fast track" for plain integers
(originally for 'split.default' from 'tapply'), so it can be used instead
of 'factor' when there is no need for custom 'levels', 'labels', or
'exclude'. (Thanks for already mentioning 'tabulate'.)

A 'factor' patch would apply more broadly, e.g.:

==================================================================
--- src/library/base/R/factor.R	(Revision 88042)
+++ src/library/base/R/factor.R	(Arbeitskopie)
@@ -20,14 +20,18 @@
               exclude = NA, ordered = is.ordered(x), nmax = NA)
 {
     if(is.null(x)) x <- character()
+    directmatch <- !is.object(x) &&
+        (is.character(x) || is.integer(x) || is.logical(x))
     nx <- names(x)
     if (missing(levels)) {
 	y <- unique(x, nmax = nmax)
 	ind <- order(y)
-	levels <- unique(as.character(y)[ind])
+	if (!directmatch)
+	    y <- as.character(y)
+	levels <- unique(y[ind])
     }
     force(ordered) # check if original x is an ordered factor
-    if(!is.character(x))
+    if(!directmatch)
 	x <- as.character(x)
     ## levels could be a long vector, but match will not handle that.
     levels <- levels[is.na(match(levels, exclude))]
     f <- match(x, levels)
==================================================================

This skips as.character() also for integer/logical 'x' and would indeed
bring table() runtimes "in order":

set.seed(1)
C <- sample(c("no", "yes"), 10^7, replace = TRUE)
F <- as.factor(C)
L <- F == "yes"
I <- as.integer(L)
N <- as.numeric(I)

## Median system.time(table(.)) in ms:
## table(F)  256
## table(I)  384  # not  696
## table(L)  409  # not 1159
## table(C)  591
## table(N) 3324

The (seemingly) small patch passes check-all, but maybe it overlooks
some edge cases. I'd test it on a subset of CRAN/BIOC packages.

Best,

	Sebastian Meyer

> 
> # Timing is all on my local machine (OSX)
> N_v <- sample(c(1,0), 10^7, replace = TRUE)
> L_v <- sample(c(TRUE, FALSE), 10^7, replace = TRUE)
> #    user  system elapsed
> system.time(table(N_v))  # 2.155   0.039   2.192
> system.time(table(L_v))  # 0.806   0.030   0.838
> 
> system.time(N_fv <- as.factor(N_v))  # 2.026   0.024   2.050
> system.time(L_fv <- as.factor(L_v))  # 0.668   0.015   0.683
> 
> system.time(table(N_fv))  # 0.133   0.022   0.156
> system.time(table(L_fv))  # 0.134   0.018   0.151
> 
>> The performance for integers and especially booleans is quite surprising.
> 
> Of note is that the performance is significantly better if using `tabulate`, since this doesn't involve a conversion to factor (though input must be numeric/factor, results aren't named, and it has worse handling of NA values). If you have performance-critical calls like this, you could consider using `tabulate` instead.
> 
> system.time(tabulate(N_v))              # 0.054   0.002   0.056
> system.time(tabulate(as.integer(L_v)))  # 0.052   0.002   0.055
> 
> 
> I don't know if this is a known issue or not; most of my colleagues are aware of the slow-down and use `tabulate` when performance is required. My understanding was that the slower performance is a trade-off for more consistent behavior (better output, better handling of ambiguities/NA, etc.), and that speed isn't the highest priority with `table`. Maybe someone else has a better understanding of the history of the function.
> 
> As for improving the speed, it would basically come down to refactoring `table` to not use a `factor` conversion. I'd be concerned about introducing a lot of edge cases with that, but it's theoretically possible. Based on 30 seconds of thinking, it may be possible to do something like:
> 
> ## just a sketch of a barebones non-factor implementation
> test_tab <- function(x){
>   lookup <- unique(x)                   # candidate categories
>   counts <- tabulate(match(x, lookup))  # count occurrences per category
>   names(counts) <- as.character(lookup)
>   counts
> }
> 
> system.time(test_tab(L_v))  # 0.101   0.006   0.107
> system.time(test_tab(N_v))  # 0.129   0.015   0.144
> 
> This is also faster in the case where there are lots of categories with few entries per category:
> 
> N_v2 <- 1:1e7
> system.time(test_tab(N_v2))  # 0.383   0.024   0.411
> system.time(table(N_v2))     # 6.122   0.228   6.398
> 
> Obviously there are some big shortcomings:
> - it's missing a lot of error checking etc. that the standard `table` has
> - it only works with 1D vectors
> - NA handling isn't quite the same as `table` (though it would be easy to adapt)
> 
> Just including this to potentially start discussion for optimization.
> 
> For reference, the relevant section is in src/library/base/R/table.R:L75-85
> 
> -Aidan
> 
> -----------------------
> Aidan Lakshman (he/him)
> http://www.ahl27.com/
> 
> On 21 Mar 2025, at 8:26, Karolis Koncevičius wrote:
> 
>> I was calling table() on some long logical vectors and noticed that it took a long time.
>> 
>> Out of curiosity I checked the performance of table() on different types, and had some unexpected results:
>> 
>> C <- sample(c("yes", "no"), 10^7, replace = TRUE)
>> F <- factor(sample(c("yes", "no"), 10^7, replace = TRUE))
>> N <- sample(c(1,0), 10^7, replace = TRUE)
>> I <- sample(c(1L,0L), 10^7, replace = TRUE)
>> L <- sample(c(TRUE, FALSE), 10^7, replace = TRUE)
>> 
>> # ordered by execution time
>> #    user  system elapsed
>> system.time(table(F))  # 0.088   0.006   0.093
>> system.time(table(C))  # 0.208   0.017   0.224
>> system.time(table(I))  # 0.242   0.019   0.261
>> system.time(table(L))  # 0.665   0.015   0.680
>> system.time(table(N))  # 1.771   0.019   1.791
>> 
>> 
>> The performance for integers and especially booleans is quite surprising.
>> After investigating the source of table, I ended up on the reason being 'as.character()':
>> 
>> system.time(as.character(L))
>>    user  system elapsed
>>   0.461   0.002   0.462
>> 
>> Even a manual conversion can achieve a speed-up by a factor of ~7:
>> 
>> system.time(c("FALSE", "TRUE")[L+1])
>>    user  system elapsed
>>   0.061   0.006   0.067
>> 
>> 
>> Tested on 4.4.3 as well as devel trunk.
>> 
>> Just reporting for comments and attention.
>> Karolis K.
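To see the 'as.factor' fast track for plain integers mentioned above in
action, a minimal sketch (variable name is illustrative; timings are
machine-dependent and only indicative):

I2 <- sample(c(1L, 0L), 10^7, replace = TRUE)
system.time(as.factor(I2))  # fast track: no as.character() conversion
system.time(factor(I2))     # pre-patch: goes through as.character() first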
Sebastian Meyer
2025-Apr-09 21:51 UTC
[Rd] table() and as.character() performance for logical values
Right, thanks! These are non-standard uses of factor(), the kind of edge
cases I alluded to. We didn't see any problems with the patch in existing
tests or in CRAN/BIOC package checks.

Note that 'levels' is documented as an optional vector of the unique
values (as character strings) that 'x' might have taken. E.g.,
factor(1L, levels = "1") is identical to factor(1, levels = "1").
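A quick check of that documented usage (a minimal sketch; match()
coerces an integer 'x' up to character, so the character-'levels' path
is unaffected by the patch):

identical(factor(1L, levels = "1"), factor(1, levels = "1"))  # TRUE
match(1L, "1")  # 1 -- integer 'x' is coerced to "1" to meet the levels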
Using integers (or logicals) for *both* 'x' and 'levels' still works (it
is used by, e.g., cut.default() internally, is more efficient now, and
could be documented), but that 2L fails to match sqrt(2)^2 doesn't really
come as a surprise.

I'm not sure it is worth special-casing integer/logical 'x' with specified
non-character 'levels' of a non-conforming type, but yes, *not* skipping
as.character() in that case would restore the more consistent (though
undocumented) behaviour from before the performance patch. Maybe something
to consider for 4.5.1.

	Sebastian Meyer

Am 09.04.25 um 08:26 schrieb Suharto Anggono via R-devel:
> With the change to 'factor',
> factor(1L, levels = TRUE)
> doesn't give NA, different from
> factor(1, levels = TRUE)
> 
> With the change to 'factor',
> factor(TRUE, levels = 1L)
> and
> factor(TRUE, levels = 1)
> don't give NA.
> 
> With the change to 'factor',
> factor(2L, levels = sqrt(2)^2)
> gives NA, different from
> factor(2, levels = sqrt(2)^2)
> 
> With the change to 'factor',
> factor(2L, exclude = sqrt(2)^2)
> has 1 level (nothing is excluded), different from
> factor(2, exclude = sqrt(2)^2)
> 
> [...]