Karl Ove Hufthammer
2011-Jan-19 10:38 UTC
[R] table on factors with non-ASCII characters *extremely* slow on Windows
Running ‘table’ on a factor with levels containing non-ASCII characters seems to result in *extremely* bad performance on Windows. Here’s a simple example with benchmark results (I’ve reduced the number of replications to make the function finish within reasonable time):

    library(rbenchmark)

    x.num <- sample(1:2, 10^5, replace=TRUE)
    x.fac.ascii <- factor(x.num, levels=1:2, labels=c("A","B"))
    # The two non-ASCII labels were lost to an encoding error in the archived
    # post (they appear as "?"); "æ" and "ø" are stand-ins, not the originals.
    x.fac.nascii <- factor(x.num, levels=1:2, labels=c("æ","ø"))

    benchmark(
      table(x.num),
      table(x.fac.ascii),
      table(x.fac.nascii),
      table(unclass(x.fac.nascii)),
      replications=20
    )

                              test replications elapsed   relative user.self sys.self user.child sys.child
    4 table(unclass(x.fac.nascii))           20    1.53   4.636364      1.51     0.01         NA        NA
    2           table(x.fac.ascii)           20    0.33   1.000000      0.33     0.00         NA        NA
    3          table(x.fac.nascii)           20  146.67 444.454545     38.52    81.74         NA        NA
    1                 table(x.num)           20    1.55   4.696970      1.53     0.01         NA        NA

    sessionInfo()
    R version 2.12.1 (2010-12-16)
    Platform: i386-pc-mingw32/i386 (32-bit)

    locale:
    [1] LC_COLLATE=Norwegian-Nynorsk_Norway.1252  LC_CTYPE=Norwegian-Nynorsk_Norway.1252    LC_MONETARY=Norwegian-Nynorsk_Norway.1252
    [4] LC_NUMERIC=C                              LC_TIME=Norwegian-Nynorsk_Norway.1252

    attached base packages:
    [1] stats     graphics  grDevices datasets  utils     methods   base

    other attached packages:
    [1] rbenchmark_0.3

Running the same test (but with 100 replications) on a Linux system with R 2.12.1 Patched shows practically no difference in performance between ASCII factors and non-ASCII factors:

                              test replications elapsed relative user.self sys.self user.child sys.child
    4 table(unclass(x.fac.nascii))          100   4.607 3.096102     4.455    0.092          0         0
    2           table(x.fac.ascii)          100   1.488 1.000000     1.459    0.028          0         0
    3          table(x.fac.nascii)          100   1.616 1.086022     1.560    0.051          0         0
    1                 table(x.num)          100   4.504 3.026882     4.403    0.079          0         0

    sessionInfo()
    R version 2.12.1 Patched (2011-01-18 r54033)
    Platform: i686-pc-linux-gnu (32-bit)

    locale:
     [1] LC_CTYPE=nn_NO.UTF-8       LC_NUMERIC=C               LC_TIME=nn_NO.UTF-8
     [4] LC_COLLATE=nn_NO.UTF-8     LC_MONETARY=C              LC_MESSAGES=nn_NO.UTF-8
     [7] LC_PAPER=nn_NO.UTF-8       LC_NAME=C                  LC_ADDRESS=C
    [10] LC_TELEPHONE=C             LC_MEASUREMENT=nn_NO.UTF-8 LC_IDENTIFICATION=C

    attached base packages:
    [1] stats     graphics  grDevices utils     datasets  methods   base

    other attached packages:
    [1] rbenchmark_0.3

Can anyone else reproduce this? I see no theoretical reason why the level names of the factor should matter, as factors are stored internally as numeric values (cf. the results of ‘table(unclass(x.fac.nascii))’), and the level names are only needed when displaying the results.

BTW, ‘tabulate’ on x.fac.nascii is extremely fast, on both Windows and Linux. I guess that, at least for simple cases, one could use something like

    res <- tabulate(x.fac.nascii, nbins=nlevels(x.fac.nascii))
    names(res) <- levels(x.fac.nascii)

though I’m not entirely sure the internal structure of factors is guaranteed to be such that this will always work.

Any comments or suggestions?

-- 
Karl Ove Hufthammer
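
The tabulate-based workaround above can be sketched as a small helper and checked against ‘table’. This is only an illustration under stated assumptions: the name fast_table is hypothetical (it is not part of base R), the labels "\u00e6" and "\u00f8" are arbitrary non-ASCII stand-ins, and the helper covers only the one-way case for a single factor. Unlike ‘table’, it returns a plain named integer vector rather than an object of class "table".

```r
# Hypothetical helper (not part of base R): one-way counts via tabulate().
fast_table <- function(f) {
  stopifnot(is.factor(f))
  # tabulate() counts the integer codes of the factor; NA values are ignored,
  # which matches the default behaviour of table() for factors.
  counts <- tabulate(f, nbins = nlevels(f))
  names(counts) <- levels(f)
  counts
}

set.seed(1)
x.num <- sample(1:2, 10^5, replace = TRUE)
# Stand-in non-ASCII labels, written as Unicode escapes for portability.
x.fac <- factor(x.num, levels = 1:2, labels = c("\u00e6", "\u00f8"))

res <- fast_table(x.fac)
ref <- table(x.fac)

# The counts and the level names should agree with table().
stopifnot(all(res == as.vector(ref)),
          identical(names(res), names(ref)))
```

Passing nbins = nlevels(f) matters: without it, trailing levels that never occur in the data would be dropped from the result, and the names would no longer line up with the counts.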