thr3ads.net - R help - [R] table on factors with non-ASCII characters *extremely* slow on Windows [Jan 2011]

Karl Ove Hufthammer

2011-Jan-19 10:38 UTC

[R] table on factors with non-ASCII characters extremely slow on Windows

Running ‘table’ on a factor with levels containing non-ASCII characters
seems to result in *extremely* bad performance on Windows. Here’s a simple
example with benchmark results (I’ve reduced the number of replications to
make the function finish within reasonable time):

  library(rbenchmark)
  x.num=sample(1:2, 10^5, replace=TRUE)
  x.fac.ascii=factor(x.num, levels=1:2, labels=c("A","B"))
  x.fac.nascii=factor(x.num, levels=1:2, labels=c("?","?"))
  benchmark( table(x.num), table(x.fac.ascii), table(x.fac.nascii),
             table(unclass(x.fac.nascii)), replications=20 )
  
                            test replications elapsed   relative user.self sys.self user.child sys.child
  4 table(unclass(x.fac.nascii))           20    1.53   4.636364      1.51     0.01         NA        NA
  2           table(x.fac.ascii)           20    0.33   1.000000      0.33     0.00         NA        NA
  3          table(x.fac.nascii)           20  146.67 444.454545     38.52    81.74         NA        NA
  1                 table(x.num)           20    1.55   4.696970      1.53     0.01         NA        NA
  
  sessionInfo()
  R version 2.12.1 (2010-12-16)
  Platform: i386-pc-mingw32/i386 (32-bit)
  
  locale:
  [1] LC_COLLATE=Norwegian-Nynorsk_Norway.1252  LC_CTYPE=Norwegian-Nynorsk_Norway.1252    LC_MONETARY=Norwegian-Nynorsk_Norway.1252
  [4] LC_NUMERIC=C                              LC_TIME=Norwegian-Nynorsk_Norway.1252
  
  attached base packages:
  [1] stats     graphics  grDevices datasets  utils     methods   base    
  
  other attached packages:
  [1] rbenchmark_0.3

Running the same test (but with 100 replications) on a Linux system with
R 2.12.1 Patched shows no difference in performance between
ASCII factors and non-ASCII factors:

                            test replications elapsed relative user.self sys.self user.child sys.child
  4 table(unclass(x.fac.nascii))          100   4.607 3.096102     4.455    0.092          0         0
  2           table(x.fac.ascii)          100   1.488 1.000000     1.459    0.028          0         0
  3          table(x.fac.nascii)          100   1.616 1.086022     1.560    0.051          0         0
  1                 table(x.num)          100   4.504 3.026882     4.403    0.079          0         0

  sessionInfo()
  R version 2.12.1 Patched (2011-01-18 r54033)
  Platform: i686-pc-linux-gnu (32-bit)
  
  locale:
   [1] LC_CTYPE=nn_NO.UTF-8       LC_NUMERIC=C               LC_TIME=nn_NO.UTF-8
   [4] LC_COLLATE=nn_NO.UTF-8     LC_MONETARY=C              LC_MESSAGES=nn_NO.UTF-8
   [7] LC_PAPER=nn_NO.UTF-8       LC_NAME=C                  LC_ADDRESS=C
  [10] LC_TELEPHONE=C             LC_MEASUREMENT=nn_NO.UTF-8 LC_IDENTIFICATION=C
  
  attached base packages:
  [1] stats     graphics  grDevices utils     datasets  methods   base     

  other attached packages:
  [1] rbenchmark_0.3

Can anyone else reproduce this? I see no theoretical reason why the level
names of the factor should matter, as the factors are internally stored
as numeric values (cf. the results of ‘table(unclass(x.fac.nascii))’), and
the level names are only needed when displaying the results.
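
The point about internal storage can be checked directly. This is only a
minimal sketch: the variable name f and the ASCII labels are placeholders
chosen for illustration, not taken from the benchmark above.

```r
# A small factor; the labels are ASCII placeholders chosen for illustration.
f <- factor(c(1, 2, 2, 1, 2), levels = 1:2, labels = c("A", "B"))

typeof(f)       # storage type of the factor: integer codes
unclass(f)      # the codes, carrying a "levels" attribute
levels(f)       # the labels, stored once as an attribute
as.integer(f)   # the bare codes, without any attributes
```

The labels only appear in the "levels" attribute, so in principle counting
the codes should not need to touch them at all.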

BTW, ‘tabulate’ on x.fac.nascii is extremely fast, on both Windows and
Linux. I guess at least for simple cases one could use something like

  res=tabulate(x.fac.nascii, nbins=nlevels(x.fac.nascii))
  names(res)=levels(x.fac.nascii)

though I’m not entirely sure the internal structure of factors is
guaranteed to be such that this will always work.
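
The workaround could be wrapped in a small function and checked against
‘table’. This is a sketch under assumptions: the name fast_table is
hypothetical (it is not a base R function), the labels are ASCII
placeholders, and the comparison only covers the one-dimensional case with
the default NA handling.

```r
# Hypothetical helper (not part of base R): count factor levels via tabulate().
fast_table <- function(f) {
  stopifnot(is.factor(f))
  # tabulate() counts the integer codes; NA codes are skipped,
  # which matches what table() does by default.
  res <- tabulate(f, nbins = nlevels(f))
  names(res) <- levels(f)
  res
}

x.num <- sample(1:2, 10^5, replace = TRUE)
x.fac <- factor(x.num, levels = 1:2, labels = c("A", "B"))
counts <- fast_table(x.fac)

# Same counts as table(), compared as plain integer vectors.
stopifnot(identical(unname(counts), as.vector(table(x.fac))))
```

Passing nbins = nlevels(f) means levels that never occur still get a zero
count, as they would with ‘table’. The result is a named integer vector,
not a "table" object, so code that relies on the class would need adjusting.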

Any comments or suggestions?

-- 
Karl Ove Hufthammer
