thr3ads.net - R devel - [Rd] match function causing bad performance when using table function on factors with multibyte characters on Windows [Jan 2011]

Karl Ove Hufthammer

2011-Jan-21 09:47 UTC

[Rd] match function causing bad performance when using table function on factors with multibyte characters on Windows

[I originally posted this on the R-help mailing list, and it was suggested
that R-devel would be a better place to discuss it.]

Running 'table' on a factor with levels containing non-ASCII characters
seems to result in extremely bad performance on Windows. Here's a simple
example with benchmark results (I've reduced the number of replications to
make the function finish within reasonable time):

  library(rbenchmark)
  x.num=sample(1:2, 10^5, replace=TRUE)
  x.fac.ascii=factor(x.num, levels=1:2, labels=c("A","B"))
  x.fac.nascii=factor(x.num, levels=1:2, labels=c("?","?"))
  benchmark( table(x.num), table(x.fac.ascii), table(x.fac.nascii),
             table(unclass(x.fac.nascii)), replications=20 )
  
                            test replications elapsed   relative user.self sys.self user.child sys.child
  4 table(unclass(x.fac.nascii))           20    1.53   4.636364      1.51     0.01         NA        NA
  2           table(x.fac.ascii)           20    0.33   1.000000      0.33     0.00         NA        NA
  3          table(x.fac.nascii)           20  146.67 444.454545     38.52    81.74         NA        NA
  1                 table(x.num)           20    1.55   4.696970      1.53     0.01         NA        NA
  
  sessionInfo()
  R version 2.12.1 (2010-12-16)
  Platform: i386-pc-mingw32/i386 (32-bit)
  
  locale:
  [1] LC_COLLATE=Norwegian-Nynorsk_Norway.1252  LC_CTYPE=Norwegian-Nynorsk_Norway.1252
  [3] LC_MONETARY=Norwegian-Nynorsk_Norway.1252 LC_NUMERIC=C
  [5] LC_TIME=Norwegian-Nynorsk_Norway.1252
  
  attached base packages:
  [1] stats     graphics  grDevices datasets  utils     methods   base    
  
  other attached packages:
  [1] rbenchmark_0.3

The timings are from R 2.12.1, but I also get comparable results
on the latest prerelease (R 2.13.0 2011-01-18 r54032).

Running the same test (100 replications) on a Linux system with
R 2.12.1 Patched shows essentially no difference in performance
between ASCII factors and non-ASCII factors:

                            test replications elapsed relative user.self sys.self user.child sys.child
  4 table(unclass(x.fac.nascii))          100   4.607 3.096102     4.455    0.092          0         0
  2           table(x.fac.ascii)          100   1.488 1.000000     1.459    0.028          0         0
  3          table(x.fac.nascii)          100   1.616 1.086022     1.560    0.051          0         0
  1                 table(x.num)          100   4.504 3.026882     4.403    0.079          0         0

  sessionInfo()
  R version 2.12.1 Patched (2011-01-18 r54033)
  Platform: i686-pc-linux-gnu (32-bit)
  
  locale:
   [1] LC_CTYPE=nn_NO.UTF-8       LC_NUMERIC=C               LC_TIME=nn_NO.UTF-8
   [4] LC_COLLATE=nn_NO.UTF-8     LC_MONETARY=C              LC_MESSAGES=nn_NO.UTF-8
   [7] LC_PAPER=nn_NO.UTF-8       LC_NAME=C                  LC_ADDRESS=C
  [10] LC_TELEPHONE=C             LC_MEASUREMENT=nn_NO.UTF-8 LC_IDENTIFICATION=C
  
  attached base packages:
  [1] stats     graphics  grDevices utils     datasets  methods   base     

  other attached packages:
  [1] rbenchmark_0.3

Profiling the 'table' function indicates that almost all the time is spent in
the 'match' function, which is used when 'factor' is called on a 'factor'
inside 'table'. Indeed, 'x.fac.nascii = factor(x.fac.nascii)' by itself
is extremely slow.
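
For reference, a profiling run along these lines might look like the sketch below. The labels "\u00e6" and "\u00f8" are stand-ins chosen for illustration, and the final 'match' call is only an approximation of the work 'factor' does when it is given a factor, not a copy of its internals.

```r
# Sketch: profile table() on a factor and inspect where the time goes.
set.seed(1)
x.num <- sample(1:2, 10^5, replace = TRUE)
# Placeholder non-ASCII labels (an assumption made for this example)
x.fac <- factor(x.num, levels = 1:2, labels = c("\u00e6", "\u00f8"))

prof.file <- tempfile()
Rprof(prof.file)                      # start the sampling profiler
for (i in 1:20) tab <- table(x.fac)
Rprof(NULL)                           # stop profiling

# summaryRprof() can fail if the run was too short to record any samples,
# so the call is guarded here.
prof <- tryCatch(summaryRprof(prof.file), error = function(e) NULL)
if (!is.null(prof)) print(head(prof$by.total))

# Time the suspected inner steps on their own.
print(system.time(for (i in 1:20) factor(x.fac)))
codes <- match(as.character(x.fac), levels(x.fac))
```

On a system that shows the problem, 'match' would be expected near the top of the 'by.total' listing.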

Is there any theoretical reason 'factor' on 'factor' with non-ASCII
characters must be so slow? And why doesn't this happen on Linux?

Perhaps a fix for 'table' might be calculating the 'table' statistics
*including* all levels (not using the 'factor' function anywhere),
and then removing the 'exclude' levels in the end. For example,
something along these lines:

res = table.modified.to.not.use.factor(...)
ind = lapply(dimnames(res), function(x) !(x %in% exclude))
do.call("[", c(list(res), ind, drop=FALSE))

(I haven't tested this very much, so there may be issues with this
way of doing things.)
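
To make the idea concrete for the simplest case, here is a minimal sketch for a single factor. The helper name 'table.by.codes' is hypothetical, and this is not how 'table' is implemented: it counts the integer codes with 'tabulate', so neither 'factor' nor 'match' is called on the data, and the excluded levels are dropped only at the end.

```r
# Hypothetical helper: one-dimensional counts for a factor, computed
# from the integer codes instead of re-running factor() on the data.
table.by.codes <- function(f, exclude = NULL) {
  stopifnot(is.factor(f))
  lev <- levels(f)
  # Count the codes for *all* levels; NA codes are ignored by tabulate()
  counts <- tabulate(as.integer(f), nbins = length(lev))
  # Remove the excluded levels at the end; this %in% runs over the
  # levels only, not over the full data vector
  keep <- !(lev %in% exclude)
  res <- array(counts[keep], dim = sum(keep), dimnames = list(lev[keep]))
  class(res) <- "table"
  res
}

f <- factor(c("a", "b", "b", "c"), levels = c("a", "b", "c", "d"))
res.all  <- table.by.codes(f)
res.excl <- table.by.codes(f, exclude = c("a", "d"))
print(res.all)
print(res.excl)
```

A full replacement would also have to handle several factors at once, NA values and the 'dnn' names, none of which this sketch attempts.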

-- 
Karl Ove Hufthammer

Matthew Dowle

2011-Jan-24 17:30 UTC

I'm not sure, but note the difference in locale between
Linux (UTF-8) and Windows (non UTF-8). As far as I
understand it R much prefers UTF-8, which Windows doesn't
natively support. Otherwise you could just change your
Windows locale to a UTF-8 locale to make R happier.

My stab in the dark would be that the poor performance on
Windows in this case may be down to many calls to
translateCharUTF8 internally.
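
To make the encoding question concrete, here is a small sketch of how R marks strings and how 'match' still treats differently marked strings as equal. The example characters are placeholders, and the code only shows the marks; it does not demonstrate that translateCharUTF8 is the bottleneck, which remains a guess.

```r
# Sketch: encoding marks on character vectors.
lev.ascii  <- c("A", "B")
lev.utf8   <- c("\u00e6", "\u00f8")   # placeholder non-ASCII strings
# Convert to Latin-1 bytes; iconv() marks the result as "latin1"
lev.latin1 <- iconv(lev.utf8, from = "UTF-8", to = "latin1")

print(Encoding(lev.ascii))    # ASCII strings carry no mark ("unknown")
print(Encoding(lev.utf8))
print(Encoding(lev.latin1))

# The byte representations differ ...
same.bytes <- identical(charToRaw(lev.utf8[1]), charToRaw(lev.latin1[1]))
# ... but match() compares the strings as characters, which needs a
# common encoding for both sides.
pos <- match(lev.latin1, lev.utf8)
print(same.bytes)
print(pos)
```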

There was a change in R 2.12.0 in this area. Running your
test in R 2.11.1 on Windows shows the same problem though
so it doesn't look like that change caused this problem.

From NEWS 2.12.0:

  o unique() and match() are now faster on character vectors
    where all elements are in the global CHARSXP cache and
    have unmarked encoding (ASCII). Thanks to Matthew
    Dowle for suggesting improvements to the way the hash
    code is generated in 'unique.c'.

If anybody knows a way to trick R on Linux into thinking it has
an encoding similar to Windows then I may be able to take a
look if I can reproduce the problem in Linux.
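
One possible approximation, offered as an untested idea and not as a confirmed reproduction: build the factor with labels that are explicitly marked as latin1, so that a UTF-8 session has to deal with non-UTF-8 level strings. The labels are placeholders, and whether this triggers the same slow path as a Windows-1252 locale is an assumption.

```r
# Sketch: a factor whose levels are marked "latin1" inside any session.
set.seed(1)
x.num <- sample(1:2, 10^5, replace = TRUE)
lab.utf8   <- c("\u00e6", "\u00f8")   # placeholder non-ASCII labels
lab.latin1 <- iconv(lab.utf8, from = "UTF-8", to = "latin1")

x.fac.ascii  <- factor(x.num, levels = 1:2, labels = c("A", "B"))
x.fac.latin1 <- factor(x.num, levels = 1:2, labels = lab.latin1)
print(Encoding(levels(x.fac.latin1)))

# Compare timings; no particular result is claimed here.
t.ascii  <- system.time(for (i in 1:20) tab.ascii  <- table(x.fac.ascii))
t.latin1 <- system.time(for (i in 1:20) tab.latin1 <- table(x.fac.latin1))
print(rbind(ascii = t.ascii, latin1 = t.latin1)[, 1:3])
```

Switching the session locale with 'Sys.setlocale("LC_CTYPE", ...)' to a Latin-1 locale would be closer to the Windows setup, but that depends on which locales are installed on the Linux machine.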

Matthew
