Karl Ove Hufthammer
2011-Jan-21 09:47 UTC
[Rd] match function causing bad performance when using table function on factors with multibyte characters on Windows
[I originally posted this on the R-help mailing list, and it was suggested that R-devel would be a better place to discuss it.]

Running 'table' on a factor with levels containing non-ASCII characters seems to result in extremely bad performance on Windows. Here's a simple example with benchmark results (I've reduced the number of replications to make the function finish within reasonable time):

  library(rbenchmark)
  x.num=sample(1:2, 10^5, replace=TRUE)
  x.fac.ascii=factor(x.num, levels=1:2, labels=c("A","B"))
  x.fac.nascii=factor(x.num, levels=1:2, labels=c("?","?"))
  benchmark( table(x.num), table(x.fac.ascii), table(x.fac.nascii),
             table(unclass(x.fac.nascii)), replications=20 )

                            test replications elapsed   relative user.self sys.self user.child sys.child
  4 table(unclass(x.fac.nascii))           20    1.53   4.636364      1.51     0.01         NA        NA
  2           table(x.fac.ascii)           20    0.33   1.000000      0.33     0.00         NA        NA
  3          table(x.fac.nascii)           20  146.67 444.454545     38.52    81.74         NA        NA
  1                 table(x.num)           20    1.55   4.696970      1.53     0.01         NA        NA

  sessionInfo()
  R version 2.12.1 (2010-12-16)
  Platform: i386-pc-mingw32/i386 (32-bit)

  locale:
  [1] LC_COLLATE=Norwegian-Nynorsk_Norway.1252
      LC_CTYPE=Norwegian-Nynorsk_Norway.1252
      LC_MONETARY=Norwegian-Nynorsk_Norway.1252
  [4] LC_NUMERIC=C
      LC_TIME=Norwegian-Nynorsk_Norway.1252

  attached base packages:
  [1] stats     graphics  grDevices datasets  utils     methods   base

  other attached packages:
  [1] rbenchmark_0.3

The timings are from R 2.12.1, but I also get comparable results on the latest prerelease (R 2.13.0 2011-01-18 r54032).
Running the same test (100 replications) on a Linux system with R 2.12.1 Patched results in essentially no difference between the performance on ASCII factors and non-ASCII factors:

                            test replications elapsed relative user.self sys.self user.child sys.child
  4 table(unclass(x.fac.nascii))          100   4.607 3.096102     4.455    0.092          0         0
  2           table(x.fac.ascii)          100   1.488 1.000000     1.459    0.028          0         0
  3          table(x.fac.nascii)          100   1.616 1.086022     1.560    0.051          0         0
  1                 table(x.num)          100   4.504 3.026882     4.403    0.079          0         0

  sessionInfo()
  R version 2.12.1 Patched (2011-01-18 r54033)
  Platform: i686-pc-linux-gnu (32-bit)

  locale:
   [1] LC_CTYPE=nn_NO.UTF-8       LC_NUMERIC=C               LC_TIME=nn_NO.UTF-8
   [4] LC_COLLATE=nn_NO.UTF-8     LC_MONETARY=C              LC_MESSAGES=nn_NO.UTF-8
   [7] LC_PAPER=nn_NO.UTF-8       LC_NAME=C                  LC_ADDRESS=C
  [10] LC_TELEPHONE=C             LC_MEASUREMENT=nn_NO.UTF-8 LC_IDENTIFICATION=C

  attached base packages:
  [1] stats     graphics  grDevices utils     datasets  methods   base

  other attached packages:
  [1] rbenchmark_0.3

Profiling the 'table' function indicates that almost all the time is spent in the 'match' function, which is used when 'factor' is called on a 'factor' inside 'table'. Indeed, 'x.fac.nascii = factor(x.fac.nascii)' by itself is extremely slow.

Is there any theoretical reason 'factor' on 'factor' with non-ASCII characters must be so slow? And why doesn't this happen on Linux?

Perhaps a fix for 'table' might be calculating the 'table' statistics *including* all levels (not using the 'factor' function anywhere), and then removing the 'exclude' levels in the end. For example, something along these lines:

  res = table.modified.to.not.use.factor(...)
  ind = lapply(dimnames(res), function(x) !(x %in% exclude))
  do.call("[", c(list(res), ind, drop=FALSE))

(I haven't tested this very much, so there may be issues with this way of doing things.)

-- 
Karl Ove Hufthammer
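The idea of counting over all levels first and dropping the 'exclude' levels afterwards can be sketched for the single-factor case as below. This is only a minimal illustration, not a patch for 'table': the helper name table_no_refactor is hypothetical, it handles one factor only, and it relies on tabulate() counting the integer codes so that neither factor() nor match() is called on the data. The non-ASCII labels are written as \u escapes and are illustrative only.

```r
# Hypothetical helper (not part of base R): count the levels of one factor
# without calling factor() on it again, then drop the excluded levels.
table_no_refactor <- function(f, exclude = NULL) {
  stopifnot(is.factor(f))
  lev <- levels(f)
  # tabulate() counts the integer codes directly; NA codes are ignored.
  counts <- tabulate(unclass(f), nbins = length(lev))
  # Remove the excluded levels at the end; only the (few) level strings
  # are compared here, never the full data vector.
  keep <- !(lev %in% exclude)
  res <- array(counts[keep], dim = sum(keep), dimnames = list(lev[keep]))
  class(res) <- "table"
  res
}

# Illustrative data: labels chosen for this sketch only.
x.num <- sample(1:2, 10^5, replace = TRUE)
x.fac.nascii <- factor(x.num, levels = 1:2, labels = c("\u00e6", "\u00f8"))
print(table_no_refactor(x.fac.nascii))
print(table_no_refactor(x.fac.nascii, exclude = "\u00f8"))
```

Unlike table(), this sketch does not name the dimension after its argument and does not handle NA as a separate category, so it is a starting point for timing comparisons rather than a drop-in replacement.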
Matthew Dowle
2011-Jan-24 17:30 UTC
[Rd] match function causing bad performance when using table function on factors with multibyte characters on Windows
I'm not sure, but note the difference in locale between Linux (UTF-8) and Windows (non-UTF-8). As far as I understand it, R much prefers UTF-8, which Windows doesn't natively support. Otherwise you could just change your Windows locale to a UTF-8 locale to make R happier.

My stab in the dark would be that the poor performance on Windows in this case may be down to many calls to translateCharUTF8 internally. There was a change in R 2.12.0 in this area. Running your test in R 2.11.1 on Windows shows the same problem, though, so it doesn't look like that change caused this problem.

From NEWS 2.12.0:

    o unique() and match() are now faster on character vectors where
      all elements are in the global CHARSXP cache and have unmarked
      encoding (ASCII). Thanks to Matthew Dowle for suggesting
      improvements to the way the hash code is generated in 'unique.c'.

If anybody knows a way to trick R on Linux into thinking it has an encoding similar to Windows, then I may be able to take a look, if I can reproduce the problem on Linux.

Matthew

"Karl Ove Hufthammer" <karl at huftis.org> wrote in message news:ihbko3$efs$1 at dough.gmane.org...
> [I originally posted this on the R-help mailing list, and it was suggested
> that R-devel would be a better place to discuss it.]
>
> Running 'table' on a factor with levels containing non-ASCII characters
> seems to result in extremely bad performance on Windows.
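One way to look into the encoding guess in the reply is to check how the level strings are marked and to time match() directly on ASCII and non-ASCII character vectors. The following is a rough sketch only: the labels are illustrative \u escapes, the variable names are made up for this example, and timings will differ by platform and locale, so no particular numbers should be expected.

```r
# Illustrative setup; names and labels are assumptions for this sketch.
set.seed(1)
codes <- sample(1:2, 10^5, replace = TRUE)
lab.ascii  <- c("A", "B")
lab.nascii <- c("\u00e6", "\u00f8")

# Character vectors as factor() would see them after as.character().
s.ascii  <- lab.ascii[codes]
s.nascii <- lab.nascii[codes]

# "unknown" means no encoding mark (ASCII or native encoding);
# "latin1" or "UTF-8" means the string carries an explicit mark.
print(Encoding(lab.ascii))
print(Encoding(lab.nascii))

# Time the step that profiling pointed at.
t.ascii  <- system.time(m.ascii  <- match(s.ascii,  lab.ascii))
t.nascii <- system.time(m.nascii <- match(s.nascii, lab.nascii))
print(rbind(ascii = t.ascii, nascii = t.nascii)[, 1:3])
```

If the two timings differ greatly on one platform but not on another, that would be consistent with the slowdown sitting in how match() compares strings with differing encodings, though it would not by itself confirm the translateCharUTF8 explanation.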