Brent
2016-Jul-19 13:40 UTC
[R] Has R recently made performance improvements in accumulation?
Subtitle: or, more likely, am I benchmarking wrong?

I am new to R, but I have read that R is a hive of performance pitfalls. A very common case is accumulating results into a typical immutable R data structure.

Exhibit A for this confusion is this StackOverflow question on an algorithm for O(1) performance in appending to a list:
https://stackoverflow.com/questions/2436688/append-an-object-to-a-list-in-r-in-amortized-constant-time-o1
The original question was asked on 2010-03-13, and responses have trickled in for at least the next 5 years.

Before getting to my own problem, I instead offer an example from someone vastly better at R than I am, which should therefore be free of mistakes. Consider this interesting article on efficient accumulation in R:
http://www.win-vector.com/blog/2015/07/efficient-accumulation-in-r/

In that article's first example, it claims that the "obvious 'for-loop' solution" (accumulation into a data.frame using rbind):

    is incredibly slow. ... we pay the cost of processing n*(n+1)/2 rows of data

In other words, what looks like an O(n) algorithm is actually O(n^2).

Sounds awful. But when I tried executing their example on my machine, I see O(n), NOT O(n^2), behavior!

Here is the complete code that I executed:

mkFrameForLoop <- function(nRow, nCol) {
  d <- c()
  for (i in seq_len(nRow)) {
    ri <- mkRow(nCol)
    di <- data.frame(ri, stringsAsFactors = FALSE)
    d <- rbind(d, di)
  }
  d
}

mkRow <- function(nCol) {
  x <- as.list(rnorm(nCol))
  # make row mixed types by changing first column to string
  x[[1]] <- ifelse(x[[1]] > 0, 'pos', 'neg')
  names(x) <- paste('x', seq_len(nCol), sep = '.')
  x
}

t1 = Sys.time()
x = mkFrameForLoop(100, 10)
tail(x)
t2 = Sys.time()
t2 - t1

t1 = Sys.time()
x = mkFrameForLoop(200, 10)
tail(x)
t2 = Sys.time()
t2 - t1

t1 = Sys.time()
x = mkFrameForLoop(400, 10)
tail(x)
t2 = Sys.time()
t2 - t1

And here is what I got for the execution times:

Time difference of 0.113005876541138 secs
Time difference of 0.205012083053589 secs
Time difference of 0.390022993087769 secs

That's a linear, not quadratic, increase in execution time as a function of nRow! It is NOT what I was expecting, although it was nice to see.

Notes:
--the functions above are copied and pasted from the article
--the execution time measurement code is all that I added
--yes, that time measurement code is crude
--I am aware of R benchmarking frameworks; in fact, I first tried using some of them
--but when I got unexpected results, I went back to the simplest code possible, which is the above
--it turns out that I get the same results regardless of how I measure
--each measurement doubles the number of rows (from 100 to 200 to 400), which is the growth that should bring out rbind's allegedly bad behavior
--my machine has a Xeon E3 1245, 8 GB of RAM, running Win 7 Pro 64 bit, using R 3.3.1 (the latest release as of today)

Given those unexpected results, you now understand the title of my post. So, has R's runtime somehow gotten greater intelligence recently, so that it knows when it does not need to copy a data structure? Maybe in the case above it does a quick shallow copy instead of a deep copy? Or am I benchmarking wrong?

What motivated my investigation is that I want to store a bunch of Monte Carlo simulation results in a numeric matrix. When I am finished with my code, I know that it will be a massive matrix (e.g. 1,000,000 or more rows), so I was concerned about performance and went about benchmarking. Let me know if you want to see that benchmark code. I found that assignment into a matrix does not seem to generate a copy, even though assignment is a mutating operation, so I was worried that it might. But when I investigated deeper, such as with the code above, I became worried that I cannot trust my results.

I look forward to any feedback that you can give. I would especially appreciate any links that explain how to determine what the R runtime actually does.
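One way to see directly what the runtime does with copies (a minimal sketch, not code from the thread, using base R's tracemem): tracemem() prints a message every time the traced object is duplicated, so an assignment that produces no tracemem output was performed in place.

m <- matrix(0, nrow = 5, ncol = 5)
tracemem(m)    # start reporting duplications of m (needs an R build with memory profiling, the default for CRAN binaries)
m[1, 1] <- 1   # a "tracemem[...]" line appears only if this assignment copied m
m[2, 2] <- 2   # no output here means the matrix was modified in place
untracemem(m)

The same check scales to a large Monte Carlo matrix: trace it once, run a few row assignments, and count how many duplications are reported.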
Thierry Onkelinx
2016-Jul-19 14:27 UTC
[R] Has R recently made performance improvements in accumulation?
Dear Brent,

I can confirm your timings with

library(microbenchmark)
microbenchmark(
  mkFrameForLoop(100, 10),
  mkFrameForLoop(200, 10),
  mkFrameForLoop(400, 10)
)

but profiling your code shows that rbind uses only a small fraction of the CPU time spent in the function:

profvis::profvis({mkFrameForLoop(100, 10)})

So I stripped your example down to the function below. Now rbind accounts for most of the CPU time, and the timings indicate an O(n^2) relation.

minimal <- function(rows, cols) {
  x <- matrix(NA_integer_, ncol = cols, nrow = 0)
  for (i in seq_len(rows)) {
    x <- rbind(x, rep(i, cols))
  }
  x
}

profvis::profvis({minimal(1000, 100)})

timing <- microbenchmark(
  X50 = minimal(50, 50),
  X100 = minimal(100, 50),
  X200 = minimal(200, 50),
  X400 = minimal(400, 50),
  X800 = minimal(800, 50),
  X1600 = minimal(1600, 50)
)
timing
Unit: microseconds
  expr        min         lq        mean     median          uq        max neval cld
   X50    199.006    212.278    233.8444    235.728    247.3770    296.987   100  a
  X100    565.693    593.957    827.8733    618.835    640.1925   2950.139   100  a
  X200   1804.059   1876.390   2166.1106   1903.370   1938.7115   4263.967   100  a
  X400   6453.913   8755.848   8546.4339   8890.884   8961.7535  13259.024   100  a
  X800  30575.048  32913.186  36555.0118  33093.243  34620.5895 178740.765   100  b
 X1600 130976.429 133674.679 151494.6492 135197.087 137327.1235 292291.385   100  c

Fitting polynomial models confirms this: in the model for time the quadratic term is strongly significant, while sqrt(time) is essentially linear in N, both of which are consistent with quadratic growth.

timing$N <- as.integer(gsub("X", "", levels(timing$expr)))[timing$expr]
model <- lm(time ~ poly(N, 4), data = timing)
summary(model)

Call:
lm(formula = time ~ poly(N, 4), data = timing)

Residuals:
      Min        1Q    Median        3Q       Max
-20518162  -3378940   -130815    -45881 142183951

Coefficients:
              Estimate Std. Error t value Pr(>|t|)
(Intercept)   33303987     843350  39.490   <2e-16 ***
poly(N, 4)1 1286962014   20657783  62.299   <2e-16 ***
poly(N, 4)2  338770077   20657783  16.399   <2e-16 ***
poly(N, 4)3     222734   20657783   0.011    0.991
poly(N, 4)4   -2260902   20657783  -0.109    0.913
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 20660000 on 595 degrees of freedom
Multiple R-squared:  0.8746,    Adjusted R-squared:  0.8738
F-statistic:  1038 on 4 and 595 DF,  p-value: < 2.2e-16

newdata <- data.frame(N = pretty(timing$N, 40))
newdata$time <- predict(model, newdata = newdata)
plot(newdata$N, newdata$time, type = "l")
plot(newdata$N, sqrt(newdata$time), type = "l")

model2 <- lm(sqrt(time) ~ poly(N, 4), data = timing)
summary(model2)

Call:
lm(formula = sqrt(time) ~ poly(N, 4), data = timing)

Residuals:
   Min     1Q Median     3Q    Max
-756.3 -202.8  -54.7   -5.5 7416.5

Coefficients:
             Estimate Std. Error t value Pr(>|t|)
(Intercept)   3980.36      33.13 120.160  < 2e-16 ***
poly(N, 4)1 100395.40     811.41 123.730  < 2e-16 ***
poly(N, 4)2   2191.34     811.41   2.701  0.00712 **
poly(N, 4)3   -803.54     811.41  -0.990  0.32243
poly(N, 4)4     82.09     811.41   0.101  0.91945
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 811.4 on 595 degrees of freedom
Multiple R-squared:  0.9626,    Adjusted R-squared:  0.9624
F-statistic:  3829 on 4 and 595 DF,  p-value: < 2.2e-16

ir. Thierry Onkelinx
Instituut voor natuur- en bosonderzoek / Research Institute for Nature and Forest
team Biometrie & Kwaliteitszorg / team Biometrics & Quality Assurance
Kliniekstraat 25
1070 Anderlecht
Belgium

To call in the statistician after the experiment is done may be no more than asking him to perform a post-mortem examination: he may be able to say what the experiment died of. ~ Sir Ronald Aylmer Fisher
The plural of anecdote is not data. ~ Roger Brinner
The combination of some data and an aching desire for an answer does not ensure that a reasonable answer can be extracted from a given body of data. ~ John Tukey
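A sketch of another way to read the same timings (not part of the original reply; it assumes the timing object and its N column created above): time proportional to N^k appears as a straight line with slope k on a log-log scale, so regressing log(time) on log(N) estimates the exponent directly.

loglog <- lm(log(time) ~ log(N), data = timing)
coef(loglog)[["log(N)"]]               # a slope near 2 points to O(n^2); near 1 would point to O(n)
plot(log(timing$N), log(timing$time))  # points should fall close to a straight line
abline(loglog)

At small sizes the fixed per-iteration overhead (building the row, calling data.frame) dominates the copying cost, which is why the 100 to 400 row runs in the original post can look linear even though the asymptotic behavior is quadratic.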
boB Rudis
2016-Jul-19 14:46 UTC
[R] Has R recently made performance improvements in accumulation?
Ideally, you would use a more functional programming approach:

minimal <- function(rows, cols) {
  x <- matrix(NA_integer_, ncol = cols, nrow = 0)
  for (i in seq_len(rows)) {
    x <- rbind(x, rep(i, cols))
  }
  x
}

minimaly <- function(rows, cols) {
  do.call(rbind, lapply(seq_len(rows), rep, cols))
}

identical(minimal(100, 100), minimaly(100, 100))
# [1] TRUE

microbenchmark(
  .for = minimal(100, 100),
  .lap = minimaly(100, 100)
)
## Unit: microseconds
##  expr     min        lq      mean   median       uq      max neval cld
##  .for 943.936 1062.3710 1416.1399 1120.259 1366.860 4655.322   100   b
##  .lap 111.566  118.1945  160.9058  124.520  146.991 2862.391   100   a
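For the Monte Carlo use case in the original post, a sketch of the preallocation alternative (not code from the thread; minimal_prealloc is a made-up name, and the comparison reuses minimaly from above): when the final number of rows is known, allocate the matrix once and assign each row in place, so nothing is grown or re-copied inside the loop.

minimal_prealloc <- function(rows, cols) {
  x <- matrix(NA_integer_, nrow = rows, ncol = cols)  # allocate the final size up front
  for (i in seq_len(rows)) {
    x[i, ] <- rep(i, cols)  # fill row i in place; no rbind, no growing object
  }
  x
}

identical(minimal_prealloc(100, 100), minimaly(100, 100))  # check it builds the same matrix as the versions above

This does only O(n) work and is the usual approach for a result like the 1,000,000-row simulation matrix described in the original post.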