raphael.felber at agroscope.admin.ch
2017-Aug-22 12:53 UTC
[R] How to benchmark speed of load/readRDS correctly
Dear all

I was thinking about efficient ways of reading data into R and tried to test whether load('file.Rdata') or readRDS('file.rds') is faster. The files file.Rdata and file.rds contain the same data, the first created with save(d, file = 'file.Rdata', compress = FALSE) and the second with saveRDS(d, 'file.rds', compress = FALSE).

First I used the function microbenchmark() and was astonished by the max value of the output.

FIRST TEST:
> library(microbenchmark)
> fl1 <- 'file.rds'; fl2 <- 'file.Rdata'
> microbenchmark(
+   n <- readRDS(fl1),
+   load(fl2)
+ )
Unit: milliseconds
              expr      min       lq     mean   median       uq       max neval
 n <- readRDS(fl1) 106.5956 109.6457 237.3844 117.8956 141.9921 10934.162   100
         load(fl2) 295.0654 301.8162 335.6266 308.3757 319.6965  1915.706   100

It looks like the max value is an outlier. So I tried:

SECOND TEST:
> sapply(1:10, function(x) system.time(n <- readRDS('file.rds'))[3])
elapsed elapsed elapsed elapsed elapsed elapsed elapsed elapsed elapsed elapsed
  10.50    0.11    0.11    0.11    0.10    0.11    0.11    0.11    0.12    0.12
> sapply(1:10, function(x) system.time(load('file.Rdata'))[3])
elapsed elapsed elapsed elapsed elapsed elapsed elapsed elapsed elapsed elapsed
   1.86    0.29    0.31    0.30    0.30    0.31    0.30    0.29    0.31    0.30

This confirmed my suspicion: loading the data the first time takes much longer than the following times. I suspect this has something to do with how the data are assigned, and that R doesn't have to 'fully' read the data when it is read a second time.

So the question remains: how can I make a realistic benchmark test? From the first test I would conclude that reading the *.rds file is faster, but this holds only for a large number of evaluations (neval). If I set times = 1, then reading the *.Rdata file would be faster (as also indicated by the second test).

Thanks for any help or comments.

Kind regards

Raphael
------------------------------------------------------------------------------------
Raphael Felber, PhD
Scientific Officer, Climate & Air Pollution

Federal Department of Economic Affairs,
Education and Research EAER
Agroscope
Research Division, Agroecology and Environment

Reckenholzstrasse 191, CH-8046 Zürich
Phone +41 58 468 75 11
Fax +41 58 468 72 01
raphael.felber at agroscope.admin.ch
www.agroscope.ch
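[Since the thread never shows how the data object d was created, here is a minimal, self-contained sketch of the setup being benchmarked; the data frame below is purely illustrative, and the median-based summary is one hedged answer to the "realistic benchmark" question, since the median is far less sensitive than the mean or max to one-off outliers:]

## Illustrative stand-in for the original (unshown) data 'd'.
d <- data.frame(x = rnorm(1e6), y = rnorm(1e6))
save(d, file = 'file.Rdata', compress = FALSE)
saveRDS(d, 'file.rds', compress = FALSE)

## Warm-cache comparison; compare medians rather than means or maxima.
library(microbenchmark)
mb <- microbenchmark(
  rds   = readRDS('file.rds'),
  Rdata = load('file.Rdata'),
  times = 100
)
summary(mb)[, c('expr', 'median')]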
You need to study how reading files works in your operating system. This question is not about R.
--
Sent from my phone. Please excuse my brevity.

On August 22, 2017 5:53:09 AM PDT, raphael.felber at agroscope.admin.ch wrote:
> So the question remains: how can I make a realistic benchmark test?
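[To make Jeff's point concrete: the first read in the second test is slow because the file is not yet in the operating system's page cache; subsequent reads are served from memory. A hypothetical, Linux-only sketch that drops the page cache before each read, so that every timing is a genuine cold read, might look like this (it requires root privileges, and the drop_caches mechanism is specific to Linux):]

## Hypothetical Linux-only cold-read timing; requires root.
cold_read <- function(path) {
  ## Flush dirty pages, then ask the kernel to drop its page cache.
  system("sync; sudo sh -c 'echo 3 > /proc/sys/vm/drop_caches'")
  system.time(readRDS(path))[['elapsed']]
}
sapply(1:10, function(i) cold_read('file.rds'))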
Not convinced Jeff is completely right about this not concerning R, since I've found that the application language (R, perl, etc.) makes a difference in how files are accessed by the OS. He is certainly correct that the OS (and its version) is where the actual reading and writing happens, but sometimes the call into it can be inefficient. (Sorry, I've not got examples specifically for file reads, but I had a case in computation where there was an 80000%, i.e., 800-fold, difference in timing with R, which rather took my breath away. That's probably been sorted now.)

The difficulty in making general statements is that a rather full set of comparisons over different commands, datasets, OSes and version variants is needed before a general picture can emerge. Using microbenchmark when you need to find the bottlenecks is how I'd proceed, which is what the OP is doing.

About 30 years ago, I did write up some preliminary work, never published, on estimating the two halves of a copy, that is, the reading from file and the storing to "memory" or a different storage location. This was via regression with a singular design matrix, but one can get a minimal-length least-squares solution via the SVD. It is possibly relevant today for trying to get at slow links on a network.

JN

On 2017-08-22 09:07 AM, Jeff Newmiller wrote:
> You need to study how reading files works in your operating system. This question is not about R.
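[For readers curious about the SVD idea JN mentions: when the design matrix is singular, ordinary least squares has no unique solution, but the SVD pseudoinverse picks the solution of minimal length. A minimal sketch, with an illustrative function name and tolerance:]

## Minimal-norm least-squares solution of X b = y via the SVD
## pseudoinverse; near-zero singular values are treated as zero.
min_norm_lsq <- function(X, y, tol = 1e-8) {
  s <- svd(X)
  keep <- s$d > tol * s$d[1]
  s$v[, keep, drop = FALSE] %*%
    ((t(s$u[, keep, drop = FALSE]) %*% y) / s$d[keep])
}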
The large value for the maximum time may be due to garbage collection, which happens periodically. E.g., try the following, where the unlist(as.list()) creates a lot of garbage. I get a very large time every 102 or 51 iterations and a moderately large time more often:

mb <- microbenchmark::microbenchmark({
    x <- as.list(sin(1:5e5)); x <- unlist(x) / cos(1:5e5); sum(x)
  }, times = 1000)
plot(mb$time)
quantile(mb$time * 1e-6, c(0, .5, .75, .90, .95, .99, 1))
#        0%       50%       75%       90%       95%       99%      100%
#  59.04446  82.15453 102.17522 180.36986 187.52667 233.42062 249.33970
diff(which(mb$time > quantile(mb$time, .99)))
# [1] 102  51 102 102 102 102 102 102  51
diff(which(mb$time > quantile(mb$time, .95)))
# [1]  6 41  4 47  4 40  7  4 47  4 33 14  4 47  4 47  4 47  4 47  4 47  4  6 41
#[26]  4  6  7  9 25  4 47  4 47  4 47  4 22 25  4 33 14  4  6 41  4 47  4 22

Bill Dunlap
TIBCO Software
wdunlap tibco.com

On Tue, Aug 22, 2017 at 5:53 AM, <raphael.felber at agroscope.admin.ch> wrote:
> So the question remains: how can I make a realistic benchmark test?
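[One hypothetical way to corroborate the garbage-collection explanation is to turn on R's GC reporting and watch whether the slow iterations coincide with collections; gcinfo() is base R, and the replicate count below is arbitrary:]

## Print a message to the console each time the garbage collector runs.
gcinfo(TRUE)
invisible(replicate(20, { x <- as.list(sin(1:5e5)); sum(unlist(x)) }))
gcinfo(FALSE)   # turn reporting back off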
Note that if you force a garbage collection on each iteration the times are more stable. However, on average it is faster to let the garbage collector decide when to leap into action.

mb_gc <- microbenchmark::microbenchmark(gc(), {
    x <- as.list(sin(1:5e5)); x <- unlist(x) / cos(1:5e5); sum(x)
  }, times = 1000, control = list(order = "inorder"))
with(mb_gc, plot(time[expr != "gc()"]))
with(mb_gc, quantile(1e-6 * time[expr != "gc()"], c(0, .5, .75, .9, .95, .99, 1)))
#        0%       50%       75%       90%       95%       99%      100%
#  59.33450  61.33954  63.43457  66.23331  68.93746  74.45629 158.09799

Bill Dunlap
TIBCO Software
wdunlap tibco.com

On Tue, Aug 22, 2017 at 9:26 AM, William Dunlap <wdunlap at tibco.com> wrote:
> The large value for the maximum time may be due to garbage collection,
> which happens periodically.
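[As a small follow-up to Bill's paired benchmark: a microbenchmark result is a data frame with expr and time columns, with times in nanoseconds, so the two expressions' typical costs can be compared directly:]

## Median time per expression, converted to milliseconds.
aggregate(time ~ expr, data = mb_gc, FUN = function(t) median(t) * 1e-6)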