raphael.felber at agroscope.admin.ch
2017-Aug-22 12:53 UTC
[R] How to benchmark speed of load/readRDS correctly
Dear all

I was thinking about efficient ways of reading data into R and tried to test whether load('file.Rdata') or readRDS('file.rds') is faster. The files file.Rdata and file.rds contain the same data, the first created with save(d, file = 'file.Rdata', compress = FALSE) and the second with saveRDS(d, 'file.rds', compress = FALSE).

First I used the function microbenchmark() and was astonished by the max value of the output.

FIRST TEST:
> library(microbenchmark)
> fl1 <- 'file.rds'; fl2 <- 'file.Rdata'
> microbenchmark(
+   n <- readRDS(fl1),
+   load(fl2)
+ )
Unit: milliseconds
              expr      min       lq     mean   median       uq       max neval
 n <- readRDS(fl1) 106.5956 109.6457 237.3844 117.8956 141.9921 10934.162   100
         load(fl2) 295.0654 301.8162 335.6266 308.3757 319.6965  1915.706   100

It looks like the max value is an outlier. So I tried:

SECOND TEST:
> sapply(1:10, function(x) system.time(n <- readRDS('file.rds'))[3])
elapsed elapsed elapsed elapsed elapsed elapsed elapsed elapsed elapsed elapsed
  10.50    0.11    0.11    0.11    0.10    0.11    0.11    0.11    0.12    0.12
> sapply(1:10, function(x) system.time(load('file.Rdata'))[3])
elapsed elapsed elapsed elapsed elapsed elapsed elapsed elapsed elapsed elapsed
   1.86    0.29    0.31    0.30    0.30    0.31    0.30    0.29    0.31    0.30

This confirmed my suspicion: loading the data the first time takes much longer than the following times. I suspect this has something to do with how the data are assigned, and that R doesn't have to 'fully' read the data when it is read a second time.

So the question remains: how can I make a realistic benchmark test? From the first test I would conclude that reading the *.rds file is faster, but this holds only for a large number of evaluations (neval). If I set times = 1, then reading the *.Rdata file would be faster (as also indicated by the second test).

Thanks for any help or comments.

Kind regards

Raphael
------------------------------------------------------------------------------------
Raphael Felber, PhD
Scientific Officer, Climate & Air Pollution

Federal Department of Economic Affairs,
Education and Research EAER
Agroscope
Research Division, Agroecology and Environment

Reckenholzstrasse 191, CH-8046 Zürich
Phone +41 58 468 75 11
Fax +41 58 468 72 01
raphael.felber at agroscope.admin.ch
www.agroscope.ch
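[Since the thread never shows how the data object d was created, here is a minimal, self-contained sketch of the setup being benchmarked; the data frame below is purely illustrative, and the median-based summary is one hedged answer to the "realistic benchmark" question, since the median is far less sensitive than the mean or max to one-off outliers:]

## Illustrative stand-in for the original (unshown) data 'd'.
d <- data.frame(x = rnorm(1e6), y = rnorm(1e6))
save(d, file = 'file.Rdata', compress = FALSE)
saveRDS(d, 'file.rds', compress = FALSE)

## Warm-cache comparison; compare medians rather than means or maxima.
library(microbenchmark)
mb <- microbenchmark(
  rds   = readRDS('file.rds'),
  Rdata = load('file.Rdata'),
  times = 100
)
summary(mb)[, c('expr', 'median')]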
You need to study how reading files works in your operating system. This question is not about R.
--
Sent from my phone. Please excuse my brevity.

On August 22, 2017 5:53:09 AM PDT, raphael.felber at agroscope.admin.ch wrote:
> So the question remains: how can I make a realistic benchmark test?
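[To make Jeff's point concrete: the first read in the second test is slow because the file is not yet in the operating system's page cache; subsequent reads are served from memory. A hypothetical, Linux-only sketch that drops the page cache before each read, so that every timing is a genuine cold read, might look like this (it requires root privileges, and the drop_caches mechanism is specific to Linux):]

## Hypothetical Linux-only cold-read timing; requires root.
cold_read <- function(path) {
  ## Flush dirty pages, then ask the kernel to drop its page cache.
  system("sync; sudo sh -c 'echo 3 > /proc/sys/vm/drop_caches'")
  system.time(readRDS(path))[['elapsed']]
}
sapply(1:10, function(i) cold_read('file.rds'))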
Not convinced Jeff is completely right about this not concerning R, since I've found that the application language (R, perl, etc.) makes a difference in how files are accessed by the OS. He is certainly correct that the OS (and its version) is where the actual reading and writing happens, but sometimes the call into it can be inefficient. (Sorry, I've not got examples specifically for file reads, but I had a case in computation where there was an 80000%, i.e., 800-fold, difference in timing with R, which rather took my breath away. That's probably been sorted now.)

The difficulty in making general statements is that a rather full set of comparisons over different commands, datasets, OSes and version variants is needed before a general picture can emerge. Using microbenchmark when you need to find the bottlenecks is how I'd proceed, which is what the OP is doing.

About 30 years ago, I did write up some preliminary work, never published, on estimating the two halves of a copy, that is, the reading from file and the storing to "memory" or a different storage location. This was via regression with a singular design matrix, but one can get a minimal-length least-squares solution via the SVD. It is possibly relevant today for trying to get at slow links on a network.

JN

On 2017-08-22 09:07 AM, Jeff Newmiller wrote:
> You need to study how reading files works in your operating system. This question is not about R.
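[For readers curious about the SVD idea JN mentions: when the design matrix is singular, ordinary least squares has no unique solution, but the SVD pseudoinverse picks the solution of minimal length. A minimal sketch, with an illustrative function name and tolerance:]

## Minimal-norm least-squares solution of X b = y via the SVD
## pseudoinverse; near-zero singular values are treated as zero.
min_norm_lsq <- function(X, y, tol = 1e-8) {
  s <- svd(X)
  keep <- s$d > tol * s$d[1]
  s$v[, keep, drop = FALSE] %*%
    ((t(s$u[, keep, drop = FALSE]) %*% y) / s$d[keep])
}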
The large value for the maximum time may be due to garbage collection, which happens periodically. E.g., try the following, where the unlist(as.list()) creates a lot of garbage. I get a very large time every 102 or 51 iterations and a moderately large time more often:

mb <- microbenchmark::microbenchmark({
    x <- as.list(sin(1:5e5)); x <- unlist(x) / cos(1:5e5); sum(x)
  }, times = 1000)
plot(mb$time)
quantile(mb$time * 1e-6, c(0, .5, .75, .90, .95, .99, 1))
#        0%       50%       75%       90%       95%       99%      100%
#  59.04446  82.15453 102.17522 180.36986 187.52667 233.42062 249.33970
diff(which(mb$time > quantile(mb$time, .99)))
# [1] 102  51 102 102 102 102 102 102  51
diff(which(mb$time > quantile(mb$time, .95)))
# [1]  6 41  4 47  4 40  7  4 47  4 33 14  4 47  4 47  4 47  4 47  4 47  4  6 41
#[26]  4  6  7  9 25  4 47  4 47  4 47  4 22 25  4 33 14  4  6 41  4 47  4 22

Bill Dunlap
TIBCO Software
wdunlap tibco.com

On Tue, Aug 22, 2017 at 5:53 AM, <raphael.felber at agroscope.admin.ch> wrote:
> So the question remains: how can I make a realistic benchmark test?
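[One hypothetical way to corroborate the garbage-collection explanation is to turn on R's GC reporting and watch whether the slow iterations coincide with collections; gcinfo() is base R, and the replicate count below is arbitrary:]

## Print a message to the console each time the garbage collector runs.
gcinfo(TRUE)
invisible(replicate(20, { x <- as.list(sin(1:5e5)); sum(unlist(x)) }))
gcinfo(FALSE)   # turn reporting back off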
Note that if you force a garbage collection on each iteration the times are more stable. However, on average it is faster to let the garbage collector decide when to leap into action.

mb_gc <- microbenchmark::microbenchmark(gc(), {
    x <- as.list(sin(1:5e5)); x <- unlist(x) / cos(1:5e5); sum(x)
  }, times = 1000, control = list(order = "inorder"))
with(mb_gc, plot(time[expr != "gc()"]))
with(mb_gc, quantile(1e-6 * time[expr != "gc()"], c(0, .5, .75, .9, .95, .99, 1)))
#        0%       50%       75%       90%       95%       99%      100%
#  59.33450  61.33954  63.43457  66.23331  68.93746  74.45629 158.09799

Bill Dunlap
TIBCO Software
wdunlap tibco.com

On Tue, Aug 22, 2017 at 9:26 AM, William Dunlap <wdunlap at tibco.com> wrote:
> The large value for the maximum time may be due to garbage collection,
> which happens periodically.
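[As a small follow-up to Bill's paired benchmark: a microbenchmark result is a data frame with expr and time columns, with times in nanoseconds, so the two expressions' typical costs can be compared directly:]

## Median time per expression, converted to milliseconds.
aggregate(time ~ expr, data = mb_gc, FUN = function(t) median(t) * 1e-6)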