Vadim Ogranovich
2005-Apr-27  03:27 UTC
[Rd] RE: [R] when can we expect Prof Tierney's compiled R?
Luke, Thank you for sharing the benchmark results. The improvement is very substantial, I am looking forward to the release of the byte compiler! The arithmetic shows that x[i]<- is still the bottleneck. I suspect that this is due to a very involved dispatching/search for the appropriate function on the C level. There might be significant gain if loops somehow cached the result of the initial dispatching. This is what you probably referred to as additional improvements in the engine itself. Though the examples are very simple, I think they are typical of any simulation of dynamic systems where the new state is computed as a transformation of the previous state. In my example, the f1(), the state was a scalar and the transformation was the identity. In any case the timing of the byte-compilation is remarkable! Thanks, Vadim> -----Original Message----- > From: Luke Tierney [mailto:luke@stat.uiowa.edu] > Sent: Tuesday, April 26, 2005 5:50 PM > To: Vadim Ogranovich > Cc: Peter Dalgaard; Jason Liao; r-devel@stat.math.ethz.ch > Subject: RE: [R] when can we expect Prof Tierney's compiled R? > > For what it's worth (probably not much as these simple > benchmarks are rarely representative of real code and so need > to be taken with a huge chunk of salt) here is what happens > with your examples in R 2.1.0 with the current byte compiler. > > Define your examples as functions: > > n = 1e6; iA = seq(2,n); x = double(n); > f1 <- function(x, iA) for (i in iA) x[i] = x[i-1] > f2 <- function(x, iA) for (i in iA) x[i-1] > f3 <- function(x, iA) for (i in iA) 1 > f4 <- function(x, iA) for (i in iA) x[i] = 1.0 > f5 <- function(x, iA) for (i in iA) i-1 > f6 <- function(x, iA) for (i in iA) i > > Make byte compiled versions: > > f1c <- cmpfun(f1) > f2c <- cmpfun(f2) > f3c <- cmpfun(f3) > f4c <- cmpfun(f4) > f5c <- cmpfun(f5) > f6c <- cmpfun(f6) > > and run them: > > > system.time(f1(x, iA)) > [1] 5.43 0.04 5.56 0.00 0.00 > > system.time(f1c(x, iA)) > [1] 1.77 0.03 1.81 0.00 0.00 > > > system.time(f2(x, iA)) > [1] 1.72 0.01 1.74 0.00 0.00 > > system.time(f2c(x, iA)) > [1] 0.63 0.00 0.63 0.00 0.00 > > > system.time(f3(x, iA)) > [1] 0.19 0.00 0.20 0.00 0.00 > > system.time(f3c(x, iA)) > [1] 0.14 0.00 0.15 0.00 0.00 > > > system.time(f4(x, iA)) > [1] 3.78 0.03 3.82 0.00 0.00 > > system.time(f4c(x, iA)) > [1] 1.26 0.02 1.30 0.00 0.00 > > > system.time(f5(x, iA)) > [1] 0.99 0.00 1.00 0.00 0.00 > > system.time(f5c(x, iA)) > [1] 0.30 0.00 0.31 0.00 0.00 > > > system.time(f6(x, iA)) > [1] 0.21 0.00 0.23 0.00 0.00 > > system.time(f6c(x, iA)) > [1] 0.17 0.00 0.20 0.00 0.00 > > I'll let you do the arithmetic. The byte compiler does get > rid of a fair bit of interpreter overhead (which is large in > these kinds of examples compared to most real code) but there > is still considerable room for improvement. The byte code > engine currently uses the same internal C code for doing the > actual work as the interpreter, so improvements there would > help both interpreted and byte compiled code. > > Best, > > luke > > On Fri, 22 Apr 2005, Vadim Ogranovich wrote: > > > If we are on the subject of byte compilation, let me bring > a couple of > > examples which have been puzzling me for some time. I'd > like to know > > a) if the compilation will likely to improve the > performance for this > > type of computations, and b) at least roughly understand > the reasons > > for the observed numbers, specifically why x[i]<- assignment is so > > much slower than x[i] extraction. > > > > The loops below are typical in any recursive calculation > where the new > > value of a vector is based on its immediate neighbor say to > the left. > > Specifically we assign the previous value to the current element. > > > > # this shows that the assignment x[i]<- is the bottleneck > in the loop > >> n = 1e6; iA = seq(2,n); x = double(n); system.time(for (i > in iA) x[i] > > = x[i-1]) > > [1] 4.29 0.00 4.30 0.00 0.00 > >> n = 1e6; iA = seq(2,n); x = double(n); system.time(for (i in iA) > > x[i-1]) > > [1] 1.46 0.01 1.46 0.00 0.00 > > > > # the overhead of the loop itself is reasonably low, just 0.17 sec > >> n = 1e6; iA = seq(2,n); x = double(n); system.time(for (i in iA) 1) > > [1] 0.17 0.01 0.17 0.00 0.00 > > > > # pure assignment (w/o the extraction x[i]) takes 3.09 sec. > Thus x[i] > > as extraction is (3.09 - 0.17)/(0.79 - 0.17) = 4.7 times > faster than > > x[i]<- as assignment. This looks a bit odd. > >> n = 1e6; iA = seq(2,n); x = double(n); system.time(for (i > in iA) x[i] > > = 1.0) > > [1] 3.08 0.00 3.09 0.00 0.00 > > > > > > # this shows that just evaluation of (i-1) takes about > (0.79 - 0.24) > > 0.55 sec on my machine (AMD 64 bit). Looks too slow. > >> n = 1e6; iA = seq(2,n); x = double(n); system.time(for (i > in iA) i-1) > > [1] 0.79 0.00 0.79 0.00 0.00 > >> n = 1e6; iA = seq(2,n); x = double(n); system.time(for (i in iA) i) > > [1] 0.24 0.01 0.24 0.00 0.00 > > > > Thanks, > > Vadim > > > >> -----Original Message----- > >> From: r-help-bounces@stat.math.ethz.ch > >> [mailto:r-help-bounces@stat.math.ethz.ch] On Behalf Of Luke Tierney > >> Sent: Friday, April 22, 2005 7:33 AM > >> To: Peter Dalgaard > >> Cc: Jason Liao; r-help@stat.math.ethz.ch > >> Subject: Re: [R] when can we expect Prof Tierney's compiled R? > >> > >> On Wed, 20 Apr 2005, Peter Dalgaard wrote: > >> > >>> Luke Tierney <luke@stat.uiowa.edu> writes: > >>> > >>>> Vectorized operations in R are also as fast as compiled > C (because > >>>> that is what they are :-)). A compiler such as the one > >> I'm working > >>>> on will be able to make most difference for > >> non-vectorizable or not > >>>> very vectorizable code. It may also be able to reduce the > >> need for > >>>> intermediate allocations in vectorizable code, which may > >> have other > >>>> benefits beyond just speed improvements. > >>> > >>> Actually, it has struck me a couple of times that these > >> operations are > >>> not as fast as they could be, since they are outside the > >> scope of fast > >>> BLAS routines, but "embarrassingly parallel" code could easily be > >>> written for the relevant hardware. Even on uniprocessor > >> systems there > >>> might be speedups that the C compiler cannot find (e.g. > because it > >>> cannot assume that source and destination of the operation are > >>> distinct). > >> > >> My guess is that for anything beyond basic operations we > are doing OK > >> on uniprocessors. but it would be useful to do some testing to be > >> sure. For the basic operations I suspect we are paying a > heavy price > >> for the way we handle recycling, both in terms of overhead as such > >> and in terms of inhibiting compiler optimizations. For > performance it > >> would probably be better to code the scalar-vector, > >> equal-length-vector, and general cases separately, though > keeping the > >> code maintainable may be a bit of a challenge. Again testing on a > >> range of platforms and compilers would be useful. > >> > >> With multiprocessors likely to become more widely > available it would > >> be good to look into ways of factoring the vectorized math > code so we > >> can slide in one that uses threads when approprate. This should > >> dovetail nicely with compilation to identify larger vectorized > >> expressions that can be parallelized as a unit; I hope to > look into > >> this a bit this summer. > >> > >> luke > >> > >> > >> > >> -- > >> Luke Tierney > >> Chair, Statistics and Actuarial Science Ralph E. Wareham > Professor of > >> Mathematical Sciences > >> University of Iowa Phone: 319-335-3386 > >> Department of Statistics and Fax: 319-335-3017 > >> Actuarial Science > >> 241 Schaeffer Hall email: luke@stat.uiowa.edu > >> Iowa City, IA 52242 WWW: http://www.stat.uiowa.edu > >> > >> ______________________________________________ > >> R-help@stat.math.ethz.ch mailing list > >> https://stat.ethz.ch/mailman/listinfo/r-help > >> PLEASE do read the posting guide! > >> http://www.R-project.org/posting-guide.html > >> > > > > > > -- > Luke Tierney > Chair, Statistics and Actuarial Science > Ralph E. Wareham Professor of Mathematical Sciences > University of Iowa Phone: 319-335-3386 > Department of Statistics and Fax: 319-335-3017 > Actuarial Science > 241 Schaeffer Hall email: luke@stat.uiowa.edu > Iowa City, IA 52242 WWW: http://www.stat.uiowa.edu >
Prof Brian Ripley
2005-Apr-27  10:13 UTC
[Rd] RE: [R] when can we expect Prof Tierney's compiled R?
On Tue, 26 Apr 2005, Vadim Ogranovich wrote:> Thank you for sharing the benchmark results. The improvement is very > substantial, I am looking forward to the release of the byte compiler! > > The arithmetic shows that x[i]<- is still the bottleneck. I suspect that > this is due to a very involved dispatching/search for the appropriate > function on the C level. There might be significant gain if loops > somehow cached the result of the initial dispatching. This is what you > probably referred to as additional improvements in the engine itself.I'd be surprised if dispatching were the issue: have you (C-level) profiled to find out? Please do so: these statements do tend to get perpetuated as fact. You cannot cache the result, as [<- can change the class of x, as could other operations done by the rhs (e.g. if it were x[i] <- g(x, i) the function g could change its argument). There are a large number of allocations going on (e.g. you need to create a length-one vector x[i-1]), and> gc.time(TRUE)[1] 0 0 0 0 0> system.time(f1(x, iA), gcFirst=TRUE)[1] 6.46 0.02 6.49 0.00 0.00> gc.time()[1] 1.83 1.30 1.85 0.00 0.00 (although note the comment in ?gc.time). It is typical that gc()-ing is a major component of the time for such simple tasks, and allocations can be even more. Note the use of gcFirst=TRUE (one could gc() before gc.time to be even fairer). Comparable figures for f2> system.time(f2(x, iA), gcFirst=TRUE)[1] 2.13 0.00 2.13 0.00 0.00> gc.time()[1] 0.69 0.54 0.71 0.00 0.00 It's quite typical for gc()-ing to take 1/3 of the time (as reported by gc.time) in trivial problems like this.> Though the examples are very simple, I think they are typical of any > simulation of dynamic systems where the new state is computed as a > transformation of the previous state. In my example, the f1(), the state > was a scalar and the transformation was the identity.I don't believe they are typical, as the transformation is so trivial. Of course, `simulation of dynamic systems' is not a typical task for R.> In any case the timing of the byte-compilation is remarkable!> Thanks, > Vadim > > > >> -----Original Message----- >> From: Luke Tierney [mailto:luke@stat.uiowa.edu] >> Sent: Tuesday, April 26, 2005 5:50 PM >> To: Vadim Ogranovich >> Cc: Peter Dalgaard; Jason Liao; r-devel@stat.math.ethz.ch >> Subject: RE: [R] when can we expect Prof Tierney's compiled R? >> >> For what it's worth (probably not much as these simple >> benchmarks are rarely representative of real code and so need >> to be taken with a huge chunk of salt) here is what happens >> with your examples in R 2.1.0 with the current byte compiler. >> >> Define your examples as functions: >> >> n = 1e6; iA = seq(2,n); x = double(n); >> f1 <- function(x, iA) for (i in iA) x[i] = x[i-1] >> f2 <- function(x, iA) for (i in iA) x[i-1] >> f3 <- function(x, iA) for (i in iA) 1 >> f4 <- function(x, iA) for (i in iA) x[i] = 1.0 >> f5 <- function(x, iA) for (i in iA) i-1 >> f6 <- function(x, iA) for (i in iA) i >> >> Make byte compiled versions: >> >> f1c <- cmpfun(f1) >> f2c <- cmpfun(f2) >> f3c <- cmpfun(f3) >> f4c <- cmpfun(f4) >> f5c <- cmpfun(f5) >> f6c <- cmpfun(f6) >> >> and run them: >> >> > system.time(f1(x, iA)) >> [1] 5.43 0.04 5.56 0.00 0.00 >> > system.time(f1c(x, iA)) >> [1] 1.77 0.03 1.81 0.00 0.00 >> >> > system.time(f2(x, iA)) >> [1] 1.72 0.01 1.74 0.00 0.00 >> > system.time(f2c(x, iA)) >> [1] 0.63 0.00 0.63 0.00 0.00 >> >> > system.time(f3(x, iA)) >> [1] 0.19 0.00 0.20 0.00 0.00 >> > system.time(f3c(x, iA)) >> [1] 0.14 0.00 0.15 0.00 0.00 >> >> > system.time(f4(x, iA)) >> [1] 3.78 0.03 3.82 0.00 0.00 >> > system.time(f4c(x, iA)) >> [1] 1.26 0.02 1.30 0.00 0.00 >> >> > system.time(f5(x, iA)) >> [1] 0.99 0.00 1.00 0.00 0.00 >> > system.time(f5c(x, iA)) >> [1] 0.30 0.00 0.31 0.00 0.00 >> >> > system.time(f6(x, iA)) >> [1] 0.21 0.00 0.23 0.00 0.00 >> > system.time(f6c(x, iA)) >> [1] 0.17 0.00 0.20 0.00 0.00 >> >> I'll let you do the arithmetic. The byte compiler does get >> rid of a fair bit of interpreter overhead (which is large in >> these kinds of examples compared to most real code) but there >> is still considerable room for improvement. The byte code >> engine currently uses the same internal C code for doing the >> actual work as the interpreter, so improvements there would >> help both interpreted and byte compiled code. >> >> Best, >> >> luke >> >> On Fri, 22 Apr 2005, Vadim Ogranovich wrote: >> >>> If we are on the subject of byte compilation, let me bring >> a couple of >>> examples which have been puzzling me for some time. I'd >> like to know >>> a) if the compilation will likely to improve the >> performance for this >>> type of computations, and b) at least roughly understand >> the reasons >>> for the observed numbers, specifically why x[i]<- assignment is so >>> much slower than x[i] extraction. >>> >>> The loops below are typical in any recursive calculation >> where the new >>> value of a vector is based on its immediate neighbor say to >> the left. >>> Specifically we assign the previous value to the current element. >>> >>> # this shows that the assignment x[i]<- is the bottleneck >> in the loop >>>> n = 1e6; iA = seq(2,n); x = double(n); system.time(for (i >> in iA) x[i] >>> = x[i-1]) >>> [1] 4.29 0.00 4.30 0.00 0.00 >>>> n = 1e6; iA = seq(2,n); x = double(n); system.time(for (i in iA) >>> x[i-1]) >>> [1] 1.46 0.01 1.46 0.00 0.00 >>> >>> # the overhead of the loop itself is reasonably low, just 0.17 sec >>>> n = 1e6; iA = seq(2,n); x = double(n); system.time(for (i in iA) 1) >>> [1] 0.17 0.01 0.17 0.00 0.00 >>> >>> # pure assignment (w/o the extraction x[i]) takes 3.09 sec. >> Thus x[i] >>> as extraction is (3.09 - 0.17)/(0.79 - 0.17) = 4.7 times >> faster than >>> x[i]<- as assignment. This looks a bit odd. >>>> n = 1e6; iA = seq(2,n); x = double(n); system.time(for (i >> in iA) x[i] >>> = 1.0) >>> [1] 3.08 0.00 3.09 0.00 0.00 >>> >>> >>> # this shows that just evaluation of (i-1) takes about >> (0.79 - 0.24) >>> 0.55 sec on my machine (AMD 64 bit). Looks too slow. >>>> n = 1e6; iA = seq(2,n); x = double(n); system.time(for (i >> in iA) i-1) >>> [1] 0.79 0.00 0.79 0.00 0.00 >>>> n = 1e6; iA = seq(2,n); x = double(n); system.time(for (i in iA) i) >>> [1] 0.24 0.01 0.24 0.00 0.00 >>> >>> Thanks, >>> Vadim >>> >>>> -----Original Message----- >>>> From: r-help-bounces@stat.math.ethz.ch >>>> [mailto:r-help-bounces@stat.math.ethz.ch] On Behalf Of Luke Tierney >>>> Sent: Friday, April 22, 2005 7:33 AM >>>> To: Peter Dalgaard >>>> Cc: Jason Liao; r-help@stat.math.ethz.ch >>>> Subject: Re: [R] when can we expect Prof Tierney's compiled R? >>>> >>>> On Wed, 20 Apr 2005, Peter Dalgaard wrote: >>>> >>>>> Luke Tierney <luke@stat.uiowa.edu> writes: >>>>> >>>>>> Vectorized operations in R are also as fast as compiled >> C (because >>>>>> that is what they are :-)). A compiler such as the one >>>> I'm working >>>>>> on will be able to make most difference for >>>> non-vectorizable or not >>>>>> very vectorizable code. It may also be able to reduce the >>>> need for >>>>>> intermediate allocations in vectorizable code, which may >>>> have other >>>>>> benefits beyond just speed improvements. >>>>> >>>>> Actually, it has struck me a couple of times that these >>>> operations are >>>>> not as fast as they could be, since they are outside the >>>> scope of fast >>>>> BLAS routines, but "embarrassingly parallel" code could easily be >>>>> written for the relevant hardware. Even on uniprocessor >>>> systems there >>>>> might be speedups that the C compiler cannot find (e.g. >> because it >>>>> cannot assume that source and destination of the operation are >>>>> distinct). >>>> >>>> My guess is that for anything beyond basic operations we >> are doing OK >>>> on uniprocessors. but it would be useful to do some testing to be >>>> sure. For the basic operations I suspect we are paying a >> heavy price >>>> for the way we handle recycling, both in terms of overhead as such >>>> and in terms of inhibiting compiler optimizations. For >> performance it >>>> would probably be better to code the scalar-vector, >>>> equal-length-vector, and general cases separately, though >> keeping the >>>> code maintainable may be a bit of a challenge. Again testing on a >>>> range of platforms and compilers would be useful. >>>> >>>> With multiprocessors likely to become more widely >> available it would >>>> be good to look into ways of factoring the vectorized math >> code so we >>>> can slide in one that uses threads when approprate. This should >>>> dovetail nicely with compilation to identify larger vectorized >>>> expressions that can be parallelized as a unit; I hope to >> look into >>>> this a bit this summer. >>>> >>>> luke >>>> >>>> >>>> >>>> -- >>>> Luke Tierney >>>> Chair, Statistics and Actuarial Science Ralph E. Wareham >> Professor of >>>> Mathematical Sciences >>>> University of Iowa Phone: 319-335-3386 >>>> Department of Statistics and Fax: 319-335-3017 >>>> Actuarial Science >>>> 241 Schaeffer Hall email: luke@stat.uiowa.edu >>>> Iowa City, IA 52242 WWW: http://www.stat.uiowa.edu >>>> >>>> ______________________________________________ >>>> R-help@stat.math.ethz.ch mailing list >>>> https://stat.ethz.ch/mailman/listinfo/r-help >>>> PLEASE do read the posting guide! >>>> http://www.R-project.org/posting-guide.html >>>> >>> >>> >> >> -- >> Luke Tierney >> Chair, Statistics and Actuarial Science >> Ralph E. Wareham Professor of Mathematical Sciences >> University of Iowa Phone: 319-335-3386 >> Department of Statistics and Fax: 319-335-3017 >> Actuarial Science >> 241 Schaeffer Hall email: luke@stat.uiowa.edu >> Iowa City, IA 52242 WWW: http://www.stat.uiowa.edu >> > > ______________________________________________ > R-devel@stat.math.ethz.ch mailing list > https://stat.ethz.ch/mailman/listinfo/r-devel > >-- Brian D. Ripley, ripley@stats.ox.ac.uk Professor of Applied Statistics, http://www.stats.ox.ac.uk/~ripley/ University of Oxford, Tel: +44 1865 272861 (self) 1 South Parks Road, +44 1865 272866 (PA) Oxford OX1 3TG, UK Fax: +44 1865 272595