Chris Evans
2015-Oct-17 16:18 UTC
[R] No speed up using the parallel package and ncpus > 1 with boot() on linux machines
I think I am failing to understand how boot() uses the parallel package on Linux machines. Using R 3.2.2 on three different machines with 2, 4 and 8 cores, I get a slowdown every time if I use "multicore" and "ncpus". Here's the code that creates a very simple reproducible example:

bootReps <- 500
seed <- 12345
set.seed(seed)
require(boot)
dat <- rnorm(500)
bootMean <- function(dat,ind) {
  mean(dat[ind])
}
start.time <- proc.time()
bootDat <- boot(dat,bootMean,bootReps)
boot.ci(bootDat,type="norm")
stop.time <- proc.time()
elapsed.time1 <- stop.time - start.time
require(parallel)
set.seed(seed)
start.time <- proc.time()
bootDat <- boot(dat,bootMean,bootReps,
                parallel="multicore",
                ncpus=2)
boot.ci(bootDat,type="norm")
stop.time <- proc.time()
elapsed.time2 <- stop.time - start.time
elapsed.time1
elapsed.time2

Running that on my old Dell Latitude E6500 running Debian Squeeze and using 32 bit R 3.2.2 gives me:

> bootReps <- 500
> seed <- 12345
> set.seed(seed)
> require(boot)
> dat <- rnorm(500)
> bootMean <- function(dat,ind) {
+   mean(dat[ind])
+ }
> start.time <- proc.time()
> bootDat <- boot(dat,bootMean,bootReps)
> boot.ci(bootDat,type="norm")
BOOTSTRAP CONFIDENCE INTERVAL CALCULATIONS
Based on 500 bootstrap replicates

CALL :
boot.ci(boot.out = bootDat, type = "norm")

Intervals :
Level      Normal
95%   (-0.0034,  0.1677 )
Calculations and Intervals on Original Scale
> stop.time <- proc.time()
> elapsed.time1 <- stop.time - start.time
> require(parallel)
> set.seed(seed)
> start.time <- proc.time()
> bootDat <- boot(dat,bootMean,bootReps,
+                 parallel="multicore",
+                 ncpus=2)
> boot.ci(bootDat,type="norm")
BOOTSTRAP CONFIDENCE INTERVAL CALCULATIONS
Based on 500 bootstrap replicates

CALL :
boot.ci(boot.out = bootDat, type = "norm")

Intervals :
Level      Normal
95%   (-0.0030,  0.1675 )
Calculations and Intervals on Original Scale
> stop.time <- proc.time()
> elapsed.time2 <- stop.time - start.time
> elapsed.time1
   user  system elapsed
  0.028   0.000   0.174
> elapsed.time2
   user  system elapsed
  4.336   2.572   0.166

That is a very slightly different 95% CI, reflecting the way that invoking parallel="multicore" changes the seed setting, and a huge deterioration in execution speed rather than any improvement.

On a more recent four core Toshiba and using ncpus=4, again on Debian Squeeze, 32bit R, I get exactly the same CIs and this timing:

> elapsed.time1
   user  system elapsed
  0.032   0.000   0.100
> elapsed.time2
   user  system elapsed
  0.032   0.020   0.049

and on a Mac Mini with eight cores on Squeeze but with 64bit R I get the same CIs and this timing:

> elapsed.time1
   user  system elapsed
  0.012   0.004   0.017
> elapsed.time2
   user  system elapsed
  0.032   0.012   0.024

I am clearly missing something, or perhaps something other than CPU power is choking the work: RAM? I've tried searching for similar reports on the web and was surprised to find nothing using what seemed plausible search strategies.

Anyone able to help me? I'd desperately like to get a marked speed up for some simulation work I'm doing on the Mac mini as it's taking days to run at the moment. The computationally intensive bits in the models are a bit more complicated than this (!), but most of the workload will be in the bootstrapping. The function I'm bootstrapping for real, although a bit more complex than a simple mean, isn't that complex, though it does involve a stratified bootstrap rather than a simple one. I see very similar marginal speed _losses_ invoking more than one core for that work, just as with this very simple example.

TIA,

Chris
Milan Bouchet-Valat
2015-Oct-17 17:13 UTC
[R] No speed up using the parallel package and ncpus > 1 with boot() on linux machines
On Saturday 17 October 2015 at 17:18 +0100, Chris Evans wrote:
> I think I am failing to understand how boot() uses the parallel
> package on linux machines, using R 3.2.2 on three different machines
> with 2, 4 and 8 cores all results in a slow down if I use "multicore"
> and "ncpus". Here's the code that creates a very simple reproducible
> example:
>
> [...]
>
> I see very similar marginal
> speed _losses_ invoking more than one core for that work just as with
> this very simple example.

Parallel execution is useful only when the operation you want to run takes enough time. Here, starting the workers takes more time than computing the means. You should try with a larger number of replicates, or a slower computation.

Regards

> TIA,
>
> Chris
>
> ______________________________________________
> R-help at r-project.org mailing list -- To UNSUBSCRIBE and more, see
> https://stat.ethz.ch/mailman/listinfo/r-help
> PLEASE do read the posting guide
> http://www.R-project.org/posting-guide.html
> and provide commented, minimal, self-contained, reproducible code.
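
A minimal sketch of the suggestion above, assuming a Unix-alike where parallel="multicore" can fork (on Windows boot() should simply run serially). The statistic slowStat and its artificial inner loop are hypothetical, invented only so that each replicate costs measurable CPU time; the replicate count is an arbitrary choice too. Timings depend entirely on the machine, so none are shown here.

```r
library(boot)

set.seed(12345)
dat <- rnorm(500)

# Hypothetical, deliberately slow statistic: repeated sorting is pure
# busy-work so that one replicate is no longer trivially cheap.
slowStat <- function(d, ind) {
  x <- d[ind]
  for (i in 1:50) x <- sort(x, decreasing = (i %% 2 == 0))
  mean(x, trim = 0.1)
}

bootReps <- 2000

# system.time() wraps the whole call, so worker start-up is included
serialTime <- system.time(
  bootSerial <- boot(dat, slowStat, R = bootReps)
)
parallelTime <- system.time(
  bootParallel <- boot(dat, slowStat, R = bootReps,
                       parallel = "multicore", ncpus = 2)
)

# Compare wall-clock (elapsed) time, not user time: with forked workers
# the CPU time is spread over several processes.
print(c(serial   = serialTime[["elapsed"]],
        parallel = parallelTime[["elapsed"]]))
```

Running the comparison a few times, and with bootReps raised further, should give a steadier picture than a single pass.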
Jeff Newmiller
2015-Oct-17 17:28 UTC
[R] No speed up using the parallel package and ncpus > 1 with boot() on linux machines
None of this is surprising. If the calculations you divide your work up into are small, then the overhead of communicating between parallel processes will be a relatively large penalty to pay. You have to break your problem up into larger chunks and depend on vector processing within processes to keep the cpu busy doing useful work.

Also, I am not aware of any model of Mac Mini that has 8 physical cores... 4 is the max. Virtual cores gain a logical simplification of multiprocessing but do not offer actual improved performance because there are only as many physical data paths and registers as there are cores.

Note that your problems are with long-running simulations... your examples are too small to demonstrate the actual balance of processing vs. communication overhead. Before you draw conclusions, try upping bootReps by a few orders of magnitude, and run your test code a couple of times to stabilize the memory conditions and obtain some consistency in timings.

I have never used the parallel option in the boot package before... I have always rolled my own to allow me to decide how much work to do within the worker processes before returning from them. (The communication overhead is particularly severe when using snow, but not necessarily something you can neglect with multicore.)

On Sat, 17 Oct 2015, Chris Evans wrote:

> I think I am failing to understand how boot() uses the parallel package on linux machines, using R 3.2.2 on three different machines with 2, 4 and 8 cores all results in a slow down if I use "multicore" and "ncpus". Here's the code that creates a very simple reproducible example:
>
> [...]

---------------------------------------------------------------------------
Jeff Newmiller                        The     .....       .....  Go Live...
DCN:<jdnewmil at dcn.davis.ca.us>     Basics: ##.#.       ##.#.  Live Go...
                                      Live:   OO#.. Dead: OO#..  Playing
Research Engineer (Solar/Batteries            O.O#.       #.O#.  with
/Software/Embedded Controllers)               .OO#.       .OO#.  rocks...1k
---------------------------------------------------------------------------
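
One way the "roll your own" idea above might look, as a rough sketch rather than Jeff's actual code: each worker receives one large chunk of replicates and does them in vectorised form before returning. The helper bootMeansChunk, the chunk sizes and the worker count are all hypothetical choices for illustration, and the sketch assumes a Unix-alike, since mclapply() with mc.cores > 1 is not supported on Windows.

```r
library(parallel)

set.seed(12345)
dat <- rnorm(500)

# Hypothetical helper: compute `reps` bootstrap means inside one worker.
# All resampling indices are drawn in one call and the means are taken
# with colMeans(), so the worker stays busy with vectorised work.
bootMeansChunk <- function(reps, dat) {
  n <- length(dat)
  idx <- sample.int(n, n * reps, replace = TRUE)
  colMeans(matrix(dat[idx], nrow = n))
}

nWorkers  <- 2
totalReps <- 20000

# One chunk per worker, so results come back only once per worker
chunkSizes <- rep(totalReps %/% nWorkers, nWorkers)
chunkSizes[1] <- chunkSizes[1] + totalReps %% nWorkers

# L'Ecuyer-CMRG gives each forked worker its own random number stream
RNGkind("L'Ecuyer-CMRG")
set.seed(12345)
res <- mclapply(chunkSizes, bootMeansChunk, dat = dat, mc.cores = nWorkers)
bootMeans <- unlist(res)

# Normal-approximation interval from the pooled replicates
print(mean(dat) + c(-1, 1) * qnorm(0.975) * sd(bootMeans))
```

This only covers a simple, unstratified mean; a stratified bootstrap would need the resampling done within strata inside the helper.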
Chris Evans
2015-Oct-18 09:23 UTC
[R] No speed up using the parallel package and ncpus > 1 with boot() on linux machines
----- Original Message -----
> From: "Milan Bouchet-Valat" <nalimilan at club.fr>
> To: "Chris Evans" <chrishold at psyctc.org>, r-help at r-project.org
> Sent: Saturday, 17 October, 2015 18:13:40
> Subject: Re: [R] No speed up using the parallel package and ncpus > 1 with boot() on linux machines

> On Saturday 17 October 2015 at 17:18 +0100, Chris Evans wrote:
>> I think I am failing to understand how boot() uses the parallel
>> package on linux machines, using R 3.2.2 on three different machines
>> with 2, 4 and 8 cores all results in a slow down if I use "multicore"
>> and "ncpus". Here's the code that creates a very simple reproducible
>> example:

... rest of my post deleted to save space ...

> Parallel execution is useful only when the operation you want to run
> takes enough time. Here, starting the workers takes more time than
> computing the means. You should try with a larger number of replicates,
> or a slower computation.

Aha. Makes perfect sense of course, and explains what I'm seeing both for this and for the real work, which also involves bootstrapping a pretty simple function.

Thank you, Milan,

Chris
Chris Evans
2015-Oct-18 09:31 UTC
[R] No speed up using the parallel package and ncpus > 1 with boot() on linux machines
As with Milan's answer: perfect explanation and hugely appreciated. A few follow up questions/comments below.

----- Original Message -----
> From: "Jeff Newmiller" <jdnewmil at dcn.davis.ca.us>
> To: "Chris Evans" <chrishold at psyctc.org>
> Cc: r-help at r-project.org
> Sent: Saturday, 17 October, 2015 18:28:12
> Subject: Re: [R] No speed up using the parallel package and ncpus > 1 with boot() on linux machines

> None of this is surprising. If the calculations you divide your work up
> into are small, then the overhead of communicating between parallel
> processes will be a relatively large penalty to pay. You have to break
> your problem up into larger chunks and depend on vector processing within
> processes to keep the cpu busy doing useful work.

Aha. Got it!

> Also, I am not aware of any model of Mac Mini that has 8 physical cores...
> 4 is the max. Virtual cores gain a logical simplification of
> multiprocessing but do not offer actual improved performance because
> there are only as many physical data paths and registers as there are
> cores.

Ah. Hadn't thought of that. It's a machine I rent; I thought it was a Mac mini. detectCores() reports 8 but perhaps they are virtual cores. /proc/cpuinfo says the processor is an Intel(R) Core(TM) i7-3615QM CPU @ 2.30GHz and shows 8 cores but again ... perhaps they are virtual. What's the best way to get a true core count?

> Note that your problems are with long-running simulations... your examples
> are too small to demonstrate the actual balance of processing vs.
> communication overhead. Before you draw conclusions, try upping bootReps
> by a few orders of magnitude, and run your test code a couple
> of times to stabilize the memory conditions and obtain some consistency
> in timings.

OK. Good advice again, but what you are saying, and the findings I had there, are pretty consistent with what I was seeing with long running things with bootReps up at 10k, and I think you've told me what I really want to know. I think the simplest way to parallelise may actually be fine for me: I'll run four (or maybe eight) separate R jobs (keeping an eye on swapping to make sure I'm not pushing beyond physical RAM; I don't think these simulations will).

> I have never used the parallel option in the boot package before... I have
> always rolled my own to allow me to decide how much work to do within the
> worker processes before returning from them. (This is particularly severe
> when using snow, but not necessarily something you can neglect with
> multicore.)

That sounds like an impressive and obviously pertinent approach. I think, as I say, I may be able to get away with a very simple approach that runs parallel simulations and then aggregates the data from each and analyses that.

Many thanks Jeff. Brilliant help.

Chris

> On Sat, 17 Oct 2015, Chris Evans wrote:
>
>> I think I am failing to understand how boot() uses the parallel package on linux

... rest of my original post deleted to save space ...
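
On the question of a true core count, here is a minimal sketch, assuming a Linux machine whose /proc/cpuinfo lists "physical id" and "core id" fields (not every kernel or virtual machine does). The helper countPhysicalCores is hypothetical, not part of any package. parallel::detectCores() also has a logical argument, but whether logical = FALSE is honoured depends on the platform and R version, so it may return the logical count or NA. As far as I know the i7-3615QM is a four-core part with Hyper-Threading, which would be consistent with 8 logical cores being reported.

```r
library(parallel)

# Hypothetical helper: count distinct (physical id, core id) pairs in a
# cpuinfo-style file; returns NA when the fields are not available.
countPhysicalCores <- function(cpuinfo = "/proc/cpuinfo") {
  if (!file.exists(cpuinfo)) return(NA_integer_)
  lines <- readLines(cpuinfo, warn = FALSE)
  getField <- function(name) {
    hits <- grep(paste0("^", name, "[[:space:]]*:"), lines, value = TRUE)
    trimws(sub("^[^:]*:", "", hits))
  }
  phys <- getField("physical id")
  core <- getField("core id")
  if (length(core) == 0L || length(phys) != length(core)) return(NA_integer_)
  nrow(unique(data.frame(phys = phys, core = core)))
}

print(detectCores())                 # logical CPUs (includes virtual cores)
print(detectCores(logical = FALSE))  # physical cores where supported, may be NA
print(countPhysicalCores())          # parsed from /proc/cpuinfo, may be NA
```

If the last two figures come out at half of the first, the extra "cores" are most likely hyper-threads, and four separate R jobs would be the more cautious starting point than eight.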