Seeliger.Curt at epamail.epa.gov
2011-Apr-19  19:52 UTC
[R] doSMP package works better than perfect, at least sometimes.
Some might have noticed that REvolution Computing released the doSMP 
package to the general public about a month and a half ago, which allows 
multiple cores to be accessed for parallel computation in R.  Some of our 
physical habitat calculations were taking an extraordinary amount of time 
to complete and required over-weekend runs, which prompted our interest in 
this package.  What follows are the results of our tests.
In brief, the toy test sped up the calculations to a plausible degree, 
depending on the number of workers (cores? threads?) used.  Timing of our 
real world application gave results that were better than perfect.  In 
fact, they were staggeringly better than perfect.  Maybe someone can 
suggest why.  Also in brief, I'd like to thank REvolution for providing us 
with this really great package.
Our habitat metrics are based on a for-loop construct that is difficult to 
vectorize, so a toy test was developed (code given below) which loops 
through simple sqrt() calculations in a way one might find in Burns' third 
circle of Hell.  Short loops were used to cause thrashing during 
processor assignment, and longer ones to simulate 'harder' or more 
time-consuming tasks.  The processing time of each set of tasks was 
measured for basic unvectorized for() looping, foreach() %do% looping, and 
foreach() %dopar% looping, using a 4 core Xeon PC running XP with 3.2 GB 
RAM.
Using 3 'workers', the increase in speed due to iteration with the 
foreach() %do% construct showed the expected amount of thrashing for 
small/easy calculations, with the internal overhead being overcome after 
roughly 10,000 total calculations.  The speedup from SMP relative to 
single-processor iteration started being worthwhile with only 10 groups, 
regardless of the group size.
Speedup of foreach() %do% construct relative to basic for():
     n    g=   1         10      100     1000
    10 0.6000000  0.6250000 0.900000 1.010499
   100 0.7230769  0.9333333 1.180000 1.231752
  1000 0.7968750  1.8987730 3.564801 2.078614
 10000 2.1724356 10.4700474 8.002192       NA
Speedup of foreach() %dopar% construct relative to foreach() %do% 
construct:
     n    g=    1       10      100     1000
    10 0.09803922 1.142857 1.875000 2.164773
   100 0.94202899 1.363636 2.702703 2.689359
  1000 0.81012658 1.429825 2.951413 2.602386
 10000 0.87239919 1.182743 1.548661       NA
Using 7 'workers', the increase in speed due to iteration with the 
foreach() %do% construct was not as close to the results with three 
'workers' as expected, though thrashing was still evident when the number 
of calculations was small.  The increase due to using multiple cores 
maxed out around 5.5, below the theoretically perfect 7x speedup, and was 
not consistently high for all conditions.  I'm not sure if this is system 
noise, or if some other constraint is influencing the results.
Speedup of foreach() %do% construct relative to basic for():
     n    g=  1        10       100     1000
    10 0.400000 1.1111111 0.9210526 1.037190
   100 0.650000 0.8831169 1.1215881 1.199677
  1000 0.768116 1.7843360 3.5691298 2.051362
 10000 1.981686 8.8194254 8.2673038       NA
Speedup of foreach() %dopar% construct relative to foreach() %do% 
construct:
     n    g=   1       10      100     1000
    10 0.8333333 1.285714 4.222222 3.751938
   100 0.9523810 1.452830 5.302632 5.516474
  1000 0.9409091 1.284257 3.123677 3.848393
 10000 0.8640463 1.073046 1.609020       NA
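To put the best %dopar% numbers in perspective, here is a rough sketch of the parallel efficiency (speedup divided by number of workers) for the largest value in each of the two %dopar% tables above.  The variable names are just mine for illustration; the numbers are copied from the tables.

```r
# Best observed %dopar% speedups, copied from the two tables above.
best_speedup <- c(workers3 = 2.951413, workers7 = 5.516474)
workers      <- c(workers3 = 3,        workers7 = 7)

# Parallel efficiency: fraction of the ideal (linear) speedup achieved.
# 1.0 would mean every worker contributed a full unit of speedup.
efficiency <- best_speedup / workers
print(round(efficiency, 2))
```

This only looks at the best case in each table; most of the other cells are well below these values.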
The real world test was to time our residual pool calculations for about 
1200 channels (80-150 depths recorded in each) on the same machine using 7 
'workers'.  This had previously taken 32 hours and 2 minutes, judging by
the timestamp of the intermediate files created during calculation.  With 
doSMP the calculations took 7 minutes and the results were identical. 
Nothing in the toy tests would have indicated we'd see these calculations 
sped up by a factor of 275.  Since 275 is much larger than 7, this is due 
to more than just making unused cores available, and I suspect it's due to 
internal compilation.  A quick check of the docs does not support this 
conjecture.  Does anyone have a better explanation?
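For anyone wanting to check the arithmetic behind that factor, a minimal sketch using only the times quoted above (the variable names are mine, and the 32 h 2 min figure is itself an estimate from file timestamps):

```r
serial_minutes   <- 32 * 60 + 2   # 32 h 2 min, judged from file timestamps
parallel_minutes <- 7             # doSMP run
workers          <- 7

speedup    <- serial_minutes / parallel_minutes
per_worker <- speedup / workers   # 1 would be perfect linear scaling

cat(sprintf("speedup: %.1f, per worker: %.1f\n", speedup, per_worker))
# speedup: 274.6, per worker: 39.2
```

A per-worker figure far above 1 is what makes this "better than perfect": adding cores alone cannot account for it.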
Thanks for your input,
curt
ps - Thanks to REvolution for releasing this package.  They occasionally 
get kicked for their closed-source addon to R, but it's clear that their 
releases of packages like doSMP and foreach are important contributions to 
the community.
###### Toy test code follows:######
# Toy SMP
memory.limit(3000)
require(doSMP)
require(reshape2)
getDoParWorkers()
w<- startWorkers(workerCount=3)
registerDoSMP(w)
timeSMP <- function(g, n)
# g = number of groups to process
# n = size of each group.
{
  for(rep in 1:3) {
      times <- NULL
      dd <- data.frame(k=rep(1:g, n), x=runif(g*n))
      ddSplit <- split(dd, dd$k)
      tt<-system.time({
        dd2 <- foreach(e=names(ddSplit), .combine=rbind) %dopar% {   # SMP
                   elem <- ddSplit[[e]]
                   for (i in 1:nrow(elem)) {
                       elem$y[i] <- sqrt(elem$x[i])
                   }
                   elem
               }
      })
 
      times <- rbind(times,
                     as.data.frame(cbind(t(tt), g=g, n=n, method='SMPVectorized')))
      tt<-system.time({
        dd3 <- foreach(e=names(ddSplit), .combine=rbind) %do% {  # Single core
                   elem <- ddSplit[[e]]
                   for (i in 1:nrow(elem)) {
                       elem$y[i] <- sqrt(elem$x[i])
                   }
                   elem
               }
      })
      times <- rbind(times,
                     as.data.frame(cbind(t(tt), g=g, n=n, method='1CoreVectorized')))
      dd4<-NULL
      tt<-system.time({  # loop through list elements
               for (e in names(ddSplit)) {
                   elem <- ddSplit[[e]]
                   for (i in 1:nrow(elem)) {
                       elem$y[i] <- sqrt(elem$x[i])
                   }
                   dd4 <- rbind(dd4, elem)
               }
      })
      times <- rbind(times,
                     as.data.frame(cbind(t(tt), g=g, n=n, method='unvectorized')))
      write.table(times, file='c:/r/dosmpTest.csv', append=TRUE,
                  row.names=FALSE, sep=',')
 
  } # end of repetition loop
 
}
summarizeTimes <- function(fname)
# Summarize timing results and display them.
{
  # read in results, format columns and make methods more
  # 'variable-name friendly'.
  times <- read.csv(fname, stringsAsFactors=FALSE)
  # drop the header rows repeated by each appended write.table() call
  times <- subset(times, user.self != 'user.self',
                  select=-c(user.child,sys.child))
  times$user.self <- as.numeric(times$user.self)
  times$sys.self <- as.numeric(times$sys.self)
  times$elapsed <- as.numeric(times$elapsed)
  times$g <- as.numeric(times$g)
  times$n <- as.numeric(times$n)
  # Summarize
  stats <- merge(aggregate(list(meanElapsed=times$elapsed)
                          ,list(g=times$g, n=times$n, method=times$method)
                          ,mean, na.rm=TRUE
                          )
                ,aggregate(list(meanSelf=times$user.self)
                          ,list(g=times$g, n=times$n, method=times$method)
                          ,mean, na.rm=TRUE
                          )
                ,by=c('g','n','method')
                )
  # transpose to wide
  mm <- melt(stats, id=c('g','n','method'))
  tstats <- dcast(mm, g + n ~ variable+method)
  tstats$speedup.elapsed1 <- tstats$meanElapsed_unvectorized /
                             tstats$meanElapsed_1CoreVectorized
  tstats$speedup.elapsed3 <- tstats$meanElapsed_1CoreVectorized /
                             tstats$meanElapsed_SMPVectorized
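  # value.var is the argument name in current reshape2 releases
  # (this was written as value_var against an older release).
  speedupVectorizing <- dcast(tstats[c('g','n','speedup.elapsed1')], g~n,
                              value.var='speedup.elapsed1')
  speedupSMP <- dcast(tstats[c('g','n','speedup.elapsed3')], g~n,
                      value.var='speedup.elapsed3')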
  return(list(vectoring=speedupVectorizing, smp=speedupSMP))
}
timeSMP(10,1)                 # make it thrash as much as possible
timeSMP(100,1)
timeSMP(1000,1)
timeSMP(10000,1)
#timeSMP(100000,1)           # too much memory
#timeSMP(1000000,1)          # too much memory
timeSMP(10,10)
timeSMP(100,10)
timeSMP(1000,10)
timeSMP(10000,10)
timeSMP(10,100)
timeSMP(100,100)
timeSMP(1000,100)
timeSMP(10000,100)
timeSMP(10,1000)
timeSMP(100,1000)
timeSMP(1000,1000)
timeSMP(10000,1000)
timeSMP(10,10000)
timeSMP(100,10000)
# The following take up too much memory, even with a 3GB memory limit.
#timeSMP(1000,5000)
#timeSMP(5000,100)
#timeSMP(5000,1000)
#timeSMP(5000,5000)
summarizeTimes('c:/r/dosmpTest.csv')
-- 
Curt Seeliger, Data Ranger
Raytheon Information Services - Contractor to ORD
seeliger.curt@epa.gov
541/754-4638