Hi Martin,

When I attended the LinuxWorld Expo in NYC back in January, I chatted
with some folks at the AMD booth, as well as guys from Penguin Computing
(where we bought our Opteron box).  I was told that the Opteron has a
somewhat unusual design in which the memory is controlled by one CPU.
The net effect is that when both CPUs are running, one might only be
running at around 90% instead of 99%.  The `NUMA' kernel is supposed to
fix this problem.  I wonder if this is related to the performance of the
threaded GOTO lib that you saw.  Has anyone tried the NUMA kernel?

Best,
Andy

> From: Martin Maechler
>
> >>>>> "BDR" == Prof Brian Ripley <ripley@stats.ox.ac.uk>
> >>>>>     on Fri, 27 Feb 2004 18:22:29 +0000 (GMT) writes:
>
>     BDR> On 27 Feb 2004, Douglas Bates wrote:
>     >> Martin Maechler <maechler@stat.math.ethz.ch> writes:
>     >>
>     >> > >>>>> "PD" == Peter Dalgaard <p.dalgaard@biostat.ku.dk>
>     >> > >>>>>     on 26 Feb 2004 15:44:16 +0100 writes:
>     >> >
>     >> >     PD> Douglas Bates <bates@stat.wisc.edu> writes:
>     >> >     >> Have you tried configuring R with Goto's BLAS
>     >> >     >>     http://www.cs.utexas.edu/users/kgoto/
>     >> >     >>
>     >> >     >> I haven't worked with Opteron or Athlon64 computers, but
>     >> >     >> I understand that Goto's BLAS are very effective on those
>     >> >     >> machines.  Furthermore, Goto's BLAS are (only) available
>     >> >     >> as .so libraries, so you don't need to mess with creating
>     >> >     >> the .so version.
>     >> >
>     >> >     PD> I tried it, yes.  Somewhat to my surprise, it seemed to
>     >> >     PD> be not quite as fast as the threaded ATLAS, but I wasn't
>     >> >     PD> very systematic about the benchmarking.
>     >> >
>     >> >     PD> (And the Goto items have license issues, which get in
>     >> >     PD> the way for binary distributions.)
>     >> >
>     >> > Thanks a lot, Peter, Brian and Doug, for your feedback!
>     >> > In the meantime, I have three running versions of R(-devel) on
>     >> > the 64-bit Opteron:
>     >> >   - "plain"
>     >> >   - linked against threaded GOTO
>     >> >   - linked against threaded (static) ATLAS (using -fPIC for
>     >> >     compilation; "large" Rlapack)
>     >> > and I find that GOTO is consistently faster than ATLAS (by
>     >> > ~ 5-20%) in several tests (square matrices; %*% and solve).
>     >> > ATLAS is still an order of magnitude faster than "plain" for
>     >> > 3000x3000 matrices.
>     >>
>     >> Would you be willing to post a brief summary of comparative
>     >> timings?
>     >>
>     >> I have thought at times that it may be worthwhile collecting
>     >> comparative timings for different combinations of processor, OS,
>     >> and memory size and speed on "typical" tasks in R.  As with any
>     >> benchmark, the results will be artificial, but they can be of
>     >> some help when considering what hardware to purchase.
>     >> Bioconductor users may find it particularly helpful to be able to
>     >> evaluate how much they will need to pay to be able to analyze
>     >> large data sets reasonably quickly.
>     >>
>     >> One easily-obtained timing is at the end of
>     >> $RSRC/tests/Examples/base-Ex.Rout after 'make; make check'.
>
>     BDR> That one is, I think, rather too artificial, as it contains few
>     BDR> even moderately large examples and is dominated by a few
>     BDR> atypical tasks.
>
>     BDR> I tend to use the sum of the MASS scripts as an informal
>     BDR> timing; ch06.R is also a pretty good indicator.
>
>     BDR> I think you will find that BLAS differences are pretty small in
>     BDR> real-life analyses, or at least I always have.
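The square-matrix tests Martin mentions (%*% and solve) are easy to
reproduce; here is a minimal sketch, where the size, the seed, and the
single repetition are illustrative rather than his exact setup:

    ## Minimal sketch of the %*% / solve timings mentioned above;
    ## n, the seed and the single repetition are illustrative only.
    set.seed(1)
    n <- 1000                        # Martin used 1000^2 and 3000^2
    A <- matrix(rnorm(n * n), n, n)
    system.time(A %*% A)             # dgemm through the linked BLAS
    system.time(solve(A))            # LU solve through LAPACK/BLAS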
> I've now done a bit more systematic testing, using more realistic code
> than the large-matrix (1000^2 and 3000^2) number crunching I did last
> week.
>
> As expected, the differences disappear for VR/scripts "ch06.R" (there's
> even a slight indication of GOTO being worse than no optimized BLAS,
> but that was probably a random fluctuation), and also for the
> "make check" outputs.
>
> Here is a nice R function that others can use as well for collecting
> the numbers from the "make check" (or better, "make check-all")
> outputs.  Note that it is interesting to also get the times for the
> recommended packages.
>
> #### After "make check-all" there are quite a few files with timings
> #### --------------
> #### Get at these
>
> ## In a Unix shell, it's as simple as
> ##   cd `R RHOME`/tests
> ##   grep '^Time elapsed' *.Rout Examples/*.Rout *.Rcheck/*.Rout
>
> checkTimes <- function(Rhome = R.home())
> {
>     ## Purpose: Collect the "Time elapsed" timings of R's
>     ##          "make check-all" into a numeric N x 3 matrix
>     ##          (with rownames!)
>     ## ------------------------------------------------------------
>     ## Author: Martin Maechler, Date: 1 Mar 2004, 15:27
>
>     tDir <- file.path(Rhome, "tests")
>     dirLs <- c(tDir, file.path(tDir, "Examples"),
>                file.path(tDir, list.files(tDir, pattern = "\\.Rcheck$")))
>     iniStr <- "^Time elapsed:"
>     endPat <- "\\.Rout$"
>     ir <- length(rr <- list())
>     for(d in dirLs) {
>         files <- list.files(d, pattern = endPat)
>         for(f in files) {
>             lls <- readLines(file.path(d, f))
>             if(length(i <- grep(iniStr, lls))) {
>                 tC <- textConnection(sub(iniStr, '', lls[i]))
>                 nCPU <- scan(tC, quiet = TRUE)
>                 close(tC)
>                 f <- sub(endPat, '', f)
>                 rr[[(ir <- ir + 1)]] <- list(f, nCPU[1:3])
>             }
>         }
>     }
>     ## transform the list into a matrix
>     t(matrix(sapply(rr, "[[", 2), 3, length(rr),
>              dimnames = list(NULL, sapply(rr, "[[", 1))))
> }
>
> -----------
>
> Now I measured on the AMD Opteron (64-bit, dual processor; 4 GB RAM):
>
> rM <- checkTimes()
> nn <- nrow(rM)
> ## Look at the values --- in sorted order
> iS <- sort.list(rM[, 1], decreasing = TRUE)
> rM[iS, ]
> plot(rM[iS, 3] / rM[iS, 1])
> ## not looking systematically --> only use "CPU[1]"
> tDir <- file.path(R.home(), "tests")  # (tDir above is local to checkTimes)
> plot(rM[iS, 1], type = 'h', xaxt = "n", xlab = '', ylab = "Time elapsed",
>      main = paste("CPU used for checks in", tDir))
>
> rM.A <- checkTimes("/usr/local/app/R/R-devel-ATLAS-inst")
> rM.G <- checkTimes("/usr/local/app/R/R-devel-GOTO-inst")
> rM.s <- checkTimes("/usr/local/app/R/R-devel-inst")
> iS <- sort.list(rM.A[, 1], decreasing = TRUE)
>
> cbind(ATLAS = rM.A[iS, 1],
>       GOTO  = rM.G[iS, 1],
>       std   = rM.s[iS, 1])
> ## gives
> ##                   ATLAS   GOTO    std
> ## boot-Ex           73.38  73.71  73.62
> ## nlme-Ex           31.92  34.18  31.91
> ## mgcv-Ex           29.20  31.69  29.35
> ## MASS-Ex           21.54  20.49  20.29
> ## stats-Ex          17.80  17.69  17.91
> ## lattice-Ex        11.38  11.37  11.05
> ## methods-Ex         6.87   6.53   6.58
> ## base-Ex            5.48   5.28   5.26
> ## graphics-Ex        4.71   4.73   4.70
> ## tools-Ex           3.86   3.66   3.82
> ## cluster-Ex         3.78   3.74   3.65
> ## utils-Ex           2.73   2.60   2.60
> ## p-r-random-tests   2.60   2.58   2.55
> ## survival-Ex        2.48   2.49   2.30
> ## ...
> ## .........
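One compact way to read this table is as ratios to the plain build; a
small sketch reusing the rM.A, rM.G, rM.s and iS objects defined just
above:

    ## Sketch: per-set times relative to the unoptimized build; values
    ## near 1 mean the optimized BLAS buys nothing for that example set.
    tab <- cbind(ATLAS = rM.A[iS, 1], GOTO = rM.G[iS, 1], std = rM.s[iS, 1])
    round(tab[, c("ATLAS", "GOTO")] / tab[, "std"], 2)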
> ## Graphic:
> pdf("CPU-checks.pdf")
> plot(rM.A[iS, 1], type = 'h', xaxt = "n", xlab = '', ylab = "Time elapsed",
>      main = "AMD Opteron 246: CPU for R 'make check-all' tests & Examples")
> iS. <- iS[1:12]
> text(1:12, rM.A[iS., 1], rownames(rM.A)[iS.],
>      adj = c(-.15, -.15), cex = 0.8)
> points(1:nn + .1, rM.G[iS, 1], type = 'h', col = 2)
> points(1:nn + .2, rM.s[iS, 1], type = 'h', col = 3)
> legend(par("usr")[2], par("usr")[4], c("ATLAS", "GOTO", " std "),
>        col = 1:3, lwd = 1, xjust = 1.1, yjust = 1.1)
> if(.Device == "pdf") dev.off()
>
> ### Are ATLAS or GOTO better than "standard"?
> matplot(1:nn, cbind(rM.A[iS, 1] / rM.s[iS, 1],
>                     rM.G[iS, 1] / rM.s[iS, 1]), type = 'p', col = 1:8)
> abline(h = 1, lty = 3, col = "gray")
> ## to the contrary!  the points would have to be *below* 1,
> ## and they are rather above it
>
> -------------------
>
> The PDF graphic is available as
>     ftp://ftp.stat.math.ethz.ch/U/maechler/R/CPU-checks.pdf
>
> ---
>
> However, when I run something like the following "non-small" lm()
> problem,
>
> ------------------------------------------------------------------
>
> ### Take a relatively large model.matrix() --- as in ./predict-lm.R
> ### "R BATCH --vanilla <this>"
>
> if(paste(R.version$major, R.version$minor, sep = ".") >= 1.7)
>     RNGversion("1.6")
> set.seed(47)
>
> ## Here: want the usual "noisy" model; almost no printing
> n <- 5000
> x <- rnorm(n)
> ldat <-
>     data.frame(x1 = x,
>                x2 = sort(5*x - rnorm(n)),
>                f1 = factor(pmin(12, rpois(n, lam = 5))),
>                f2 = factor(pmin(20, rpois(n, lam = 9))),
>                f3 = factor(pmin(32, rpois(n, lam = 12))))
> with(ldat,
>      ldat$y <<- 10 + 4*x1 + 2*x2 + rnorm(n) +
>          ## no rounding here:
>          10 * rnorm(nlevels(f1))[f1] +
>          100 * rnorm(nlevels(f2))[f2])
> str(ldat)
>
> mylm <- lm(y ~ .^2, data = ldat)
> proc.time()  ## (~= 100 sec on P4 1.6 GHz "lynne")
> str(mm <- model.matrix(mylm))
> smlm <- summary(mylm)
>
> p1 <- predict(mylm)
> p2 <- predict(mylm, type = "terms")
>
> str(myim <- influence.measures(mylm))
>
> ## R BATCH gives another "total" proc.time() here:
>
> ------------------------------------------------------------------
>
> things look a bit different.
>
> Timings (the first 3 numbers of proc.time());
> ATLAS measured only 3x, the others 5x:
>
> 1. after lm()   # grep -n '^\[1\] [^1]' lm-tst-2.Rout-opteron-*
>
> ATLAS:
>   34.56  0.56  35.57
>   33.90  0.59  34.57
>   34.55  0.61  35.33
> GOTO:
>   28.17  1.82  34.68
>   29.13  1.61  35.56
>   26.90  2.05  32.99
>   28.11  1.83  34.64
>   28.26  1.92  34.90
> std:
>   34.61  0.62  35.62
>   33.46  0.61  34.26
>   34.79  0.65  35.58
>   33.78  0.67  34.62
>   35.49  0.70  36.37
>
> 2. total for the above R script   # grep -n '^\[1\] 1' lm-tst-2.Rout-opteron-*
>
> ATLAS:
>   127.71   1.56  129.92
>   130.42   1.66  132.28
>   131.89   1.39  133.57
> GOTO:
>   129.51  25.17  212.02
>   129.56  26.93  215.06
>   137.36  27.43  221.95
>   139.83  28.76  226.64
>   137.40  27.98  221.86
> std:
>   159.58   1.59  161.88
>   155.65   1.48  157.59
>   159.01   1.67  161.21
>   167.13   1.57  168.97
>   166.70   1.58  168.70
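A reminder of what the three numbers measure helps when reading these
tables: proc.time()'s first element is user CPU time, summed over all
threads, while the third is wall-clock time, so the two can diverge
badly under a threaded BLAS.  A small illustration (matrix size is
arbitrary):

    ## proc.time() returns c(user, system, elapsed).  With a threaded
    ## BLAS, "user" accumulates CPU seconds across all threads, while
    ## "elapsed" is wall-clock time, so the two can differ markedly.
    pt0 <- proc.time()
    A <- matrix(rnorm(1000^2), 1000)   # arbitrary illustrative size
    invisible(A %*% A)
    print(proc.time() - pt0)           # compare [1] user with [3] elapsed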
> This is a bit confusing to me: the picture differs considerably
> depending on whether I "believe" the first number of proc.time(),
> say PT[1], or the third one, PT[3].
>
> Only using PT[1] --- which is what I have usually done --- may be
> quite wrong here.  Contrary to ATLAS and "std", GOTO shows a large
> difference between PT[3] and PT[1], which may be a consequence of the
> way the threading and the use of the two CPUs happen:
>
> PT[1]:
>   GOTO is about 20% faster than ATLAS (which is basically the same as
>   "standard", i.e. the R-internal BLAS/LAPACK) for the first lm()
>   measurement, but for the overall time {which adds summary.lm(),
>   influence.measures() etc.} GOTO and ATLAS are basically the same
>   speed, both 20% faster than "standard".
>
> PT[3]:
>   For the lm() part itself: no difference.
>   For the total: ATLAS >> std >> GOTO
>                  (where ' >> ' := "clearly better than").
>
> ---
>
> Comments welcome,
>
> Martin Maechler <maechler@stat.math.ethz.ch>
> http://stat.ethz.ch/~maechler/
> Seminar fuer Statistik, ETH-Zentrum  LEO C16, Leonhardstr. 27
> ETH (Federal Inst. Technology)  8092 Zurich  SWITZERLAND
> phone: x-41-1-632-3408  fax: ...-1228  <><
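For anyone who wants to try Brian Ripley's informal MASS-scripts timing
from the top of the thread, a hedged sketch; the scripts directory is
located via system.file(), assuming the MASS package is installed:

    ## Hedged sketch: time the ch06.R script that Ripley uses as an
    ## informal benchmark (assumes MASS is installed; path may vary).
    scr <- file.path(system.file("scripts", package = "MASS"), "ch06.R")
    system.time(source(scr, echo = FALSE))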
Peter Dalgaard
2004-Mar-02 16:21 UTC
[Rd] Some timings for 64 bit Opteron (ATLAS, GOTO, std)
"Liaw, Andy" <andy_liaw@merck.com> writes:> Hi Martin, > > When I attended the LinuxWorld Expo in NYC back in January, I chatted with > some folks at the AMD booth, as well as guys from Penguin Computing (where > we bought our Opteron box). I was told that the Operton has this somewhat > strange setup that the memory is controlled by one CPU. The net effect of > this being that when both CPUs are running, one might only be running at > around 90% instead of 99%. The `NUMA' kernel is supposed to fix this > problem. I wonder if this is related to the performance of the threaded > GOTO lib that you saw. Has anyone tried the NUMA kernel? > > Best, > AndyMy understanding is slightly different (I could be wrong though, I'm hardly a hardware engineer): Each CPU controls one block of memory, and only some motherboard have memory slots for both CPUs. If CPU2 wants to talk to CPU1's memory it has to ask CPU1 for it, with the obvious potential for a performance hit. I'll see if I can get around to redoing my Opteron builds and trying Martin's benchmarks in the next couple of days. -p -- O__ ---- Peter Dalgaard Blegdamsvej 3 c/ /'_ --- Dept. of Biostatistics 2200 Cph. N (*) \(*) -- University of Copenhagen Denmark Ph: (+45) 35327918 ~~~~~~~~~~ - (p.dalgaard@biostat.ku.dk) FAX: (+45) 35327907
My understanding is far below `imperfect'!  However, our box has all 8
slots filled with 2GB PC2100s...  not a whole lot of room for change.

Best,
Andy

> From: rossini@blindglobe.net [mailto:rossini@blindglobe.net]
>
> And my understanding is completely imperfect, but note that the
> organization (and selection) of memory modules on the motherboard can
> be critical for speed with the Opterons; this is a well-known issue
> on the Beowulf lists.
>
> best,
> -tony
>
> Peter Dalgaard <p.dalgaard@biostat.ku.dk> writes:
>
> > My understanding is slightly different (I could be wrong, though;
> > I'm hardly a hardware engineer): each CPU controls one block of
> > memory, and only some motherboards have memory slots for both CPUs.
> > If CPU2 wants to talk to CPU1's memory, it has to ask CPU1 for it,
> > with the obvious potential for a performance hit.
> >
> > I'll see if I can get around to redoing my Opteron builds and
> > trying Martin's benchmarks in the next couple of days.
> >
> >    -p
>
> --
> rossini@u.washington.edu
> http://www.analytics.washington.edu/
> Biomedical and Health Informatics, University of Washington
> Biostatistics, SCHARP/HVTN, Fred Hutchinson Cancer Research Center
> UW    (Tu/Th/F): 206-616-7630  FAX=206-543-3461 | Voicemail is unreliable
> FHCRC (M/W):     206-667-7025  FAX=206-667-4812 | use Email
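If the extra wall-clock time under GOTO comes from threads contending
for one CPU's memory bank, a quick diagnostic is to pin the library to
a single thread and re-run the lm script.  A hedged sketch; the
environment variable name is taken from the GotoBLAS documentation of
the time and should be verified for your build:

    ## Hedged sketch: force the threaded BLAS to one thread before
    ## starting R (variable name per GotoBLAS docs; verify for your
    ## build), e.g. in the shell:
    ##     GOTO_NUM_THREADS=1 R BATCH --vanilla lm-tst-2.R
    Sys.getenv("GOTO_NUM_THREADS")   # what the current session sees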