Hi Martin,

When I attended the LinuxWorld Expo in NYC back in January, I chatted
with some folks at the AMD booth, as well as guys from Penguin Computing
(where we bought our Opteron box).  I was told that the Opteron has a
somewhat unusual design in which the memory is controlled by one CPU.
The net effect is that when both CPUs are running, one might only be
running at around 90% instead of 99%.  The `NUMA' kernel is supposed to
fix this problem.  I wonder if this is related to the performance of the
threaded GOTO lib that you saw.  Has anyone tried the NUMA kernel?

Best,
Andy

> From: Martin Maechler
>
> >>>>> "BDR" == Prof Brian Ripley <ripley@stats.ox.ac.uk>
> >>>>>     on Fri, 27 Feb 2004 18:22:29 +0000 (GMT) writes:
>
>     BDR> On 27 Feb 2004, Douglas Bates wrote:
>     >> Martin Maechler <maechler@stat.math.ethz.ch> writes:
>     >>
>     >> > >>>>> "PD" == Peter Dalgaard <p.dalgaard@biostat.ku.dk>
>     >> > >>>>>     on 26 Feb 2004 15:44:16 +0100 writes:
>     >> >
>     >> >     PD> Douglas Bates <bates@stat.wisc.edu> writes:
>     >> >     >> Have you tried configuring R with Goto's BLAS
>     >> >     >>     http://www.cs.utexas.edu/users/kgoto/
>     >> >     >>
>     >> >     >> I haven't worked with Opteron or Athlon64 computers, but
>     >> >     >> I understand that Goto's BLAS are very effective on those
>     >> >     >> machines.  Furthermore, Goto's BLAS are (only) available
>     >> >     >> as .so libraries, so you don't need to mess with creating
>     >> >     >> the .so version.
>     >> >
>     >> >     PD> I tried it, yes.  Somewhat to my surprise, it seemed to
>     >> >     PD> be not quite as fast as the threaded ATLAS, but I wasn't
>     >> >     PD> very systematic about the benchmarking.
>     >> >
>     >> >     PD> (And the Goto items have license issues, which get in
>     >> >     PD> the way for binary distributions.)
>     >> >
>     >> > Thanks a lot, Peter, Brian and Doug, for your feedback!
>     >> > In the meantime, I have three running versions of R(-devel) on
>     >> > the 64-bit Opteron:
>     >> >   - "plain"
>     >> >   - linked against threaded GOTO
>     >> >   - linked against threaded (static) ATLAS (using -fPIC for
>     >> >     compilation; "large" Rlapack)
>     >> > and I find that GOTO is consistently faster than ATLAS (by
>     >> > ~ 5-20%) in several tests (square matrices; %*% and solve).
>     >> > ATLAS is still an order of magnitude faster than "plain" for
>     >> > 3000x3000 matrices.
>     >>
>     >> Would you be willing to post a brief summary of comparative
>     >> timings?
>     >>
>     >> I have thought at times that it may be worthwhile collecting
>     >> comparative timings for different combinations of processor, OS,
>     >> and memory size and speed on "typical" tasks in R.  As with any
>     >> benchmark, the results will be artificial, but they can be of
>     >> some help when considering what hardware to purchase.
>     >> Bioconductor users may find it particularly helpful to be able to
>     >> evaluate how much they will need to pay to be able to analyze
>     >> large data sets reasonably quickly.
>     >>
>     >> One easily-obtained timing is at the end of
>     >> $RSRC/tests/Examples/base-Ex.Rout after 'make; make check'.
>
>     BDR> That one is, I think, rather too artificial, as it contains few
>     BDR> even moderately large examples and is dominated by a few
>     BDR> atypical tasks.
>
>     BDR> I tend to use the sum of the MASS scripts as an informal
>     BDR> timing; ch06.R is also a pretty good indicator.
>
>     BDR> I think you will find that BLAS differences are pretty small in
>     BDR> real-life analyses, or at least I always have.
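The square-matrix tests Martin mentions (%*% and solve) are easy to
reproduce; here is a minimal sketch, where the size, the seed, and the
single repetition are illustrative rather than his exact setup:

    ## Minimal sketch of the %*% / solve timings mentioned above;
    ## n, the seed and the single repetition are illustrative only.
    set.seed(1)
    n <- 1000                        # Martin used 1000^2 and 3000^2
    A <- matrix(rnorm(n * n), n, n)
    system.time(A %*% A)             # dgemm through the linked BLAS
    system.time(solve(A))            # LU solve through LAPACK/BLAS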
> I've now done a bit more systematic testing, using more realistic code
> than the large-matrix (1000^2 and 3000^2) number crunching I did last
> week.
>
> As expected, the differences disappear for VR/scripts "ch06.R" (there's
> even a slight indication of GOTO being worse than no optimized BLAS,
> but that was probably a random fluctuation), and also for the
> "make check" outputs.
>
> Here is a nice R function that others can use as well for collecting
> the numbers from the "make check" (or better, "make check-all")
> outputs.  Note that it is interesting to also get the times for the
> recommended packages.
>
> #### After "make check-all" there are quite a few files with timings
> #### --------------
> #### Get at these
>
> ## In a Unix shell, it's as simple as
> ##   cd `R RHOME`/tests
> ##   grep '^Time elapsed' *.Rout Examples/*.Rout *.Rcheck/*.Rout
>
> checkTimes <- function(Rhome = R.home())
> {
>     ## Purpose: Collect the "Time elapsed" timings of R's
>     ##          "make check-all" into a numeric N x 3 matrix
>     ##          (with rownames!)
>     ## ------------------------------------------------------------
>     ## Author: Martin Maechler, Date: 1 Mar 2004, 15:27
>
>     tDir <- file.path(Rhome, "tests")
>     dirLs <- c(tDir, file.path(tDir, "Examples"),
>                file.path(tDir, list.files(tDir, pattern = "\\.Rcheck$")))
>     iniStr <- "^Time elapsed:"
>     endPat <- "\\.Rout$"
>     ir <- length(rr <- list())
>     for(d in dirLs) {
>         files <- list.files(d, pattern = endPat)
>         for(f in files) {
>             lls <- readLines(file.path(d, f))
>             if(length(i <- grep(iniStr, lls))) {
>                 tC <- textConnection(sub(iniStr, '', lls[i]))
>                 nCPU <- scan(tC, quiet = TRUE)
>                 close(tC)
>                 f <- sub(endPat, '', f)
>                 rr[[(ir <- ir + 1)]] <- list(f, nCPU[1:3])
>             }
>         }
>     }
>     ## transform the list into a matrix
>     t(matrix(sapply(rr, "[[", 2), 3, length(rr),
>              dimnames = list(NULL, sapply(rr, "[[", 1))))
> }
>
> -----------
>
> Now I measured on the AMD Opteron (64-bit, dual processor; 4 GB RAM):
>
> rM <- checkTimes()
> nn <- nrow(rM)
> ## Look at the values --- in sorted order
> iS <- sort.list(rM[, 1], decreasing = TRUE)
> rM[iS, ]
> plot(rM[iS, 3] / rM[iS, 1])
> ## not looking systematically --> only use "CPU[1]"
> tDir <- file.path(R.home(), "tests")  # (tDir above is local to checkTimes)
> plot(rM[iS, 1], type = 'h', xaxt = "n", xlab = '', ylab = "Time elapsed",
>      main = paste("CPU used for checks in", tDir))
>
> rM.A <- checkTimes("/usr/local/app/R/R-devel-ATLAS-inst")
> rM.G <- checkTimes("/usr/local/app/R/R-devel-GOTO-inst")
> rM.s <- checkTimes("/usr/local/app/R/R-devel-inst")
> iS <- sort.list(rM.A[, 1], decreasing = TRUE)
>
> cbind(ATLAS = rM.A[iS, 1],
>       GOTO  = rM.G[iS, 1],
>       std   = rM.s[iS, 1])
> ## gives
> ##                   ATLAS   GOTO    std
> ## boot-Ex           73.38  73.71  73.62
> ## nlme-Ex           31.92  34.18  31.91
> ## mgcv-Ex           29.20  31.69  29.35
> ## MASS-Ex           21.54  20.49  20.29
> ## stats-Ex          17.80  17.69  17.91
> ## lattice-Ex        11.38  11.37  11.05
> ## methods-Ex         6.87   6.53   6.58
> ## base-Ex            5.48   5.28   5.26
> ## graphics-Ex        4.71   4.73   4.70
> ## tools-Ex           3.86   3.66   3.82
> ## cluster-Ex         3.78   3.74   3.65
> ## utils-Ex           2.73   2.60   2.60
> ## p-r-random-tests   2.60   2.58   2.55
> ## survival-Ex        2.48   2.49   2.30
> ## ...
> ## .........
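One compact way to read this table is as ratios to the plain build; a
small sketch reusing the rM.A, rM.G, rM.s and iS objects defined just
above:

    ## Sketch: per-set times relative to the unoptimized build; values
    ## near 1 mean the optimized BLAS buys nothing for that example set.
    tab <- cbind(ATLAS = rM.A[iS, 1], GOTO = rM.G[iS, 1], std = rM.s[iS, 1])
    round(tab[, c("ATLAS", "GOTO")] / tab[, "std"], 2)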
> ## Graphic:
> pdf("CPU-checks.pdf")
> plot(rM.A[iS, 1], type = 'h', xaxt = "n", xlab = '', ylab = "Time elapsed",
>      main = "AMD Opteron 246: CPU for R 'make check-all' tests & Examples")
> iS. <- iS[1:12]
> text(1:12, rM.A[iS., 1], rownames(rM.A)[iS.],
>      adj = c(-.15, -.15), cex = 0.8)
> points(1:nn + .1, rM.G[iS, 1], type = 'h', col = 2)
> points(1:nn + .2, rM.s[iS, 1], type = 'h', col = 3)
> legend(par("usr")[2], par("usr")[4], c("ATLAS", "GOTO", " std "),
>        col = 1:3, lwd = 1, xjust = 1.1, yjust = 1.1)
> if(.Device == "pdf") dev.off()
>
> ### Are ATLAS or GOTO better than "standard"?
> matplot(1:nn, cbind(rM.A[iS, 1] / rM.s[iS, 1],
>                     rM.G[iS, 1] / rM.s[iS, 1]), type = 'p', col = 1:8)
> abline(h = 1, lty = 3, col = "gray")
> ## to the contrary!  the points would have to be *below* 1,
> ## and they are rather above it
>
> -------------------
>
> The PDF graphic is available as
>     ftp://ftp.stat.math.ethz.ch/U/maechler/R/CPU-checks.pdf
>
> ---
>
> However, when I run something like the following "non-small" lm()
> problem,
>
> ------------------------------------------------------------------
>
> ### Take a relatively large model.matrix() --- as in ./predict-lm.R
> ### "R BATCH --vanilla <this>"
>
> if(paste(R.version$major, R.version$minor, sep = ".") >= 1.7)
>     RNGversion("1.6")
> set.seed(47)
>
> ## Here: want the usual "noisy" model; almost no printing
> n <- 5000
> x <- rnorm(n)
> ldat <-
>     data.frame(x1 = x,
>                x2 = sort(5*x - rnorm(n)),
>                f1 = factor(pmin(12, rpois(n, lam = 5))),
>                f2 = factor(pmin(20, rpois(n, lam = 9))),
>                f3 = factor(pmin(32, rpois(n, lam = 12))))
> with(ldat,
>      ldat$y <<- 10 + 4*x1 + 2*x2 + rnorm(n) +
>          ## no rounding here:
>          10 * rnorm(nlevels(f1))[f1] +
>          100 * rnorm(nlevels(f2))[f2])
> str(ldat)
>
> mylm <- lm(y ~ .^2, data = ldat)
> proc.time()  ## (~= 100 sec on P4 1.6 GHz "lynne")
> str(mm <- model.matrix(mylm))
> smlm <- summary(mylm)
>
> p1 <- predict(mylm)
> p2 <- predict(mylm, type = "terms")
>
> str(myim <- influence.measures(mylm))
>
> ## R BATCH gives another "total" proc.time() here:
>
> ------------------------------------------------------------------
>
> things look a bit different.
>
> Timings (the first 3 numbers of proc.time());
> ATLAS measured only 3x, the others 5x:
>
> 1. after lm()   # grep -n '^\[1\] [^1]' lm-tst-2.Rout-opteron-*
>
> ATLAS:
>   34.56  0.56  35.57
>   33.90  0.59  34.57
>   34.55  0.61  35.33
> GOTO:
>   28.17  1.82  34.68
>   29.13  1.61  35.56
>   26.90  2.05  32.99
>   28.11  1.83  34.64
>   28.26  1.92  34.90
> std:
>   34.61  0.62  35.62
>   33.46  0.61  34.26
>   34.79  0.65  35.58
>   33.78  0.67  34.62
>   35.49  0.70  36.37
>
> 2. total for the above R script   # grep -n '^\[1\] 1' lm-tst-2.Rout-opteron-*
>
> ATLAS:
>   127.71   1.56  129.92
>   130.42   1.66  132.28
>   131.89   1.39  133.57
> GOTO:
>   129.51  25.17  212.02
>   129.56  26.93  215.06
>   137.36  27.43  221.95
>   139.83  28.76  226.64
>   137.40  27.98  221.86
> std:
>   159.58   1.59  161.88
>   155.65   1.48  157.59
>   159.01   1.67  161.21
>   167.13   1.57  168.97
>   166.70   1.58  168.70
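A reminder of what the three numbers measure helps when reading these
tables: proc.time()'s first element is user CPU time, summed over all
threads, while the third is wall-clock time, so the two can diverge
badly under a threaded BLAS.  A small illustration (matrix size is
arbitrary):

    ## proc.time() returns c(user, system, elapsed).  With a threaded
    ## BLAS, "user" accumulates CPU seconds across all threads, while
    ## "elapsed" is wall-clock time, so the two can differ markedly.
    pt0 <- proc.time()
    A <- matrix(rnorm(1000^2), 1000)   # arbitrary illustrative size
    invisible(A %*% A)
    print(proc.time() - pt0)           # compare [1] user with [3] elapsed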
> This is a bit confusing to me: the picture differs considerably
> depending on whether I "believe" the first number of proc.time(),
> say PT[1], or the third one, PT[3].
>
> Only using PT[1] --- which is what I have usually done --- may be
> quite wrong here.  Contrary to ATLAS and "std", GOTO shows a large
> difference between PT[3] and PT[1], which may be a consequence of the
> way the threading and the use of the two CPUs happen:
>
> PT[1]:
>   GOTO is about 20% faster than ATLAS (which is basically the same as
>   "standard", i.e. the R-internal BLAS/LAPACK) for the first lm()
>   measurement, but for the overall time {which adds summary.lm(),
>   influence.measures() etc.} GOTO and ATLAS are basically the same
>   speed, both 20% faster than "standard".
>
> PT[3]:
>   For the lm() part itself: no difference.
>   For the total: ATLAS >> std >> GOTO
>                  (where ' >> ' := "clearly better than").
>
> ---
>
> Comments welcome,
>
> Martin Maechler <maechler@stat.math.ethz.ch>
> http://stat.ethz.ch/~maechler/
> Seminar fuer Statistik, ETH-Zentrum  LEO C16, Leonhardstr. 27
> ETH (Federal Inst. Technology)  8092 Zurich  SWITZERLAND
> phone: x-41-1-632-3408  fax: ...-1228  <><
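For anyone who wants to try Brian Ripley's informal MASS-scripts timing
from the top of the thread, a hedged sketch; the scripts directory is
located via system.file(), assuming the MASS package is installed:

    ## Hedged sketch: time the ch06.R script that Ripley uses as an
    ## informal benchmark (assumes MASS is installed; path may vary).
    scr <- file.path(system.file("scripts", package = "MASS"), "ch06.R")
    system.time(source(scr, echo = FALSE))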
Peter Dalgaard
2004-Mar-02 16:21 UTC
[Rd] Some timings for 64 bit Opteron (ATLAS, GOTO, std)
"Liaw, Andy" <andy_liaw@merck.com> writes:> Hi Martin, > > When I attended the LinuxWorld Expo in NYC back in January, I chatted with > some folks at the AMD booth, as well as guys from Penguin Computing (where > we bought our Opteron box). I was told that the Operton has this somewhat > strange setup that the memory is controlled by one CPU. The net effect of > this being that when both CPUs are running, one might only be running at > around 90% instead of 99%. The `NUMA' kernel is supposed to fix this > problem. I wonder if this is related to the performance of the threaded > GOTO lib that you saw. Has anyone tried the NUMA kernel? > > Best, > AndyMy understanding is slightly different (I could be wrong though, I'm hardly a hardware engineer): Each CPU controls one block of memory, and only some motherboard have memory slots for both CPUs. If CPU2 wants to talk to CPU1's memory it has to ask CPU1 for it, with the obvious potential for a performance hit. I'll see if I can get around to redoing my Opteron builds and trying Martin's benchmarks in the next couple of days. -p -- O__ ---- Peter Dalgaard Blegdamsvej 3 c/ /'_ --- Dept. of Biostatistics 2200 Cph. N (*) \(*) -- University of Copenhagen Denmark Ph: (+45) 35327918 ~~~~~~~~~~ - (p.dalgaard@biostat.ku.dk) FAX: (+45) 35327907
My understanding is far below `imperfect'!  However, our box has all 8
slots filled with 2GB PC2100s...  not a whole lot of room for change.

Best,
Andy

> From: rossini@blindglobe.net [mailto:rossini@blindglobe.net]
>
> And my understanding is completely imperfect, but note that the
> organization (and selection) of memory modules on the motherboard can
> be critical for speed with the Opterons; this is a well-known issue
> on the Beowulf lists.
>
> best,
> -tony
>
> Peter Dalgaard <p.dalgaard@biostat.ku.dk> writes:
>
> > My understanding is slightly different (I could be wrong, though;
> > I'm hardly a hardware engineer): each CPU controls one block of
> > memory, and only some motherboards have memory slots for both CPUs.
> > If CPU2 wants to talk to CPU1's memory, it has to ask CPU1 for it,
> > with the obvious potential for a performance hit.
> >
> > I'll see if I can get around to redoing my Opteron builds and
> > trying Martin's benchmarks in the next couple of days.
> >
> >    -p
>
> --
> rossini@u.washington.edu
> http://www.analytics.washington.edu/
> Biomedical and Health Informatics, University of Washington
> Biostatistics, SCHARP/HVTN, Fred Hutchinson Cancer Research Center
> UW    (Tu/Th/F): 206-616-7630  FAX=206-543-3461 | Voicemail is unreliable
> FHCRC (M/W):     206-667-7025  FAX=206-667-4812 | use Email
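If the extra wall-clock time under GOTO comes from threads contending
for one CPU's memory bank, a quick diagnostic is to pin the library to
a single thread and re-run the lm script.  A hedged sketch; the
environment variable name is taken from the GotoBLAS documentation of
the time and should be verified for your build:

    ## Hedged sketch: force the threaded BLAS to one thread before
    ## starting R (variable name per GotoBLAS docs; verify for your
    ## build), e.g. in the shell:
    ##     GOTO_NUM_THREADS=1 R BATCH --vanilla lm-tst-2.R
    Sys.getenv("GOTO_NUM_THREADS")   # what the current session sees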