Paul Johnson
2010-Aug-19 18:49 UTC
[Rd] Can you share a working example of R program aided by fast BLAS?
Can one of you give me an R program that displays the benefits of an accelerated BLAS in R? Here's why I ask, in case you wonder:

On a Linux cluster, I've hit some bumps in the road. The worst one by far was that I installed R, then GotoBLAS2 with default settings, and after that, jobs using Rmpi were *really* *really* slow. I mean horrible. If a job took 15 minutes when run by itself, outside of MPI, it took 1 full day when run inside MPI. Literally the same job.

I learned later that GotoBLAS2 defaults to allowing as many threads as there are cores, and that the threads are not compatible with MPI. This latter point is not clearly stated in the GotoBLAS2 documents, so far as I can find, but after I realized that was the problem, I did find one other cluster website that mentioned the same problem. "If your application uses GotoBLAS and all cores as MPI threads, setting GOTO_NUM_THREADS larger than one will usually result in drastically slower performance." (http://hpc.uark.edu/hpc/support/software/numerical.html#gotoblas). The GotoBLAS2 documentation warns of weird thread-related delays, but it implies that the slowdown--if it happens--is a result of bad user code, rather than this more fundamental mismatch between OpenMPI (or MPI in general) and GotoBLAS2.

In the process of diagnosing the big slowdown, I've been making many time comparisons. When I installed GotoBLAS2 in the first place, it was because so many people (and the R admin manual) said that R's ordinary BLAS is rudimentary/slow. In the test cases I've tried, R's BLAS is not that bad. In fact, in the test programs we run, the time is not substantially different with GotoBLAS2 and R's BLAS. I also compared the Intel Math Kernel Library BLAS and didn't notice a huge difference. So, well, I think that means I'm running bad test cases for R and GotoBLAS2.

Oh, and one more thing.
I have not been able to find an example R program that benefited at all from allowing threads > 1 in the GotoBLAS2 environment settings. In fact, if a one-thread job takes 15 minutes, the one that allows 2 or more threads takes 21 minutes. And the more threads allowed, the longer the job takes. This is literally the same job, same cluster node; the only difference is changing the environment variable that adjusts the GotoBLAS2 threads allowed.

So if you know whether your example depends on threads or not, I would appreciate the warning.

pj

--
Paul E. Johnson
Professor, Political Science
1541 Lilac Lane, Room 504
University of Kansas
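For anyone who wants to repeat this kind of comparison, here is a minimal timing sketch. It is not the test program from the post above; the matrix size, the repetition count, and the helper name `time_it` are all made up for illustration. It times a few dense linear algebra calls that R passes to the BLAS/LAPACK it is linked against, and prints the GOTO_NUM_THREADS value the session inherited. As far as I know the thread count should be set in the shell before R starts, so treat changing it from inside R as unreliable.

```r
# Timing sketch for dense linear algebra, the kind of work a fast BLAS targets.
# n and reps are hypothetical choices; raise n on a fast machine.
set.seed(1)
n <- 500
A <- matrix(rnorm(n * n), n, n)

# Hypothetical helper: run f() several times, keep the median elapsed time.
time_it <- function(f, reps = 3) {
  times <- vapply(seq_len(reps), function(i) {
    system.time(f())[["elapsed"]]
  }, numeric(1))
  median(times)
}

timings <- c(
  matmul    = time_it(function() A %*% A),       # dense matrix product
  crossprod = time_it(function() crossprod(A)),  # t(A) %*% A
  solve     = time_it(function() solve(A))       # matrix inverse via LAPACK
)

# Report the thread setting this R session inherited (NA if it is unset).
threads <- Sys.getenv("GOTO_NUM_THREADS", unset = NA)
cat("GOTO_NUM_THREADS:", threads, "\n")
print(round(timings, 3))
```

Running the same script under each BLAS build, and again with different GOTO_NUM_THREADS values exported in the shell, gives numbers that can be compared directly.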
Allan Engelhardt
2010-Aug-26 18:42 UTC
[Rd] Can you share a working example of R program aided by fast BLAS?
On 19/08/10 19:49, Paul Johnson wrote:
> Can one of you give me an R program that displays the benefits an
> accelerated BLAS in R?

I thought the standard benchmark was (the somewhat artificial) R-benchmark-25.R from http://r.research.att.com/benchmarks/R-benchmark-25.R . I have some examples using this to show the benefit of an external (optimized, multi-threaded) BLAS at http://www.cybaea.net/Blogs/Data/Faster-R-through-better-BLAS.html .

(But I have also seen R run slower with external BLAS, e.g. with Radford Neal's benchmarks in a recent post here on "Speeding up matrix multiplies". YMMV.)

Allan
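One possible reason a test program shows no gain from a faster BLAS is that it spends little of its time in BLAS routines. That is an assumption about the jobs discussed in this thread, not something either post confirms. The sketch below contrasts a workload that goes through the BLAS with one that never calls it; the sizes and the function names `blas_bound` and `interpreter_bound` are hypothetical.

```r
# Two workloads: only the first one is executed by the BLAS.
set.seed(2)
n <- 400
X <- matrix(runif(n * n), n, n)

blas_bound <- function() {
  # Dense matrix product, handled by the BLAS that R is linked against.
  sum(X %*% X)
}

interpreter_bound <- function() {
  # Scalar loop, handled by the R interpreter; the BLAS is never called.
  total <- 0
  for (i in seq_len(200000)) total <- total + sqrt(i)
  total
}

elapsed <- function(f) system.time(f())[["elapsed"]]

result <- c(blas_bound = elapsed(blas_bound),
            interpreter_bound = elapsed(interpreter_bound))
print(result)
```

Swapping the BLAS should only be able to move the first number. A job dominated by code like the second function would be expected to take about the same time under any BLAS.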