I am using R with Bioconductor to perform analyses on large datasets using bootstrap methods. In an attempt to speed up my work, I have inquired about using our local supercomputer and asked the administrator whether he thought R would run faster on our parallel network. I received the following reply:

"The second benefit is that the processors have large caches. Briefly, everything is loaded into cache before going into the processor. With large caches, there is less movement of data between memory and cache, and this can save quite a bit of time. Indeed, when programmers optimize code they usually think about how to keep data in cache as long as possible.

Whether you would receive any benefit from larger cache depends on how R is written. If it's written such that data remain in cache, the speed-up could be considerable, but I have no way to predict it."

My question is, "is R written such that data remain in cache?"

Thanks,

Mark W. Kimpel MD
Indiana University School of Medicine
In general, R is not written in such a way that data remain in cache. However, R can use optimized BLAS libraries, and these are. So if your version of R is compiled to use an optimized BLAS library appropriate to the machine (e.g., ATLAS, or Prof. Goto's BLAS), AND a considerable amount of the computation in your R program involves basic linear algebra (matrix multiplication, etc.), then you might see a good speedup.

-- Tony Plate
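One rough way to gauge whether a given R build benefits is to time a BLAS-bound operation and compare it across builds (reference BLAS vs. ATLAS/Goto). A minimal sketch; the matrix size here is an arbitrary choice for illustration:

## Time a matrix multiplication, which is handled by the BLAS.
## An optimized BLAS mainly pays off on operations like %*%,
## crossprod(), and solve(); it does not speed up ordinary R code.
n <- 1000
x <- matrix(rnorm(n * n), n, n)
system.time(x %*% x)    # compare this timing across R builds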
On 10/10/05 3:54 PM, "Kimpel, Mark William" <mkimpel at iupui.edu> wrote:
> My question is, "is R written such that data remain in cache?"

Using the cluster model (which may or may not be what you are calling a supercomputer; I don't know the exact terminology here), jobs that involve repetitive, independent tasks, such as computing statistics on bootstrap replicates, can benefit from parallelization IF the I/O associated with running a single replicate does not outweigh the benefit of using multiple processors.

For example, if you are running 10,000 replicates and each takes 1 ms, you have a 10-second job on a single processor. One could envision spreading that same work over 1000 processors and finishing in 10 ms, but if one counts the I/O (network, moving into cache, etc.), which could take 1 second per batch of replicates (for example), that overhead can dominate and the 1000-processor job may be no faster than the serial one. However, if the same computation takes 1 second per replicate, the whole job takes 10,000 seconds on a single processor but only about 11 seconds on 1000 processors (10 replicates per processor at 1 second each, plus roughly 1 second of overhead). This rationale is only approximate, but I hope it makes the point.

We have begun to use a 60-node Linux cluster for some of our work (also microarray-based) and use MPI/snow with very nice results for multiple independent, long-running tasks. snow is VERY easy to use, but one could also drop down to Rmpi if needed, to have finer-grained control over the parallelization process.

As for how caching behavior comes into it, and how R would perform without parallelized R code, I can't really comment; my experience is limited to the cluster model with parallelized R code.

Sean
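For concreteness, here is a minimal sketch of the snow approach for independent bootstrap replicates. The data, the 4-worker socket cluster, and the replicate count are all illustrative; on an MPI cluster one would use type = "MPI", which requires the Rmpi package.

library(snow)

## Illustrative data; in practice this would be the real dataset.
x <- rnorm(1000)

## Start 4 local socket workers (use type = "MPI" on an MPI cluster).
cl <- makeCluster(4, type = "SOCK")

## Ship the data to every worker once, then farm out the independent
## replicates; each call returns one bootstrap mean.
clusterExport(cl, "x")
boot.means <- unlist(parLapply(cl, 1:10000, function(i)
    mean(sample(x, replace = TRUE))))

stopCluster(cl)

The key property is that each replicate is independent, so the only communication is shipping the data out once and collecting the results back, which is exactly the regime where the overhead arithmetic above works in your favor.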