Steve_Friedman at nps.gov
2009-Jul-31 13:22 UTC
[R] Preparing for multi-core CPUs and parallel processing applications
Hello,

I am fortunate (or in really big trouble) in that the research group I work with will soon be receiving several high-end dual quad-core machines. We will use the Ubuntu OS on these. We intend to use this cluster for some extensive modeling applications. Our programming guru has demonstrated the ability to link much simpler machines to share CPUs, and we purchased the new ones to take advantage of this option. We have also begun exploring the R CUDA and J CUDA functionality to push processing to the graphics processor (GPU), which greatly speeds up the numerical work.

My questions to this group:

1) Which packages are suitable for parallel processing applications in R?

2) Are these packages ready for prime-time applications, or are they still developmental at this time?

3) Are we better off working in Java or C++ for the majority of this simulation work and linking to R for the statistical analysis?

4) What are the pitfalls, if any, that I need to be aware of?

5) Can we take advantage of sharing the graphics processor, via R CUDA, in a parallel distributed shared cluster of dedicated machines?

6) Our statistical analysis and modeling applications address very large geographic issues. We generally work with 30-40 years of daily time-step data in a gridded format. The grid is approximately 250 x 400 cells in extent, each cell representing approximately 500 meters x 500 meters. To this we add a very large suite of ancillary information, both spatial and non-spatial, to simulate a variety of ecological state conditions. My question is: is this too large for R, given its use of memory?

7) I currently have a laptop running Ubuntu with R version 2.6.2 (2008-02-08). What is the most recent R version for Ubuntu, and what is the installation procedure?

These are just the initial questions that I'm sure to have. If these are being directed to the wrong help pages, I'm sorry to have taken your time. If you would be so kind as to direct me to a more appropriate help site, I'd appreciate your assistance.

Thanks in advance,
Steve

Steve Friedman Ph.D.
Spatial Statistical Analyst
Everglades and Dry Tortugas National Park
950 N Krome Ave (3rd Floor)
Homestead, Florida 33034

Steve_Friedman at nps.gov
Office (305) 224-4282
Fax (305) 224-4147
Martin Morgan
2009-Jul-31 14:00 UTC
[R] Preparing for multi-core CPUs and parallel processing applications
Hi Steve --

Steve_Friedman at nps.gov wrote:

> My question(s) to this group:

Last question first: the R-sig-hpc list might be more appropriate for an extended discussion,

  https://stat.ethz.ch/mailman/listinfo/r-sig-hpc

See also the HighPerformanceComputing task view,

  http://cran.fhcrc.org/web/views/HighPerformanceComputing.html

> 1) Which packages are suitable for parallel processing applications in R?
> 2) Are these packages ready for prime-time applications, or are they still developmental at this time?

I use Rmpi for all my parallel computing, but if I had more time I'd explore multicore for more efficient use of several CPUs on a single machine, and the new offerings from Revolution Computing. If there were significant portions of C code, I'd look into using OpenMP (as done in the pnmath library), and into a parallel BLAS / LAPACK library if that is where the significant computation occurs.

> 3) Are we better off working in Java or C++ for the majority of this simulation work and linking to R for the statistical analysis?
> 4) What are the pitfalls, if any, that I need to be aware of?

With multiple cores, it's important to remember that the machine's memory is divided among the CPUs, so a huge-sounding 32 GB, 8-core machine has 'only' 4 GB per CPU when independent R processes are allocated to each core (as is the style with Rmpi).

> 5) Can we take advantage of sharing the graphics processor, via R CUDA, in a parallel distributed shared cluster of dedicated machines?
> 6) We generally work with 30-40 years of daily time-step data in a gridded format. The grid is approximately 250 x 400 cells in extent [...] is this too large for R, given its use of memory?

Depending on the application, large data sets can often be managed effectively on disk, e.g., by using the ncdf package (for large numeric data) or a database (R includes SQLite, for instance), and analyzing independent 'slices'. This fits well with common parallel computing paradigms.
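To put your grid in perspective, here is a back-of-the-envelope figure, assuming one double-precision value per cell per day over the full record:

  cells <- 250 * 400          # ~100,000 grid cells
  days  <- 35 * 365           # roughly 35 years of daily time steps
  (cells * days * 8) / 2^30   # 8 bytes per double: about 9.5 GiB per variable

So a single variable over the full record is on the order of 10 GB -- already more than the ~4 GB available to each of 8 independent R processes on a 32 GB machine, which is another reason to work with slices from disk rather than whole arrays in memory.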
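As a rough sketch of the 'slices' idea: one daily layer at a time could be read from a NetCDF file and summarized in parallel with the multicore package. The file name, variable name ('precip'), and the summary function below are purely illustrative:

  library(ncdf)        # read slices of a large NetCDF file from disk
  library(multicore)   # fork one R process per core on a single machine

  ## hypothetical file holding a 250 x 400 x ndays array of daily values
  ndays <- 35 * 365    # or query the file's time dimension

  summarizeDay <- function(i) {
      nc <- open.ncdf("daily_grid.nc")      # each worker opens its own handle
      on.exit(close.ncdf(nc))
      ## read a single daily 250 x 400 slice rather than the whole array
      slab <- get.var.ncdf(nc, "precip",
                           start = c(1, 1, i), count = c(250, 400, 1))
      mean(slab, na.rm = TRUE)              # stand-in for the real analysis
  }

  ## e.g. eight days at a time on a dual quad-core machine
  dailyMeans <- unlist(mclapply(seq_len(ndays), summarizeDay, mc.cores = 8))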
Dirk Eddelbuettel
2009-Jul-31 14:26 UTC
[R] Preparing for multi-core CPUs and parallel processing applications
Steve,

Martin already mentioned r-sig-hpc and the HPC task view for the bulk of your questions. The Schmidberger et al. paper (linked from the Task View) should address a few of your questions. Just two more quick add-ons:

On 31 July 2009 at 09:22, Steve_Friedman at nps.gov wrote:
| 5) Can we take advantage of sharing the graphics processor, via R CUDA, in a
| parallel distributed shared cluster of dedicated machines?

Besides the somewhat exploratory package 'gputools' from U Mich (linked from the Task View), there is no 'R CUDA' yet. A tiny illustration of the sort of thing gputools offers follows at the end of this message.

| 7) I currently have a laptop running Ubuntu with R version 2.6.2
| (2008-02-08). What is the most recent R version for Ubuntu, and what is the
| installation procedure?

The newest is R 2.9.1; see http://cran.r-project.org/bin/linux/ubuntu which explains things in more detail. For further questions, the r-sig-debian list is for Debian and Ubuntu specific questions. Debian and Ubuntu do have good support for Rmpi etc.

| These are just the initial questions that I'm sure to have. If these are
| being directed to the wrong help pages, I'm sorry to have taken your time.
| If you would be so kind as to direct me to a more appropriate help site,
| I'd appreciate your assistance.

There are 'special interest group' mailing lists for HPC (see above), for Debian/Ubuntu, and for geographic / spatial modelling.

Hth, Dirk

--
Three out of two people have difficulties with fractions.
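A minimal sketch of offloading a matrix product to the graphics card with gputools; the gpuMatMult() call follows the package's documented interface, and the matrix sizes are only meant as an illustration:

  library(gputools)    # exploratory CUDA bindings mentioned above

  set.seed(1)
  a <- matrix(rnorm(1000 * 1000), 1000, 1000)
  b <- matrix(rnorm(1000 * 1000), 1000, 1000)

  system.time(a %*% b)           # ordinary BLAS on the CPU
  system.time(gpuMatMult(a, b))  # same product pushed to the graphics card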