jeffc
2009-Nov-07 16:17 UTC
[R] solution design for a large scale (> 50G) R computing problem
Hi,

I am tackling a computing problem in R that involves large data, so both time and memory need to be considered seriously. Below is the problem description and my tentative approach; I would appreciate it if anyone could share thoughts on how to solve this problem more efficiently.

I have 1001 multidimensional arrays -- A, B1, ..., B1000. A takes about 500MB in memory and each B_i takes about 100MB. I need to run an experiment that evaluates a function f(A, B_i) for every B_i. f(A, B_i) does not modify A or B_i during its evaluation, and the evaluations are independent across i. I also need to design various evaluation functions, so this kind of experiment has to be performed often.

My computing environment is a 64-bit Linux PC with 64GB of memory and 8 cores. My goal is to run multiple experiments quickly on the existing equipment.

One possible approach is to run an R process that loads A and then use a parallel library such as foreach with multicore to load each B_i and compute f(A, B_i). The problems with this approach are that each time foreach spawns a new worker process it has to 1) copy the whole A array and 2) load B_i from disk into memory.

Since f(A, B_i) does not modify A or B_i, would it be possible in R to 1) share A across the worker processes and 2) use memory-mapped files to load the B_i (and perhaps A as well, right at the beginning)?

Any suggestions would be appreciated.

Jeff
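Below is a minimal sketch of one way to get both properties on Linux, assuming the arrays are stored as .rds files and using the fork-based mclapply (provided by the multicore package at the time, now part of the parallel package). Because the workers are forked from the master process, they see A through copy-on-write memory, so A is not physically duplicated as long as f() only reads it. The file names and the body of f() here are placeholders, not the real experiment.

library(parallel)   # mclapply() lives here now; the old 'multicore' package worked the same way

## Load the large array once, in the master process.
## "A.rds" and the "B%04d.rds" pattern are assumed file names for illustration.
A <- readRDS("A.rds")

## Placeholder for the real evaluation function; it only reads its arguments.
f <- function(A, B) sum(A) * mean(B)

b_files <- sprintf("B%04d.rds", 1:1000)

## mclapply() forks one worker per task (up to mc.cores). Each worker
## inherits A from the master via copy-on-write, so A is shared rather
## than copied; only its own B_i is read from disk inside the worker.
results <- mclapply(b_files, function(path) {
  B <- readRDS(path)
  f(A, B)
}, mc.cores = 8)

For the fully memory-mapped route, the bigmemory package (filebacked.big.matrix() plus attach.big.matrix() on a descriptor file) or the ff package can keep A and the B_i in file-backed objects that several processes attach to, though that requires representing the data in those packages' formats (bigmemory, for instance, handles numeric matrices rather than arbitrary multidimensional arrays).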