Andrew Crane-Droesch
2015-Dec-30 17:36 UTC
[R] Thread parallelism and memory management on shared-memory supercomputers
I've got allocations on a couple of shared-memory supercomputers, which I use to run computationally intensive scripts on multiple cores of the same node. I've got 24 cores on one and 48 on the other.

In both cases there is a hard memory limit, which is shared among the cores on the node. On the latter machine the limit is 255 GB; if my job requests more than that, it gets aborted.

Now, I don't fully understand resource allocation on these sorts of systems, but I do get that the sort of "thread parallelism" done by e.g. the `parallel` package in R isn't identical to the sort of parallelism commonly done in lower-level languages. For example, when I request a node, I ask for only one of its cores. My R script then detects the number of cores on the node and farms out tasks to them via the `foreach` package. My understanding is that lower-level languages need the number of cores to be specified in the shell script, with a particular job script given directly to each worker.

My problem is that my parallel-calling R script is getting my job killed on the cluster, because the sum of the memory requested by the workers exceeds my allocation. I don't get this problem when running on my laptop's 4 cores, presumably because my laptop has a higher memory-to-core ratio.

My question: how can I ensure that the total memory requested by N workers stays below a certain threshold? Is this even possible? If not, is it possible to benchmark a process locally, collect the maximum per-worker memory requested, and use that to back out the number of workers I can request given a node's memory limit?

Thanks in advance!
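For concreteness, a minimal sketch of the setup described above, assuming the standard parallel/doParallel/foreach packages; per_task_fun and mem_per_worker_gb are hypothetical placeholders, and 255 GB is the node limit mentioned in the post:

    library(parallel)
    library(doParallel)
    library(foreach)

    ## Hypothetical stand-in for one task's work; replace with the real workload.
    per_task_fun <- function(i) data.frame(task = i, result = sqrt(i))

    node_mem_gb       <- 255   # hard memory limit on the node
    mem_per_worker_gb <- 20    # assumed per-worker peak, measured locally
    n_cores           <- detectCores()

    ## Use no more workers than the core count or the memory budget allows.
    n_workers <- min(n_cores, floor(node_mem_gb / mem_per_worker_gb))
    registerDoParallel(cores = n_workers)

    results <- foreach(i = 1:100, .combine = rbind) %dopar% per_task_fun(i)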
Peter Langfelder
2015-Dec-30 18:44 UTC
[R] Thread parallelism and memory management on shared-memory supercomputers
I'm not really an expert, but here are my 2 cents:

To the best of my limited knowledge, there is no direct way of ensuring that the total memory requested by N workers remains below a certain threshold. You can, however, control the number of child processes forked by foreach/doParallel via the 'cores' argument of the registerDoParallel call.

The parallel computation implemented in parallel and foreach/doParallel uses process forking (at least it did last time I checked). When a process is forked, the entire memory of its parent is "forked" as well (not sure what the right term is). This does not mean a real copy (modern systems use copy-on-write), but for the OS's memory-management purposes each child occupies as much memory as the parent.

If you want to benchmark your memory usage, run a single (non-forked) process and, at the end, look at the output of gc(), which gives you, among other things, the maximum memory used. For more detailed information on memory usage, you can run Rprof, tracemem, or Rprofmem; see their help pages for details.

To decrease memory usage, you will have to optimize your code and perhaps sprinkle in garbage-collection (gc()) calls after large object manipulations. Just be aware that garbage collection is rather slow, so you don't want to do it too often.

The difference between the cluster and your laptop may be that on the laptop the system doesn't care so much about how much memory each child uses, so you can fork a process with a large memory footprint as long as you don't cause copying by modifying large chunks of memory.

HTH,

Peter
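As a rough illustration of the benchmarking approach described above and of the 'cores' argument to registerDoParallel -- a sketch only: the task function and the 0.8 safety factor are assumptions, and the last column of gc()'s output is the "max used" figure in Mb:

    library(parallel)
    library(doParallel)

    ## Hypothetical stand-in for one worker's task; replace with the real workload.
    per_task_fun <- function(i) {
      x <- matrix(rnorm(1e6), ncol = 100)
      colMeans(x)
    }

    ## 1. Benchmark a single, non-forked run.
    gc(reset = TRUE)                            # reset the "max used" counters
    invisible(per_task_fun(1))
    mem_stats   <- gc()                         # matrix; last column is "max used" (Mb)
    max_used_mb <- sum(mem_stats[, ncol(mem_stats)])

    ## 2. Back out a worker count that fits under the node's 255 GB limit,
    ##    leaving headroom for the master process and copy-on-write churn.
    node_limit_mb <- 255 * 1024
    headroom      <- 0.8                        # assumed safety factor
    n_workers <- min(detectCores(),
                     floor(node_limit_mb * headroom / max_used_mb))

    registerDoParallel(cores = n_workers)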