Dear R-devel,

I am running mclapply with many iterations over a function that modifies nothing and makes no copies of anything. It is taking up a lot of memory, so it seems to me like this is a bug. Should I post this to bugs.r-project.org?

A minimal reproducible example can be obtained by first starting a memory monitoring program such as htop, and then executing the following code while looking at how much memory is being used by the system:

    library(parallel)
    seconds <- 5
    N <- 100000
    result.list <- mclapply(1:N, function(i)Sys.sleep(1/N*seconds))

On my system, memory usage goes up about 60MB on this example. But it does not go up at all if I change mclapply to lapply. Is this a bug?

For a more detailed discussion with a figure that shows that the memory overhead is linear in N, please see
https://github.com/tdhock/mclapply-memory

> sessionInfo()
R version 3.2.2 (2015-08-14)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: Ubuntu precise (12.04.5 LTS)

locale:
 [1] LC_CTYPE=en_CA.UTF-8       LC_NUMERIC=C
 [3] LC_TIME=en_US.UTF-8        LC_COLLATE=en_CA.UTF-8
 [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_CA.UTF-8
 [7] LC_PAPER=en_US.UTF-8       LC_NAME=C
 [9] LC_ADDRESS=C               LC_TELEPHONE=C
[11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C

attached base packages:
[1] parallel  graphics  utils     datasets  stats     grDevices methods
[8] base

other attached packages:
[1] ggplot2_1.0.1      RColorBrewer_1.0-5 lattice_0.20-33

loaded via a namespace (and not attached):
 [1] Rcpp_0.11.6            digest_0.6.4           MASS_7.3-43
 [4] grid_3.2.2             plyr_1.8.1             gtable_0.1.2
 [7] scales_0.2.3           reshape2_1.2.2         proto_1.0.0
[10] labeling_0.2           tools_3.2.2            stringr_0.6.2
[13] dichromat_2.0-0        munsell_0.4.2          PeakSegJoint_2015.08.06
[16] compiler_3.2.2         colorspace_1.2-4
Well, it's only a leak if you don't get the memory back after it returns, right?

Anyway, one (untested by me) possibility is the copying of memory pages when the garbage collector touches objects, as pointed out by Radford Neal here:
http://r.789695.n4.nabble.com/Re-R-devel-Digest-Vol-149-Issue-22-td4710367.html

If so, I don't think this would be easily avoidable, but there may be mitigation strategies.

~G

On Wed, Sep 2, 2015 at 10:12 AM, Toby Hocking <tdhock5 at gmail.com> wrote:
> I am running mclapply with many iterations over a function that modifies
> nothing and makes no copies of anything. It is taking up a lot of memory,
> so it seems to me like this is a bug.
> [...]

--
Gabriel Becker, PhD
Computational Biologist
Bioinformatics and Computational Biology
Genentech, Inc.
Right, it is not a memory leak; sorry for the misleading subject line.

The problem is that memory usage goes up linearly with the length of the first argument to mclapply. In practice, with large data sets, this can cause the machine to start swapping, or cause my cluster jobs to be killed for using too much memory.

On Wed, Sep 2, 2015 at 2:35 PM, Gabriel Becker <gmbecker at ucdavis.edu> wrote:
> Well it's only a leak if you don't get the memory back after it returns,
> right?
> [...]
Toby,

On Sep 2, 2015, at 1:12 PM, Toby Hocking <tdhock5 at gmail.com> wrote:
> library(parallel)
> seconds <- 5
> N <- 100000
> result.list <- mclapply(1:N, function(i)Sys.sleep(1/N*seconds))
>
> On my system, memory usage goes up about 60MB on this example. But it does
> not go up at all if I change mclapply to lapply. Is this a bug?
>
> For a more detailed discussion with a figure that shows that the memory
> overhead is linear in N, please see
> https://github.com/tdhock/mclapply-memory

I'm not quite sure what is supposed to be the issue here. One would expect the memory used to be linear in the number of elements you process - by definition of the task, since you'll be creating linearly many more objects.

Also, using top doesn't actually measure the memory used by R itself (see FAQ 7.42).

That said, I re-ran your script and it didn't look anything like what you have on your webpage. For the NULL result, you end up dealing with all the objects you create in your test, which overshadow any memory usage, and it stabilizes after garbage collection. As you would expect, any output of top is essentially bogus up to a gc. How much memory R will use is essentially governed by the level at which you set the gc trigger. In the real world you actually want that to be fairly high if you can afford it (in gigabytes, not megabytes), because you often get much higher performance by delaying gcs if you are not short on total memory (essentially using the memory as a buffer). Given that the usage is so negligible, it won't trigger any gc on its own, so you're just measuring accumulated objects - which will always be higher for mclapply because of the bookkeeping and serialization involved in the communication.

The real difference is only in the df case. The reason for it is that your lapply() there is simply a no-op, because R is smart enough to realize that you are always returning the same object, so it won't actually create anything and will just return a reference back to df - thus using no memory at all. However, once you split the inputs, your main session can no longer perform this optimization because the processing is now in a separate process, so it has no way of knowing that you are returning the object unmodified. So what you are measuring is a special case that is arguably not really relevant in real applications.

Cheers,
Simon
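As a rough illustration of the point above that readings from top are not meaningful until after a garbage collection, the following sketch compares lapply and mclapply using gc() instead of an external monitor. The helper names mem.used.mb and measure, the value of N, and the choice of two cores are assumptions made for this example; they do not come from the thread.

```r
library(parallel)

# Hypothetical helper: force a garbage collection and return the memory
# currently used by R objects (Ncells + Vcells), in megabytes.
mem.used.mb <- function() {
  g <- gc()      # runs a collection and returns a usage matrix
  sum(g[, 2])    # column 2 is the "(Mb)" figure for memory in use
}

# Hypothetical helper: memory still held after applying a trivial function
# to N inputs, measured only after a gc on both sides.
measure <- function(FUN.apply, N = 10000) {
  before <- mem.used.mb()
  result <- FUN.apply(seq_len(N), function(i) NULL)
  after <- mem.used.mb()   # `result` is still referenced here, so it counts
  after - before
}

# Forking is not available on Windows, so fall back to one core there.
cores <- if (.Platform$OS.type == "windows") 1L else 2L

lapply.mb   <- measure(lapply)
mclapply.mb <- measure(function(X, FUN) mclapply(X, FUN, mc.cores = cores))
cat("lapply:  ", lapply.mb, "Mb\n")
cat("mclapply:", mclapply.mb, "Mb\n")
```

The figures vary by platform and R version, and gc() only accounts for the main R process, not for the forked children.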
Thanks for the detailed analysis, Simon.

I figured out a workaround that seems to be working in my real application. By limiting the length of the first argument to mclapply (to the number of cores), I get speedups while limiting the memory overhead.

    ### Run mclapply inside of a for loop, ensuring that it never receives
    ### a first argument with a length more than maxjobs. This avoids some
    ### memory problems (swapping, or getting jobs killed on the cluster)
    ### when using mclapply(1:N, FUN) where N is large.
    maxjobs.mclapply <- function(X, FUN, maxjobs=getOption("mc.cores", 2L)){
      N <- length(X)
      ## splitIndices needs a whole number of chunks, and at least one;
      ## the default of 2 cores applies when the mc.cores option is unset.
      i.list <- splitIndices(N, max(1, ceiling(N/maxjobs)))
      result.list <- vector("list", N)
      for(i in seq_along(i.list)){
        i.vec <- i.list[[i]]
        result.list[i.vec] <- mclapply(X[i.vec], FUN)
      }
      result.list
    }

On Thu, Sep 3, 2015 at 5:27 PM, Simon Urbanek <simon.urbanek at r-project.org> wrote:
> I'm not quite sure what is supposed to be the issue here.
> [...]
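A small check of the chunking idea described in this message, written as a self-contained sketch. The function name chunked.mclapply, the default of two cores, and the test input are assumptions for illustration; it is a variant of the workaround, not the code used in the real application.

```r
library(parallel)

# Illustrative variant of the workaround: call mclapply on chunks of at most
# about `maxjobs` elements, so each call forks only a few short-lived jobs.
chunked.mclapply <- function(X, FUN, maxjobs = getOption("mc.cores", 2L)) {
  N <- length(X)
  n.chunks <- max(1L, ceiling(N / maxjobs))   # whole number, at least one
  i.list <- splitIndices(N, n.chunks)         # list of index vectors
  result.list <- vector("list", N)
  for (i.vec in i.list) {
    result.list[i.vec] <- mclapply(X[i.vec], FUN, mc.cores = maxjobs)
  }
  result.list
}

# Forking is not available on Windows, so fall back to one core there.
cores <- if (.Platform$OS.type == "windows") 1L else 2L

X <- 1:25
chunked <- chunked.mclapply(X, function(i) i^2, maxjobs = cores)
plain <- lapply(X, function(i) i^2)
print(identical(chunked, plain))
```

The trade-off is that every chunk pays the cost of forking again, so this only helps when each call to FUN is expensive relative to a fork.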