We have a project that calls for the creation of a list of many
distribution objects. Distributions can be of various types, with
various parameters, but we ran into some problems. I started testing
on a simple list of rnorm-based objects.

I was a little surprised at the RAM storage requirements; here's an example:

N <- 10000
closureList <- vector("list", N)
nsize <- sample(x = 1:100, size = N, replace = TRUE)
for (i in seq_along(nsize)){
    closureList[[i]] <- list(func = rnorm, n = nsize[i])
}
format(object.size(closureList), units = "Mb")

Output says
22.4 Mb

I noticed that if I do not name the objects in the list, then the
storage drops to 19.9 Mb.

That seemed like a lot of storage for a function's name. Why so much?
My colleagues think the RAM use is high because this is a closure
(hence closureList). I can't even convince myself it actually is a
closure. The R source has

rnorm <- function(n, mean = 0, sd = 1) .Call(C_rnorm, n, mean, sd)

The list appears to hold 10000 copies of rnorm, but we really only
need 1, which all the objects could share.

Thinking of this like C, I am looking to pass in a pointer to the
function. I found my way to the idea of putting a function in an
environment in order to pass it by reference:

rnormPointer <- function(inputValue1, inputValue2){
    object <- new.env(parent = globalenv())
    object$distr <- inputValue1
    object$n <- inputValue2
    class(object) <- 'pointer'
    object
}

## Experiment with that
gg <- rnormPointer(rnorm, 33)
gg$distr(gg$n)

ptrList <- vector("list", N)
for (i in seq_along(nsize)) {
    ptrList[[i]] <- rnormPointer(rnorm, nsize[i])
}
format(object.size(ptrList), units = "Mb")

The required storage is reduced to 2.6 Mb. That's about 1/10 of the
RAM required for closureList. This thing works in the way I expect:

## can pass in the unnamed arguments for n, mean and sd here
ptrList[[1]]$distr(33, 100, 10)
## Or the named arguments
ptrList[[1]]$distr(1, sd = 100)

This environment trick mostly works, so far as I can see, but I have
these questions.

1. Is the object.size() return accurate for ptrList? Do I really
reduce storage to that amount, or is the required storage someplace
else (in the new environment) that is not included in object.size()?

2. Am I running with scissors here? Do unexpected bad things await?

3. Why is the storage for closureList so great? It looks to me like
rnorm is just this little thing:

function (n, mean = 0, sd = 1)
.Call(C_rnorm, n, mean, sd)
<bytecode: 0x55cc9988cae0>

4. Could I learn (you show me?) to store the bytecode address as a
thing and use it in the objects? I'd guess that is the fastest
possible way. In an Objective-C problem in the olden days, we found
that method lookup was a major slowdown, and one of the programmers
showed us how to save the lookup and use it over and over.

pj

-- 
Paul E. Johnson   http://pj.freefaculty.org
Director, Center for Research Methods and Data Analysis   http://crmda.ku.edu

To write to me directly, please address me at pauljohn at ku.edu.
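A quick way to check whether those 10000 func entries really are
separate copies is to compare memory addresses. This is a minimal
sketch, not part of the original message; it assumes the pryr package
is installed, whose address() reports where a variable's underlying
object lives:

library(pryr)
f1 <- closureList[[1]]$func    # extraction from the list does not copy the function
f2 <- closureList[[2]]$func
address(f1)                    # some address like "0x..."
address(f2)                    # the same address: both elements share one object
address(rnorm)                 # and it is the very same object as rnorm itself

Because every element points at the same function object, the 22.4 Mb
figure reflects object.size() charging each element for a full copy,
not real duplication in RAM.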
On 22/11/2017 11:29 AM, Paul Johnson wrote:
> I was a little surprised at the RAM storage requirements; here's an example:
>
> N <- 10000
> closureList <- vector("list", N)
> nsize <- sample(x = 1:100, size = N, replace = TRUE)
> for (i in seq_along(nsize)){
>     closureList[[i]] <- list(func = rnorm, n = nsize[i])
> }
> format(object.size(closureList), units = "Mb")
>
> Output says
> 22.4 Mb

You should read the help page for object.size. You're doing exactly
the kind of thing that causes it to give overestimates of the amount
of memory being used.

I'd suggest turning on memory profiling in Rprof() for a more
accurate result, but it seems to be broken:

> Rprof(memory.profiling = TRUE)
> N <- 10000
> closureList <- vector("list", N)
> nsize <- sample(x = 1:100, size = N, replace = TRUE)
> for (i in seq_along(nsize)){
+     closureList[[i]] <- list(func = rnorm, n = nsize[i])
+ }
> format(object.size(closureList), units = "Mb")
[1] "19.2 Mb"
> Rprof(NULL)
> summaryRprof()
Error in rowsum.default(c(as.vector(new.ftable), fcounts), c(names(new.ftable),  :
  unimplemented type 'NULL' in 'HashTableSetup'
In addition: Warning message:
In is.na(x) : is.na() applied to non-(list or vector) of type 'NULL'

Duncan Murdoch
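Since the profiler errors out here, one rough alternative is to
difference gc() totals around the allocation. This is a sketch, not
from the thread, and it assumes 64-bit R, where cons cells (Ncells)
are 56 bytes and vector cells (Vcells) are 8 bytes; closureList2 is a
hypothetical fresh copy of the list so the baseline is not disturbed:

g0 <- gc(reset = TRUE)    # baseline after a full collection
closureList2 <- lapply(nsize, function(n) list(func = rnorm, n = n))
g1 <- gc()                # totals after the allocation
## growth in Mb; rows of the gc() matrix are Ncells then Vcells
sum((g1[, "used"] - g0[, "used"]) * c(56, 8)) / 2^20

The reported growth should be far below the 22.4 Mb that object.size()
claims, consistent with the sharing described above.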
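On the "running with scissors" question, one genuine hazard of the
environment trick is worth illustrating: environments have reference
semantics, so copying a 'pointer' copies the reference, not the
state. A minimal sketch, not from the thread, using the rnormPointer()
defined in the original message:

p1 <- rnormPointer(rnorm, 10)
p2 <- p1          # copies the reference, not the environment's contents
p2$n <- 99
p1$n              # 99 as well: both names refer to the same environment

Ordinary lists would be copied on modification; environments are not,
so any code that assumes value semantics can be silently surprised.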
I am replying to the first part of the question, about the size of
the object. It is probably best to use the "object_size" function in
the "pryr" package: 'object_size' works similarly to 'object.size',
but counts more accurately and includes the size of environments.
'compare_size' makes it easy to compare the output of 'object_size'
and 'object.size'. Here is what you get from the same code:

> N <- 10000
> closureList <- vector("list", N)
> nsize <- sample(x = 1:100, size = N, replace = TRUE)
> for (i in seq_along(nsize)){
+     closureList[[i]] <- list(func = rnorm, n = nsize[i])
+ }
> format(object.size(closureList), units = "Mb")
[1] "22.4 Mb"
> pryr::compare_size(closureList)
    base     pryr
23520040  2241776

You will notice that you get back a size that is about 10X smaller,
because pryr accounts for the shared space.

Jim Holtman
Data Munger Guru

What is the problem that you are trying to solve?
Tell me what you want to do, not how you want to do it.
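To see where the difference comes from, it may help to size one
element against the whole list. A small follow-on sketch, not part of
Jim's message, using the same pryr package:

pryr::object_size(rnorm)              # the single shared function
pryr::object_size(closureList[[1]])   # one element, including its rnorm
pryr::object_size(closureList)        # whole list: rnorm is counted only once

The whole list comes out barely larger than one element plus 10000
small list shells, which matches the ~2.2 MB pryr figure above.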