Sklyar, Oleg (London)
2009-Nov-03 11:27 UTC
[Rd] likely bug in 'serialize' or please explain the memory usage
Hi all,

assume the following problem: a function call takes a function object and a data variable and calls this function with this data on a remote host. It uses serialization to pass both the function and the data via a socket connection to a remote host. The problem is that, depending on the way we call the same construct, the function may be serialized to include the data, which was not requested, as the example below demonstrates (runnable). This is a problem for parallel computing. The problem described below is actually a problem for Rmpi and any other parallel implementation we tested, leading to endless executions in some cases where the total data passed is huge.

Assume the 'mycall' below is the function that takes data and a function object, serializes them and calls the remote host. To make it runnable I just print the size of the serialized objects. A parallel apply implementation would serialize individual list elements and a function and pass those over. Assuming 1 element is 1 MB and having 100 elements and a function as simple as function(z) z, we would expect to pass around 100 MB of data in total, 1 MB to each individual process. However, what happens is that in some situations all 100 MB of data are passed to all the slaves, as the function is serialized to include all of the data! This always happens when we make such a call from an S4 method when the function is defined inline, see the last example.

Can anybody explain this, and possibly suggest a solution? Well, one is: do not define functions to call in the same environment as the caller :(

I do not have immediate access to the newest version of R, so I would be grateful if somebody could test it in that and let me know if the problem is still there. The example is runnable.
Thanks,
Oleg

Dr Oleg Sklyar
Research Technologist
AHL / Man Investments Ltd
+44 (0)20 7144 3803
osklyar at maninvestments.com

-------------------------------------------------------------------------------

mycall = function(x, fun) {
    FUN = serialize(fun, NULL)
    DAT = serialize(x, NULL)

    cat(sprintf("length FUN=%d; length DAT=%d\n", length(FUN), length(DAT)))
    invisible(NULL) ## return results of a call on a remote host with FUN and DAT
}

## the function variant I will be passing into mycall
innerfun = function(z) z
x = runif(1e6)

## test run from the command line
mycall(x, innerfun)
# output: length FUN=106; length DAT=8000022

## test run from within a function
outerfun1 = function(x) mycall(x, innerfun)
outerfun1(x)
# output: length FUN=106; length DAT=8000022

## test run from within a function, where function is defined within
outerfun2 = function(x) {
    nestedfun = function(z) z
    mycall(x, nestedfun)
}
outerfun2(x)
# output: length FUN=253; length DAT=8000022

## define a generic
setGeneric("outerfun3", function(x) standardGeneric("outerfun3"))

## test run from within a method
setMethod("outerfun3", "numeric",
    function(x) mycall(x, innerfun))
outerfun3(x)
# output: length FUN=106; length DAT=8000022

## test run from within a method, where function is defined within
setMethod("outerfun3", "numeric",
    function(x) {
        nestedfun = function(z) z
        mycall(x, nestedfun)
    })
## THIS WILL BE WRONG!
outerfun3(x)
# output: length FUN=8001680; length DAT=8000022

--------------------------------------------------
R version 2.9.0 (2009-04-17)
x86_64-unknown-linux-gnu

locale:
C

attached base packages:
[1] stats graphics grDevices utils datasets methods base
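One way to investigate where the extra bytes come from is to look at what is bound in the environment attached to the function being serialized. The following is only a rough sketch: closure_report and outer_fn are hypothetical names introduced for illustration, and force(x) is used so that x is definitely evaluated in the calling frame, which is an assumption about what happens in the S4 method case rather than something the example above confirms.

```r
# Hypothetical diagnostic helper (not part of base R or Rmpi): report the
# serialized size of a function and the names bound in its enclosing
# environment.
closure_report <- function(fun) {
  env <- environment(fun)
  list(
    bytes    = length(serialize(fun, NULL)),
    env_name = environmentName(env),  # "" for an anonymous call frame
    # Skip listing the global environment to avoid dumping the workspace.
    bindings = if (identical(env, globalenv())) character(0)
               else ls(env, all.names = TRUE)
  )
}

# A function defined outside any call frame.
innerfun <- function(z) z

# A function defined inside a call frame that also holds the data.
outer_fn <- function(x) {
  force(x)                    # assumption: x is evaluated in this frame
  nestedfun <- function(z) z  # its environment is the frame of outer_fn
  closure_report(nestedfun)
}

top    <- closure_report(innerfun)
nested <- outer_fn(runif(1e6))

cat(sprintf("top-level: %d bytes\n", top$bytes))
cat(sprintf("nested:    %d bytes; bindings: %s\n",
            nested$bytes, paste(nested$bindings, collapse = ", ")))
```

If the nested case lists x among the bindings and reports a size close to that of the data, the data is travelling with the function through its environment.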
Duncan Murdoch
2009-Nov-03 11:59 UTC
[Rd] likely bug in 'serialize' or please explain the memory usage
I haven't had a chance to look really closely at this, but I would guess the problem is that in R functions are "closures". The environment attached to the function will be serialized along with it, so if you have a big dataset in the same environment, you'll get that too. I vaguely recall that the global environment and other system environments are handled specially, so that's not true for functions created at the top level, but I'd have to do some experiments to confirm.

So the solution to your problem is to pay attention to the environment of the functions you create. If they need to refer to local variables in the creating frame, then you'll get all of them, so be careful about what you create there. If they don't need to refer to the local frame you can just attach a new smaller environment after building the function.

Duncan Murdoch

______________________________________________
R-devel at r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-devel
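The suggestion of attaching a smaller environment after building the function could look roughly like the sketch below. This is an illustration under stated assumptions, not a tested fix for Rmpi: strip_closure and make_worker are hypothetical names, and the global environment is chosen as the replacement only because the function here does not use any local variables.

```r
# Hypothetical helper: return a copy of `fun` whose enclosing environment
# is `env` instead of the frame that created it.
strip_closure <- function(fun, env = globalenv()) {
  environment(fun) <- env
  fun
}

make_worker <- function(x) {
  force(x)                    # x is evaluated and lives in this frame
  nestedfun <- function(z) z  # closure over this frame, including x
  list(raw = nestedfun, stripped = strip_closure(nestedfun))
}

x <- runif(1e6)
w <- make_worker(x)

size_raw      <- length(serialize(w$raw, NULL))
size_stripped <- length(serialize(w$stripped, NULL))
cat(sprintf("raw=%d; stripped=%d\n", size_raw, size_stripped))
```

The stripped copy can no longer see variables from the frame that created it, so this only applies when the function does not need them. If it needs a few of them, a new environment holding just those objects could be attached instead of the global environment.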