Hi,

I've implemented parallelization in one of my packages using the 'parallel' package -- many thanks for providing it!

In my package I'm importing 'parallel', so I added it to the DESCRIPTION file's 'Imports:' field and also added an 'importFrom("parallel", ...)' directive to the NAMESPACE file.

Parallelization works nicely, but my package no longer passes any parts of its (unparallelized) checks that depend on random number generation, e.g., the simulated data in the check suite are no longer the same as before parallelization was added. This seems to be due to 'parallel' changing '.Random.seed' when loading its name space:

> set.seed(1)
> rs1 <- .Random.seed
> rnorm(1)
[1] -0.6264538
> set.seed(1)
> rs2 <- .Random.seed
> identical(rs1, rs2)
[1] TRUE
> loadNamespace("parallel")
<environment: namespace:parallel>
> rs3 <- .Random.seed
> identical(rs1, rs3)
[1] FALSE
> rnorm(1)
[1] -0.3262334
> set.seed(1)
> rs4 <- .Random.seed
> identical(rs1, rs4)
[1] TRUE

I've taken a look at the 'parallel' source code, and in a few places a call to 'runif(1)' is issued. So, what effectively seems to happen when 'parallel' is loaded is

> set.seed(1)
> runif(1)
[1] 0.2655087
> rnorm(1)
[1] -0.3262334

which reproduces the above. But is this really necessary? And more importantly (at least to me): Can it somehow be avoided?

The current state of affairs is a bit unfortunate, since it implies that a user, just by loading the new parallelized version of my package, can no longer reproduce any subsequent results depending on random number generation (unless a call to 'set.seed' was issued *after* attaching my package).

I'd be most grateful for any help that you're able to provide here. Many thanks!

Kind regards,
Henric Winell

> sessionInfo()
R Under development (unstable) (2014-01-26 r64897)
Platform: x86_64-redhat-linux-gnu (64-bit)

locale:
 [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C
 [3] LC_TIME=sv_SE.UTF-8        LC_COLLATE=en_US.UTF-8
 [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8
 [7] LC_PAPER=en_US.UTF-8       LC_NAME=C
 [9] LC_ADDRESS=C               LC_TELEPHONE=C
[11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base

loaded via a namespace (and not attached):
[1] compiler_3.1.0 parallel_3.1.0 tools_3.1.0

Comments below.

On 2014-03-06 11:17, Henric Winell wrote:
> [...]
> I've taken a look at the 'parallel' source code, and in a few places a
> call to 'runif(1)' is issued. So, what effectively seems to happen when
> 'parallel' is loaded is
>
> > set.seed(1)
> > runif(1)
> [1] 0.2655087
> > rnorm(1)
> [1] -0.3262334

Some digging reveals that this is due to no port number for the socket connection being set by default, in which case 'parallel' picks a random port in the 11000-11999 range using 'runif(1L)'.

So, by setting R_PARALLEL_PORT, the '.Random.seed' object is no longer touched:

> Sys.setenv(R_PARALLEL_PORT = 11500)
> set.seed(1)
> rs1 <- .Random.seed
> loadNamespace("parallel")
<environment: namespace:parallel>
> rs2 <- .Random.seed
> identical(rs1, rs2)
[1] TRUE

This is handled in the 'initDefaultClusterOptions' function in 'snow.R', where line 88 has

    port <- 11000 + 1000 * ((stats::runif(1L) + unclass(Sys.time())/300)%%1)

It seems to me that we can tread more carefully here. I've attached a trivial patch that

1. Checks if '.Random.seed' exists
2. If TRUE:
   a) save '.Random.seed'
   b) make the call above
   c) reset '.Random.seed' to its state in a)
   If FALSE:
   a) make the call above
   b) remove '.Random.seed'

In due course I hope someone is interested enough to review it.

Henric Winell

> which reproduces the above. But is this really necessary? And more
> importantly (at least to me): Can it somehow be avoided?
> [...]
> ______________________________________________
> R-devel at r-project.org mailing list
> https://stat.ethz.ch/mailman/listinfo/r-devel

-------------- next part --------------
A non-text attachment was scrubbed...
Name: snow.R.patch
Type: text/x-patch
Size: 1138 bytes
Desc: not available
URL: <https://stat.ethz.ch/pipermail/r-devel/attachments/20140306/e83ca0fc/attachment.bin>

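The attachment itself was scrubbed by the archive, so the actual patch is not reproduced here. As a rough sketch only, the save/evaluate/restore steps listed in the message could be written along the following lines. The helper name 'with_preserved_seed' is hypothetical (it is not part of 'parallel' and not necessarily how the patch was written), and it assumes '.Random.seed' lives in the global environment, as it does in a normal R session.

```r
# Hypothetical helper, not part of 'parallel': evaluate 'expr' and leave the
# global '.Random.seed' exactly as it was found (restored, or removed again
# if it did not exist beforehand).
with_preserved_seed <- function(expr) {
    genv <- globalenv()
    has_seed <- exists(".Random.seed", envir = genv, inherits = FALSE)
    if (has_seed)
        old_seed <- get(".Random.seed", envir = genv, inherits = FALSE)
    on.exit({
        if (has_seed) {
            # Step 2 c): reset '.Random.seed' to the saved state
            assign(".Random.seed", old_seed, envir = genv)
        } else if (exists(".Random.seed", envir = genv, inherits = FALSE)) {
            # FALSE branch, step b): remove the seed the call created
            rm(".Random.seed", envir = genv)
        }
    })
    force(expr)  # 'expr' is a promise; it is evaluated here
}

# Usage: the port computation quoted from 'snow.R', wrapped in the helper
set.seed(1)
before <- .Random.seed
port <- with_preserved_seed(
    11000 + 1000 * ((stats::runif(1L) + unclass(Sys.time()) / 300) %% 1)
)
stopifnot(identical(before, .Random.seed))
```

Registering the restore step with 'on.exit' means the seed is put back even if the wrapped expression signals an error.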
On 06/03/2014 10:17, Henric Winell wrote:
> [...]
> I've taken a look at the 'parallel' source code, and in a few places a
> call to 'runif(1)' is issued. So, what effectively seems to happen when
> 'parallel' is loaded is
>
> > set.seed(1)
> > runif(1)
> [1] 0.2655087
> > rnorm(1)
> [1] -0.3262334
>
> which reproduces the above. But is this really necessary?

Yes, in the places it is used. Two are to do with setting up parallel streams when called, and the other is only called if R_PARALLEL_PORT is unset. So set R_PARALLEL_PORT. But your presumptions are wrong: R is perfectly entitled to use its random number generator, as is other code running in the R interpreter. Once your call returns you cannot expect the session state to remain unchanged.

> And more importantly (at least to me): Can it somehow be avoided?
> [...]
> > sessionInfo()
> R Under development (unstable) (2014-01-26 r64897)

See what the posting guide says about updating before posting ....

> [...]

--
Brian D. Ripley,                  ripley at stats.ox.ac.uk
Professor of Applied Statistics,  http://www.stats.ox.ac.uk/~ripley/
University of Oxford,             Tel:  +44 1865 272861 (self)
1 South Parks Road,                     +44 1865 272866 (PA)
Oxford OX1 3TG, UK                Fax:  +44 1865 272595
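
The two points in this reply (set R_PARALLEL_PORT, and do not rely on the RNG state surviving other code) could be combined in a check script roughly as follows. This is only a sketch: the port value 11500 is the one used earlier in the thread, and the seed value and sample size are arbitrary.

```r
# Set the port before the 'parallel' namespace is loaded so that, per the
# discussion in this thread, no random port has to be drawn at load time.
Sys.setenv(R_PARALLEL_PORT = 11500)
loadNamespace("parallel")

# Seed only after all namespaces are loaded; the simulated data then do not
# depend on whatever the loading code did to the RNG state.
set.seed(1)
x1 <- rnorm(3)

set.seed(1)
x2 <- rnorm(3)

stopifnot(identical(x1, x2))
```

Seeding immediately before the code that needs reproducible draws is the more robust of the two measures, since it does not depend on how any particular package behaves when loaded.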