Paul Johnson
2012-Feb-17 20:57 UTC
[Rd] portable parallel seeds project: request for critiques
I've got another edition of my simulation replication framework. I'm attaching 2 R files and pasting in the readme. I would especially like to know if I'm doing anything that breaks .Random.seed or other things that R's parallel uses in the environment. In case you don't want to wrestle with attachments, the same files are online in our SVN http://winstat.quant.ku.edu/svn/hpcexample/trunk/Ex66-ParallelSeedPrototype/ ## Paul E. Johnson CRMDA <pauljohn at ku.edu> ## Portable Parallel Seeds Project. ## 2012-02-18 Portable Parallel Seeds Project This is how I'm going to recommend we work with random number seeds in simulations. It enhances work that requires runs with random numbers, whether runs are in a cluster computing environment or in a single workstation. It is a solution for two separate problems. Problem 1. I scripted up 1000 R runs and need high quality, unique, replicable random streams for each one. Each simulation runs separately, but I need to be confident their streams are not correlated or overlapping. For replication, I need to be able to select any run, say 667, and restart it exactly as it was. Problem 2. I've written a Parallel MPI (Message Passing Interface) routine that launches 1000 runs and I need to assure each has a unique, replicatable, random stream. I need to be able to select any run, say 667, and restart it exactly as it was. This project develops one approach to create replicable simulations. It blends ideas about seed management from John M. Chambers Software for Data Analysis (2008) with ideas from the snowFT package by Hana Sevcikova and Tony R. Rossini. Here's my proposal. 1. Run a preliminary program to generate an array of seeds run1: seed1.1 seed1.2 seed1.3 run2: seed2.1 seed2.2 seed2.3 run3: seed3.1 seed3.2 seed3.3 ... ... ... run1000 seed1000.1 seed1000.2 seed1000.3 This example provides 3 separate streams of random numbers within each run. Because we will use the L'Ecuyer "many separate streams" approach, we are confident that there is no correlation or overlap between any of the runs. The projSeeds has to have one row per project, but it is not a huge file. I created seeds for 2000 runs of a project that requires 2 seeds per run. The saved size of the file 104443kb, which is very small. By comparison, a 1400x1050 jpg image would usually be twice that size. If you save 10,000 runs-worth of seeds, the size rises to 521,993kb, still pretty small. Because the seeds are saved in a file, we are sure each run can be replicated. We just have to teach each program how to use the seeds. That is step two. 2. Inside each run, an initialization function runs that loads the seeds file and takes the row of seeds that it needs. As the simulation progresses, the user can ask for random numbers from the separate streams. When we need random draws from a particular stream, we set the variable "currentStream" with the function useStream(). The function initSeedStreams creates several objects in the global environment. It sets the integer currentStream, as well as two list objects, startSeeds and currentSeeds. At the outset of the run, startSeeds and currentSeeds are the same thing. When we change the currentStream to a different stream, the currentSeeds vector is updated to remember where that stream was when we stopped drawing numbers from it. Now, for the proof of concept. A working example. Step 1. Create the Seeds. Review the R program seedCreator.R That creates the file "projSeeds.rda". Step 2. Use one row of seeds per run. Please review "controlledSeeds.R" to see an example usage that I've tested on a cluster. "controlledSeeds.R" can also be run on a single workstation for testing purposes. There is a variable "runningInMPI" which determines whether the code is supposed to run on the RMPI cluster or just in a single workstation. The code for each run of the model begins by loading the required libraries and loading the seed file, if it exists, or generating a new "projSeed" object if it is not found. library(parallel) RNGkind("L'Ecuyer-CMRG") set.seed(234234) if (file.exists("projSeeds.rda")) { load("projSeeds.rda") } else { source("seedCreator.R") } ## Suppose the "run" number is: run <- 232 initSeedStreams(run) After that, R's random generator functions will draw values from the first random random stream that was initialized in projSeeds. When each repetition (run) occurs, R looks up the right seed for that run, and uses it. If the user wants to begin drawing observations from the second random stream, this command is used: useStream(2) If the user has drawn values from stream 1 already, but wishes to begin again at the initial point in that stream, use this command useStream(1, origin = TRUE) Question: Why is this approach better for parallel runs? Answer: After a batch of simulations, we can re-start any one of them and repeat it exactly. This builds on the idea of the snowFT package, by Hana Sevcikova and A.J. Rossini. That is different from the default approach of most R parallel designs, including R's own parallel, RMPI and snow. The ordinary way of controlling seeds in R parallel would initialize the 50 nodes, and we would lose control over seeds because runs would be repeatedly assigned to nodes. The aim here is to make sure that each particular run has a known starting point. After a batch of 10,000 runs, we can look and say "something funny happened on run 1,323" and then we can bring that back to life later, easily. Question: Why is this better than the simple old approach of setting the seeds within each run with a formula like set.seed(2345 + 10 * run) Answer: That does allow replication, but it does not assure that each run uses non-overlapping random number streams. It offers absolutely no assurance whatsoever that the runs are actually non-redundant. Nevertheless, it is a method that is widely used and recommended by some visible HOWTO guides. Citations Hana Sevcikova and A. J. Rossini (2010). snowFT: Fault Tolerant Simple Network of Workstations. R package version 1.2-0. http://CRAN.R-project.org/package=snowFT John M Chambers (2008). SoDA: Functions and Exampels for "Software for Data Analysis". R package version 1.0-3. John M Chambers (2008) Software for Data Analysis. Springer. -- Paul E. Johnson Professor, Political Science 1541 Lilac Lane, Room 504 University of Kansas
Paul Gilbert
2012-Feb-17 21:23 UTC
[Rd] portable parallel seeds project: request for critiques
Paul I think (perhaps incorrectly) of the general problem being that one wants to run a random experiment, on a single node, or two nodes, or ten nodes, or any number of nodes, and reliably be able to reproduce the experiment without concern about how many nodes it runs on when you re-run it. From your description I don't have the impression your solution would do that. Am I misunderstanding? A second problem is that you want to use a proven algorithm for generating the numbers. This is implicitly solved by the above, because you always get the same result as you do on one node with a well proven RNG. If you generate a string of seed and then numbers from those, do you have a proven RNG? Paul On 12-02-17 03:57 PM, Paul Johnson wrote:> I've got another edition of my simulation replication framework. I'm > attaching 2 R files and pasting in the readme. > > I would especially like to know if I'm doing anything that breaks > .Random.seed or other things that R's parallel uses in the > environment. > > In case you don't want to wrestle with attachments, the same files are > online in our SVN > > http://winstat.quant.ku.edu/svn/hpcexample/trunk/Ex66-ParallelSeedPrototype/ > > > ## Paul E. Johnson CRMDA<pauljohn at ku.edu> > ## Portable Parallel Seeds Project. > ## 2012-02-18 > > Portable Parallel Seeds Project > > This is how I'm going to recommend we work with random number seeds in > simulations. It enhances work that requires runs with random numbers, > whether runs are in a cluster computing environment or in a single > workstation. > > It is a solution for two separate problems. > > Problem 1. I scripted up 1000 R runs and need high quality, > unique, replicable random streams for each one. Each simulation > runs separately, but I need to be confident their streams are > not correlated or overlapping. For replication, I need to be able to > select any run, say 667, and restart it exactly as it was. > > Problem 2. I've written a Parallel MPI (Message Passing Interface) > routine that launches 1000 runs and I need to assure each has > a unique, replicatable, random stream. I need to be able to > select any run, say 667, and restart it exactly as it was. > > This project develops one approach to create replicable simulations. > It blends ideas about seed management from John M. Chambers > Software for Data Analysis (2008) with ideas from the snowFT > package by Hana Sevcikova and Tony R. Rossini. > > > Here's my proposal. > > 1. Run a preliminary program to generate an array of seeds > > run1: seed1.1 seed1.2 seed1.3 > run2: seed2.1 seed2.2 seed2.3 > run3: seed3.1 seed3.2 seed3.3 > ... ... ... > run1000 seed1000.1 seed1000.2 seed1000.3 > > This example provides 3 separate streams of random numbers within each > run. Because we will use the L'Ecuyer "many separate streams" > approach, we are confident that there is no correlation or overlap > between any of the runs. > > The projSeeds has to have one row per project, but it is not a huge > file. I created seeds for 2000 runs of a project that requires 2 seeds > per run. The saved size of the file 104443kb, which is very small. By > comparison, a 1400x1050 jpg image would usually be twice that size. > If you save 10,000 runs-worth of seeds, the size rises to 521,993kb, > still pretty small. > > Because the seeds are saved in a file, we are sure each > run can be replicated. We just have to teach each program > how to use the seeds. That is step two. > > > 2. Inside each run, an initialization function runs that loads the > seeds file and takes the row of seeds that it needs. As the > simulation progresses, the user can ask for random numbers from the > separate streams. When we need random draws from a particular stream, > we set the variable "currentStream" with the function useStream(). > > The function initSeedStreams creates several objects in > the global environment. It sets the integer currentStream, > as well as two list objects, startSeeds and currentSeeds. > At the outset of the run, startSeeds and currentSeeds > are the same thing. When we change the currentStream > to a different stream, the currentSeeds vector is > updated to remember where that stream was when we stopped > drawing numbers from it. > > > Now, for the proof of concept. A working example. > > Step 1. Create the Seeds. Review the R program > > seedCreator.R > > That creates the file "projSeeds.rda". > > > Step 2. Use one row of seeds per run. > > Please review "controlledSeeds.R" to see an example usage > that I've tested on a cluster. > > "controlledSeeds.R" can also be run on a single workstation for > testing purposes. There is a variable "runningInMPI" which determines > whether the code is supposed to run on the RMPI cluster or just in a > single workstation. > > > The code for each run of the model begins by loading the > required libraries and loading the seed file, if it exists, or > generating a new "projSeed" object if it is not found. > > library(parallel) > RNGkind("L'Ecuyer-CMRG") > set.seed(234234) > if (file.exists("projSeeds.rda")) { > load("projSeeds.rda") > } else { > source("seedCreator.R") > } > > ## Suppose the "run" number is: > run<- 232 > initSeedStreams(run) > > After that, R's random generator functions will draw values > from the first random random stream that was initialized > in projSeeds. When each repetition (run) occurs, > R looks up the right seed for that run, and uses it. > > If the user wants to begin drawing observations from the > second random stream, this command is used: > > useStream(2) > > If the user has drawn values from stream 1 already, but > wishes to begin again at the initial point in that stream, > use this command > > useStream(1, origin = TRUE) > > > Question: Why is this approach better for parallel runs? > > Answer: After a batch of simulations, we can re-start any > one of them and repeat it exactly. This builds on the idea > of the snowFT package, by Hana Sevcikova and A.J. Rossini. > > That is different from the default approach of most R parallel > designs, including R's own parallel, RMPI and snow. > > The ordinary way of controlling seeds in R parallel would initialize > the 50 nodes, and we would lose control over seeds because runs would > be repeatedly assigned to nodes. The aim here is to make sure that > each particular run has a known starting point. After a batch of > 10,000 runs, we can look and say "something funny happened on run > 1,323" and then we can bring that back to life later, easily. > > > > Question: Why is this better than the simple old approach of > setting the seeds within each run with a formula like > > set.seed(2345 + 10 * run) > > Answer: That does allow replication, but it does not assure > that each run uses non-overlapping random number streams. It > offers absolutely no assurance whatsoever that the runs are > actually non-redundant. > > Nevertheless, it is a method that is widely used and recommended > by some visible HOWTO guides. > > > > Citations > > Hana Sevcikova and A. J. Rossini (2010). snowFT: Fault Tolerant > Simple Network of Workstations. R package version 1.2-0. > http://CRAN.R-project.org/package=snowFT > > John M Chambers (2008). SoDA: Functions and Exampels for "Software > for Data Analysis". R package version 1.0-3. > > John M Chambers (2008) Software for Data Analysis. Springer. > > > > > > ______________________________________________ > R-devel at r-project.org mailing list > https://stat.ethz.ch/mailman/listinfo/r-devel
Petr Savicky
2012-Feb-17 23:06 UTC
[Rd] portable parallel seeds project: request for critiques
On Fri, Feb 17, 2012 at 02:57:26PM -0600, Paul Johnson wrote:> I've got another edition of my simulation replication framework. I'm > attaching 2 R files and pasting in the readme. > > I would especially like to know if I'm doing anything that breaks > .Random.seed or other things that R's parallel uses in the > environment. > > In case you don't want to wrestle with attachments, the same files are > online in our SVN > > http://winstat.quant.ku.edu/svn/hpcexample/trunk/Ex66-ParallelSeedPrototype/ > > > ## Paul E. Johnson CRMDA <pauljohn at ku.edu> > ## Portable Parallel Seeds Project. > ## 2012-02-18 > > Portable Parallel Seeds Project > > This is how I'm going to recommend we work with random number seeds in > simulations. It enhances work that requires runs with random numbers, > whether runs are in a cluster computing environment or in a single > workstation. > > It is a solution for two separate problems. > > Problem 1. I scripted up 1000 R runs and need high quality, > unique, replicable random streams for each one. Each simulation > runs separately, but I need to be confident their streams are > not correlated or overlapping. For replication, I need to be able to > select any run, say 667, and restart it exactly as it was. > > Problem 2. I've written a Parallel MPI (Message Passing Interface) > routine that launches 1000 runs and I need to assure each has > a unique, replicatable, random stream. I need to be able to > select any run, say 667, and restart it exactly as it was. > > This project develops one approach to create replicable simulations. > It blends ideas about seed management from John M. Chambers > Software for Data Analysis (2008) with ideas from the snowFT > package by Hana Sevcikova and Tony R. Rossini. > > > Here's my proposal. > > 1. Run a preliminary program to generate an array of seeds > > run1: seed1.1 seed1.2 seed1.3 > run2: seed2.1 seed2.2 seed2.3 > run3: seed3.1 seed3.2 seed3.3 > ... ... ... > run1000 seed1000.1 seed1000.2 seed1000.3 > > This example provides 3 separate streams of random numbers within each > run. Because we will use the L'Ecuyer "many separate streams" > approach, we are confident that there is no correlation or overlap > between any of the runs. > > The projSeeds has to have one row per project, but it is not a huge > file. I created seeds for 2000 runs of a project that requires 2 seeds > per run. The saved size of the file 104443kb, which is very small. By > comparison, a 1400x1050 jpg image would usually be twice that size. > If you save 10,000 runs-worth of seeds, the size rises to 521,993kb, > still pretty small. > > Because the seeds are saved in a file, we are sure each > run can be replicated. We just have to teach each program > how to use the seeds. That is step two.Hi. Some of the random number generators allow as a seed a vector, not only a single number. This can simplify generating the seeds. There can be one seed for each of the 1000 runs and then, the rows of the seed matrix can be c(seed1, 1), c(seed1, 2), ... c(seed2, 1), c(seed2, 2), ... c(seed3, 1), c(seed3, 2), ... ... There could be even only one seed and the matrix can be generated as c(seed, 1, 1), c(seed, 1, 2), ... c(seed, 2, 1), c(seed, 2, 2), ... c(seed, 3, 1), c(seed, 3, 2), ... If the initialization using the vector c(seed, i, j) is done with a good quality hash function, the runs will be independent. What is your opinion on this? An advantage of seeding with a vector is also that there can be significantly more initial states of the generator among which we select by the seed than 2^32, which is the maximum for a single integer seed. Petr Savicky.
Petr Savicky
2012-Feb-21 13:04 UTC
[Rd] portable parallel seeds project: request for critiques
On Fri, Feb 17, 2012 at 02:57:26PM -0600, Paul Johnson wrote:> I've got another edition of my simulation replication framework. I'm > attaching 2 R files and pasting in the readme. > > I would especially like to know if I'm doing anything that breaks > .Random.seed or other things that R's parallel uses in the > environment. > > In case you don't want to wrestle with attachments, the same files are > online in our SVN > > http://winstat.quant.ku.edu/svn/hpcexample/trunk/Ex66-ParallelSeedPrototype/Hi. In the description of your project in the file http://winstat.quant.ku.edu/svn/hpcexample/trunk/Ex66-ParallelSeedPrototype/README you argue as follows Question: Why is this better than the simple old approach of setting the seeds within each run with a formula like set.seed(2345 + 10 * run) Answer: That does allow replication, but it does not assure that each run uses non-overlapping random number streams. It offers absolutely no assurance whatsoever that the runs are actually non-redundant. The following demonstrates that the function set.seed() for the default generator indeed allows to have correlated streams. step <- function(x) { x[x < 0] <- x[x < 0] + 2^32 x <- (69069 * x + 1) %% 2^32 x[x > 2^31] <- x[x > 2^31] - 2^32 x } n <- 1000 seed1 <- 124370417 # any seed seed2 <- step(seed1) set.seed(seed1) x <- runif(n) set.seed(seed2) y <- runif(n) rbind(seed1, seed2) table(x[-1] == y[-n]) The output is [,1] seed1 124370417 seed2 205739774 FALSE TRUE 5 994 This means that if the streams x, y are generated from the two seeds above, then y is almost exactly equal to x shifted by 1. What is the current state of your project? Petr Savicky.
Reasonably Related Threads
- Example of "task seeds" with R parallel. Critique?
- How to seed the R random number generator in C (standalone) with an instance of .Random.seed
- Create a matrix with increment and element with zero subscript
- glm offset and interaction bugs (PR#4941)
- .Random.seed