thr3ads.net - R help - [R] Random Seed Location [Mar 2018]

If this information is useful, please help other people find it:
Share via:

Henrik Bengtsson

2018-Mar-04 22:54 UTC

[R] Random Seed Location

The following helps identify when .GlobalEnv$.Random.seed has changed:

rng_tracker <- local({
  last <- .GlobalEnv$.Random.seed
  function(...) {
    curr <- .GlobalEnv$.Random.seed
    if (!identical(curr, last)) {
      warning(".Random.seed changed")
      last <<- curr
    }
    TRUE
  }
})

addTaskCallback(rng_tracker, name = "RNG tracker")


EXAMPLE:
> sample.int(1L)[1] 1
Warning: .Random.seed changed

This will help you find for instance:

## Loading ggplot2 does not affect the RNG seed> loadNamespace("ggplot2")<environment: namespace:ggplot2>

## But attaching it does> library("ggplot2")Warning: .Random.seed changed

which reveals:
> ggplot2:::.onAttachfunction (...)
{
    if (!interactive() || stats::runif(1) > 0.1)
        return()
    tips <- c("Need help? Try the ggplot2 mailing list:
http://groups.google.com/group/ggplot2.",
        "Find out what's changed in ggplot2 at
http://github.com/tidyverse/ggplot2/releases.",
        "Use suppressPackageStartupMessages() to eliminate package
startup messages.",
        "Stackoverflow is a great place to get help:
http://stackoverflow.com/tags/ggplot2.",
        "Need help getting started? Try the cookbook for R:
http://www.cookbook-r.com/Graphs/",
        "Want to understand how all the pieces fit together? Buy the
ggplot2 book: http://ggplot2.org/book/")
    tip <- sample(tips, 1)
    packageStartupMessage(paste(strwrap(tip), collapse = "\n"))
}
<environment: namespace:ggplot2>

There are probably many case of this in different R packages.


R WISH:

There could be a

preserveRandomSeed({
  tip <- sample(tips, 1)
})

function in R for these type of random needs where true random
properties are non-critical.  This type of
"draw-a-random-number-and-reset-the-seed" is for instance used in
parallel:::initDefaultClusterOptions() which is called when the
'parallel' package is loaded:

seed <- .GlobalEnv$.Random.seed
ran1 <- sample.int(.Machine$integer.max - 1L, 1L) / .Machine$integer.max
port <- 11000 + 1000 * ((ran1 + unclass(Sys.time()) / 300) %% 1)
if(is.null(seed)) ## there was none, initially
rm( ".Random.seed", envir = .GlobalEnv, inherits = FALSE)
else # reset
assign(".Random.seed", seed, envir = .GlobalEnv, inherits = FALSE)

/Henrik

On Sun, Mar 4, 2018 at 1:40 PM, Gary Black <gwblack001 at sbcglobal.net>
wrote:> Thank you, everybody, who replied!  I appreciate your valuable advise!  I
will move the location of the set.seed() command to after all packages have been
installed and loaded.
>
> Best regards,
> Gary
>
> Sent from my iPad
>
>> On Mar 4, 2018, at 12:18 PM, Paul Gilbert <pgilbert902 at
gmail.com> wrote:
>>
>> On Mon, Feb 26, 2018 at 3:25 PM, Gary Black <gwblack001 at
sbcglobal.net>
>> wrote:
>>
>> (Sorry to be a bit slow responding.)
>>
>> You have not supplied a complete example, which would be good in this
case because what you are suggesting could be a serious bug in R or a package.
Serious journals require reproducibility these days. For example, JSS is very
clear on this point.
>>
>> To your question
>> > My question simply is:  should the location of the set.seed
command matter,
>> > provided that it is applied before any commands which involve
randomness
>> > (such as partitioning)?
>>
>> the answer is no, it should not matter. But the proviso is important.
>>
>> You can determine where things are messing up using something like
>>
>> set.seed(654321)
>> zk <- RNGkind()        # [1] "Mersenne-Twister"
"Inversion"
>> zk
>> z <- runif(2)
>> z
>> set.seed(654321)
>>
>> #  install.packages(c('caret', 'ggplot2',
'e1071'))
>> library(caret)
>> all(runif(2)  == z)   # should be true but it is not always
>>
>> set.seed(654321)
>> library(ggplot2)
>> all(runif(2)  == z)   # should be true
>>
>> set.seed(654321)
>> library(e1071)
>> all(runif(2)  == z)   # should be true
>>
>> all(RNGkind() == zk)  # should be true
>>
>> On my computer package caret seems to sometimes, but not always, do
something that advances or changes the RNG. So you will need to set the seed
after that package is loaded if you want reproducibility.
>>
>> As Bill Dunlap points out, parallel can introduce much more complicated
issues. If you are in fact using parallel then we really need a new thread with
a better subject line, and the discussion will get much messier.
>>
>> The short answer is that, yes you should be able to get reproducible
results with parallel computing. If you cannot then you are almost certainly
doing something wrong. To publish you really must have reproducible results.
>>
>> In the example that Bill gave, I think the problem is that set.seed()
only resets the seed in the main thread, the nodes continue to operate with
unreset RNG. To demonstrate this to yourself you can do
>>
>> library(parallel)
>> cl <- parallel::makeCluster(3)
>> parallel::clusterCall(cl, function()set.seed(100))
>> parallel::clusterCall(cl, function()RNGkind())
>> parallel::clusterCall(cl, function()runif(2)) # similar result from all
nodes
>>                                               # [1] 0.3077661 0.2576725
>>
>> However, do *NOT* do that in real work. You will be getting the same
RNG stream from each node. If you are using random numbers and parallel you need
to read a lot more, and probably consider a variant of the
"L'Ecuyer" generator or something designed for parallel computing.
>>
>> One special point I will mention because it does not seem to be widely
>> appreciated: the number of nodes affects the random stream, so
recording the number of compute nodes along with the RNG and seed information is
important for reproducible results. This has the unfortunate consequence that an
experiment cannot be reproduced on a larger cluster. (If anyone knows
differently I would very much like to hear.)
>>
>> Paul Gilbert
>>
>>
>> > Hi all,
>> >
>> > For some odd reason when running na?ve bayes, k-NN, etc., I get
slightly
>> > different results (e.g., error rates, classification
probabilities) from
>> > run
>> > to run even though I am using the same random seed.
>> >
>> > Nothing else (input-wise) is changing, but my results are somewhat
>> > different
>> > from run to run.  The only randomness should be in the
partitioning, and I
>> > have set the seed before this point.
>> >
>> > My question simply is:  should the location of the set.seed
command matter,
>> > provided that it is applied before any commands which involve
randomness
>> > (such as partitioning)?
>> >
>> > If you need to see the code, it is below:
>> >
>> > Thank you,
>> > Gary
>> >
>> >
>> > A.      Separate the original (in-sample) data from the new
(out-of-sample)
>> > data.  Set a random seed.
>> >
>> >> InvestTech <- as.data.frame(InvestTechRevised)
>> >> outOfSample <- InvestTech[5001:nrow(InvestTech), ]
>> >> InvestTech <- InvestTech[1:5000, ]
>> >> set.seed(654321)
>> >
>> > B.      Install and load the caret, ggplot2 and e1071 packages.
>> >
>> >> install.packages(?caret?)
>> >> install.packages(?ggplot2?)
>> >> install.packages(?e1071?)
>> >> library(caret)
>> >> library(ggplot2)
>> >> library(e1071)
>> >
>> > C.      Bin the predictor variables with approximately equal
counts using
>> > the cut_number function from the ggplot2 package.  We will use 20
bins.
>> >
>> >> InvestTech[, 1] <- cut_number(InvestTech[, 1], n = 20)
>> >> InvestTech[, 2] <- cut_number(InvestTech[, 2], n = 20)
>> >> outOfSample[, 1] <- cut_number(outOfSample[, 1], n = 20)
>> >> outOfSample[, 2] <- cut_number(outOfSample[, 2], n = 20)
>> >
>> > D.      Partition the original (in-sample) data into 60% training
and 40%
>> > validation sets.
>> >
>> >> n <- nrow(InvestTech)
>> >> train <- sample(1:n, size = 0.6 * n, replace = FALSE)
>> >> InvestTechTrain <- InvestTech[train, ]
>> >> InvestTechVal <- InvestTech[-train, ]
>> >
>> > E.      Use the naiveBayes function in the e1071 package to fit
the model.
>> >
>> >> model <- naiveBayes(`Purchase (1=yes, 0=no)` ~ ., data =
InvestTechTrain)
>> >> prob <- predict(model, newdata = InvestTechVal, type =
?raw?)
>> >> pred <- ifelse(prob[, 2] >= 0.3, 1, 0)
>> >
>> > F.      Use the confusionMatrix function in the caret package to
output the
>> > confusion matrix.
>> >
>> >> confMtr <- confusionMatrix(pred,unlist(InvestTechVal[,
3]),mode >> > ?everything?, positive = ?1?)
>> >> accuracy <- confMtr$overall[1]
>> >> valError <- 1 ? accuracy
>> >> confMtr
>> >
>> > G.      Classify the 18 new (out-of-sample) readers using the
following
>> > code.
>> >> prob <- predict(model, newdata = outOfSample, type = ?raw?)
>> >> pred <- ifelse(prob[, 2] >= 0.3, 1, 0)
>> >> cbind(pred, prob, outOfSample[, -3])
>> >
>> >
>> >
>>
>>
>> If your computations involve the parallel package then set.seed(seed)
>> may not produce repeatable results.  E.g.,
>>
>> > cl <- parallel::makeCluster(3)  # Create cluster with 3 nodes
on local
>> host
>> > set.seed(100); runif(2)
>> [1] 0.3077661 0.2576725
>> > parallel::parSapply(cl, 101:103, function(i)runif(2, i, i+1))
>>         [,1]     [,2]     [,3]
>> [1,] 101.7779 102.5308 103.3459
>> [2,] 101.8128 102.6114 103.9102
>> >
>> > set.seed(100); runif(2)
>> [1] 0.3077661 0.2576725
>> > parallel::parSapply(cl, 101:103, function(i)runif(2, i, i+1))
>>         [,1]     [,2]     [,3]
>> [1,] 101.1628 102.9643 103.2684
>> [2,] 101.9205 102.6937 103.7907
>>
>>
>> Bill Dunlap
>> TIBCO Software
>> wdunlap tibco.com
>
> ______________________________________________
> R-help at r-project.org mailing list -- To UNSUBSCRIBE and more, see
> https://stat.ethz.ch/mailman/listinfo/r-help
> PLEASE do read the posting guide
http://www.R-project.org/posting-guide.html
> and provide commented, minimal, self-contained, reproducible code.

Duncan Murdoch

2018-Mar-04 23:23 UTC

head link

[R] Random Seed Location

On 04/03/2018 5:54 PM, Henrik Bengtsson wrote:> The following helps identify when .GlobalEnv$.Random.seed has changed:
> 
> rng_tracker <- local({
>    last <- .GlobalEnv$.Random.seed
>    function(...) {
>      curr <- .GlobalEnv$.Random.seed
>      if (!identical(curr, last)) {
>        warning(".Random.seed changed")
>        last <<- curr
>      }
>      TRUE
>    }
> })
> 
> addTaskCallback(rng_tracker, name = "RNG tracker")
> 
> 
> EXAMPLE:
> 
>> sample.int(1L)
> [1] 1
> Warning: .Random.seed changed
> 
> This will help you find for instance:
> 
> ## Loading ggplot2 does not affect the RNG seed
>> loadNamespace("ggplot2")
> <environment: namespace:ggplot2>
> 
> ## But attaching it does
>> library("ggplot2")
> Warning: .Random.seed changed
> 
> which reveals:
> 
>> ggplot2:::.onAttach
> function (...)
> {
>      if (!interactive() || stats::runif(1) > 0.1)
>          return()
>      tips <- c("Need help? Try the ggplot2 mailing list:
> http://groups.google.com/group/ggplot2.",
>          "Find out what's changed in ggplot2 at
> http://github.com/tidyverse/ggplot2/releases.",
>          "Use suppressPackageStartupMessages() to eliminate package
> startup messages.",
>          "Stackoverflow is a great place to get help:
> http://stackoverflow.com/tags/ggplot2.",
>          "Need help getting started? Try the cookbook for R:
> http://www.cookbook-r.com/Graphs/",
>          "Want to understand how all the pieces fit together? Buy the
> ggplot2 book: http://ggplot2.org/book/")
>      tip <- sample(tips, 1)
>      packageStartupMessage(paste(strwrap(tip), collapse = "\n"))
> }
> <environment: namespace:ggplot2>
> 
> There are probably many case of this in different R packages.
> 
> 
> R WISH:
> 
> There could be a
> 
> preserveRandomSeed({
>    tip <- sample(tips, 1)
> })
> 
> function in R for these type of random needs where true random
> properties are non-critical.  This type of
> "draw-a-random-number-and-reset-the-seed" is for instance used in
> parallel:::initDefaultClusterOptions() which is called when the
> 'parallel' package is loaded:
> 
> seed <- .GlobalEnv$.Random.seed
> ran1 <- sample.int(.Machine$integer.max - 1L, 1L) / .Machine$integer.max
> port <- 11000 + 1000 * ((ran1 + unclass(Sys.time()) / 300) %% 1)
> if(is.null(seed)) ## there was none, initially
> rm( ".Random.seed", envir = .GlobalEnv, inherits = FALSE)
> else # reset
> assign(".Random.seed", seed, envir = .GlobalEnv, inherits =
FALSE)
An issue is that .Random.seed doesn't contain the full state of the RNG 
system, so restoring it doesn't necessarily lead to an identical 
sequence of output.  The only way to guarantee the sequence will repeat 
is to call set.seed(n), and that only leads to a tiny fraction of 
possible states.

Expanding .Random.seed so that it does contain the full state would be a 
good idea, and that would make your preserveRandomSeed really easy to write.

Here's a demo that .Random.seed is not enough:

 > set.seed(123, normal.kind = "Box-Muller")
 > rnorm(1)
[1] -0.1613431
 > save <- .Random.seed
 > rnorm(1)
[1] 0.6706031
 > .Random.seed <- save
 > rnorm(1)
[1] -0.4194403

If .Random.seed were the complete state, the 2nd and 3rd rnorm results 
would be the same.

Duncan Murdoch

> 
> /Henrik
> 
> On Sun, Mar 4, 2018 at 1:40 PM, Gary Black <gwblack001 at
sbcglobal.net> wrote:
>> Thank you, everybody, who replied!  I appreciate your valuable advise! 
I will move the location of the set.seed() command to after all packages have
been installed and loaded.
>>
>> Best regards,
>> Gary
>>
>> Sent from my iPad
>>
>>> On Mar 4, 2018, at 12:18 PM, Paul Gilbert <pgilbert902 at
gmail.com> wrote:
>>>
>>> On Mon, Feb 26, 2018 at 3:25 PM, Gary Black <gwblack001 at
sbcglobal.net>
>>> wrote:
>>>
>>> (Sorry to be a bit slow responding.)
>>>
>>> You have not supplied a complete example, which would be good in
this case because what you are suggesting could be a serious bug in R or a
package. Serious journals require reproducibility these days. For example, JSS
is very clear on this point.
>>>
>>> To your question
>>>> My question simply is:  should the location of the set.seed
command matter,
>>>> provided that it is applied before any commands which involve
randomness
>>>> (such as partitioning)?
>>>
>>> the answer is no, it should not matter. But the proviso is
important.
>>>
>>> You can determine where things are messing up using something like
>>>
>>> set.seed(654321)
>>> zk <- RNGkind()        # [1] "Mersenne-Twister"
"Inversion"
>>> zk
>>> z <- runif(2)
>>> z
>>> set.seed(654321)
>>>
>>> #  install.packages(c('caret', 'ggplot2',
'e1071'))
>>> library(caret)
>>> all(runif(2)  == z)   # should be true but it is not always
>>>
>>> set.seed(654321)
>>> library(ggplot2)
>>> all(runif(2)  == z)   # should be true
>>>
>>> set.seed(654321)
>>> library(e1071)
>>> all(runif(2)  == z)   # should be true
>>>
>>> all(RNGkind() == zk)  # should be true
>>>
>>> On my computer package caret seems to sometimes, but not always, do
something that advances or changes the RNG. So you will need to set the seed
after that package is loaded if you want reproducibility.
>>>
>>> As Bill Dunlap points out, parallel can introduce much more
complicated issues. If you are in fact using parallel then we really need a new
thread with a better subject line, and the discussion will get much messier.
>>>
>>> The short answer is that, yes you should be able to get
reproducible results with parallel computing. If you cannot then you are almost
certainly doing something wrong. To publish you really must have reproducible
results.
>>>
>>> In the example that Bill gave, I think the problem is that
set.seed() only resets the seed in the main thread, the nodes continue to
operate with unreset RNG. To demonstrate this to yourself you can do
>>>
>>> library(parallel)
>>> cl <- parallel::makeCluster(3)
>>> parallel::clusterCall(cl, function()set.seed(100))
>>> parallel::clusterCall(cl, function()RNGkind())
>>> parallel::clusterCall(cl, function()runif(2)) # similar result from
all nodes
>>>                                                # [1] 0.3077661
0.2576725
>>>
>>> However, do *NOT* do that in real work. You will be getting the
same RNG stream from each node. If you are using random numbers and parallel you
need to read a lot more, and probably consider a variant of the
"L'Ecuyer" generator or something designed for parallel computing.
>>>
>>> One special point I will mention because it does not seem to be
widely
>>> appreciated: the number of nodes affects the random stream, so
recording the number of compute nodes along with the RNG and seed information is
important for reproducible results. This has the unfortunate consequence that an
experiment cannot be reproduced on a larger cluster. (If anyone knows
differently I would very much like to hear.)
>>>
>>> Paul Gilbert
>>>
>>>
>>>> Hi all,
>>>>
>>>> For some odd reason when running na?ve bayes, k-NN, etc., I get
slightly
>>>> different results (e.g., error rates, classification
probabilities) from
>>>> run
>>>> to run even though I am using the same random seed.
>>>>
>>>> Nothing else (input-wise) is changing, but my results are
somewhat
>>>> different
>>>> from run to run.  The only randomness should be in the
partitioning, and I
>>>> have set the seed before this point.
>>>>
>>>> My question simply is:  should the location of the set.seed
command matter,
>>>> provided that it is applied before any commands which involve
randomness
>>>> (such as partitioning)?
>>>>
>>>> If you need to see the code, it is below:
>>>>
>>>> Thank you,
>>>> Gary
>>>>
>>>>
>>>> A.      Separate the original (in-sample) data from the new
(out-of-sample)
>>>> data.  Set a random seed.
>>>>
>>>>> InvestTech <- as.data.frame(InvestTechRevised)
>>>>> outOfSample <- InvestTech[5001:nrow(InvestTech), ]
>>>>> InvestTech <- InvestTech[1:5000, ]
>>>>> set.seed(654321)
>>>>
>>>> B.      Install and load the caret, ggplot2 and e1071 packages.
>>>>
>>>>> install.packages(?caret?)
>>>>> install.packages(?ggplot2?)
>>>>> install.packages(?e1071?)
>>>>> library(caret)
>>>>> library(ggplot2)
>>>>> library(e1071)
>>>>
>>>> C.      Bin the predictor variables with approximately equal
counts using
>>>> the cut_number function from the ggplot2 package.  We will use
20 bins.
>>>>
>>>>> InvestTech[, 1] <- cut_number(InvestTech[, 1], n = 20)
>>>>> InvestTech[, 2] <- cut_number(InvestTech[, 2], n = 20)
>>>>> outOfSample[, 1] <- cut_number(outOfSample[, 1], n = 20)
>>>>> outOfSample[, 2] <- cut_number(outOfSample[, 2], n = 20)
>>>>
>>>> D.      Partition the original (in-sample) data into 60%
training and 40%
>>>> validation sets.
>>>>
>>>>> n <- nrow(InvestTech)
>>>>> train <- sample(1:n, size = 0.6 * n, replace = FALSE)
>>>>> InvestTechTrain <- InvestTech[train, ]
>>>>> InvestTechVal <- InvestTech[-train, ]
>>>>
>>>> E.      Use the naiveBayes function in the e1071 package to fit
the model.
>>>>
>>>>> model <- naiveBayes(`Purchase (1=yes, 0=no)` ~ ., data =
InvestTechTrain)
>>>>> prob <- predict(model, newdata = InvestTechVal, type =
?raw?)
>>>>> pred <- ifelse(prob[, 2] >= 0.3, 1, 0)
>>>>
>>>> F.      Use the confusionMatrix function in the caret package
to output the
>>>> confusion matrix.
>>>>
>>>>> confMtr <- confusionMatrix(pred,unlist(InvestTechVal[,
3]),mode >>>> ?everything?, positive = ?1?)
>>>>> accuracy <- confMtr$overall[1]
>>>>> valError <- 1 ? accuracy
>>>>> confMtr
>>>>
>>>> G.      Classify the 18 new (out-of-sample) readers using the
following
>>>> code.
>>>>> prob <- predict(model, newdata = outOfSample, type =
?raw?)
>>>>> pred <- ifelse(prob[, 2] >= 0.3, 1, 0)
>>>>> cbind(pred, prob, outOfSample[, -3])
>>>>
>>>>
>>>>
>>>
>>>
>>> If your computations involve the parallel package then
set.seed(seed)
>>> may not produce repeatable results.  E.g.,
>>>
>>>> cl <- parallel::makeCluster(3)  # Create cluster with 3
nodes on local
>>> host
>>>> set.seed(100); runif(2)
>>> [1] 0.3077661 0.2576725
>>>> parallel::parSapply(cl, 101:103, function(i)runif(2, i, i+1))
>>>          [,1]     [,2]     [,3]
>>> [1,] 101.7779 102.5308 103.3459
>>> [2,] 101.8128 102.6114 103.9102
>>>>
>>>> set.seed(100); runif(2)
>>> [1] 0.3077661 0.2576725
>>>> parallel::parSapply(cl, 101:103, function(i)runif(2, i, i+1))
>>>          [,1]     [,2]     [,3]
>>> [1,] 101.1628 102.9643 103.2684
>>> [2,] 101.9205 102.6937 103.7907
>>>
>>>
>>> Bill Dunlap
>>> TIBCO Software
>>> wdunlap tibco.com
>>
>> ______________________________________________
>> R-help at r-project.org mailing list -- To UNSUBSCRIBE and more, see
>> https://stat.ethz.ch/mailman/listinfo/r-help
>> PLEASE do read the posting guide
http://www.R-project.org/posting-guide.html
>> and provide commented, minimal, self-contained, reproducible code.
> 
> ______________________________________________
> R-help at r-project.org mailing list -- To UNSUBSCRIBE and more, see
> https://stat.ethz.ch/mailman/listinfo/r-help
> PLEASE do read the posting guide
http://www.R-project.org/posting-guide.html
> and provide commented, minimal, self-contained, reproducible code.
>

Henrik Bengtsson

2018-Mar-05 00:14 UTC

head link

[R] Random Seed Location

On Sun, Mar 4, 2018 at 3:23 PM, Duncan Murdoch <murdoch.duncan at
gmail.com> wrote:> On 04/03/2018 5:54 PM, Henrik Bengtsson wrote:
>>
>> The following helps identify when .GlobalEnv$.Random.seed has changed:
>>
>> rng_tracker <- local({
>>    last <- .GlobalEnv$.Random.seed
>>    function(...) {
>>      curr <- .GlobalEnv$.Random.seed
>>      if (!identical(curr, last)) {
>>        warning(".Random.seed changed")
>>        last <<- curr
>>      }
>>      TRUE
>>    }
>> })
>>
>> addTaskCallback(rng_tracker, name = "RNG tracker")
>>
>>
>> EXAMPLE:
>>
>>> sample.int(1L)
>>
>> [1] 1
>> Warning: .Random.seed changed
>>
>> This will help you find for instance:
>>
>> ## Loading ggplot2 does not affect the RNG seed
>>>
>>> loadNamespace("ggplot2")
>>
>> <environment: namespace:ggplot2>
>>
>> ## But attaching it does
>>>
>>> library("ggplot2")
>>
>> Warning: .Random.seed changed
>>
>> which reveals:
>>
>>> ggplot2:::.onAttach
>>
>> function (...)
>> {
>>      if (!interactive() || stats::runif(1) > 0.1)
>>          return()
>>      tips <- c("Need help? Try the ggplot2 mailing list:
>> http://groups.google.com/group/ggplot2.",
>>          "Find out what's changed in ggplot2 at
>> http://github.com/tidyverse/ggplot2/releases.",
>>          "Use suppressPackageStartupMessages() to eliminate
package
>> startup messages.",
>>          "Stackoverflow is a great place to get help:
>> http://stackoverflow.com/tags/ggplot2.",
>>          "Need help getting started? Try the cookbook for R:
>> http://www.cookbook-r.com/Graphs/",
>>          "Want to understand how all the pieces fit together? Buy
the
>> ggplot2 book: http://ggplot2.org/book/")
>>      tip <- sample(tips, 1)
>>      packageStartupMessage(paste(strwrap(tip), collapse =
"\n"))
>> }
>> <environment: namespace:ggplot2>
>>
>> There are probably many case of this in different R packages.
>>
>>
>> R WISH:
>>
>> There could be a
>>
>> preserveRandomSeed({
>>    tip <- sample(tips, 1)
>> })
>>
>> function in R for these type of random needs where true random
>> properties are non-critical.  This type of
>> "draw-a-random-number-and-reset-the-seed" is for instance
used in
>> parallel:::initDefaultClusterOptions() which is called when the
>> 'parallel' package is loaded:
>>
>> seed <- .GlobalEnv$.Random.seed
>> ran1 <- sample.int(.Machine$integer.max - 1L, 1L) /
.Machine$integer.max
>> port <- 11000 + 1000 * ((ran1 + unclass(Sys.time()) / 300) %% 1)
>> if(is.null(seed)) ## there was none, initially
>> rm( ".Random.seed", envir = .GlobalEnv, inherits = FALSE)
>> else # reset
>> assign(".Random.seed", seed, envir = .GlobalEnv, inherits =
FALSE)
>
>
> An issue is that .Random.seed doesn't contain the full state of the RNG
> system, so restoring it doesn't necessarily lead to an identical
sequence of
> output.  The only way to guarantee the sequence will repeat is to call
> set.seed(n), and that only leads to a tiny fraction of possible states.
>
> Expanding .Random.seed so that it does contain the full state would be a
> good idea, and that would make your preserveRandomSeed really easy to
write.
>
> Here's a demo that .Random.seed is not enough:
>
>> set.seed(123, normal.kind = "Box-Muller")
>> rnorm(1)
> [1] -0.1613431
>> save <- .Random.seed
>> rnorm(1)
> [1] 0.6706031
>> .Random.seed <- save
>> rnorm(1)
> [1] -0.4194403
>
> If .Random.seed were the complete state, the 2nd and 3rd rnorm results
would
> be the same.
Ah... good point - I forgot about that "oddity", which is documented
in help(".Random.seed"):

".Random.seed saves the seed set for the uniform random-number
generator, at least for the system generators. It does not necessarily
save the state of other generators, and in particular does not save
the state of the Box?Muller normal generator. If you want to reproduce
work later, call set.seed (preferably with explicit values for kind
and normal.kind) rather than set .Random.seed."

So, this is is only for some of the RNG kinds.  Is the reason for this
limitation that it is not possible for R (not even the R internals) to
get hold of some of the RNG generators?  In other words, it is
unlikely to ever be fixed?

Since there is no bijective function to infer `x` such that
`set.seed(x)` resets the RNG state properly, the best we have is
'.Random.seed' for resetting it.  Would a start be to implement
preserveRandomSeed() for supported RNGkind():s and give an informative
warning in other cases?

/Henrik
>
> Duncan Murdoch
>
>
>
>>
>> /Henrik
>>
>> On Sun, Mar 4, 2018 at 1:40 PM, Gary Black <gwblack001 at
sbcglobal.net>
>> wrote:
>>>
>>> Thank you, everybody, who replied!  I appreciate your valuable
advise!  I
>>> will move the location of the set.seed() command to after all
packages have
>>> been installed and loaded.
>>>
>>> Best regards,
>>> Gary
>>>
>>> Sent from my iPad
>>>
>>>> On Mar 4, 2018, at 12:18 PM, Paul Gilbert <pgilbert902 at
gmail.com> wrote:
>>>>
>>>> On Mon, Feb 26, 2018 at 3:25 PM, Gary Black <gwblack001 at
sbcglobal.net>
>>>> wrote:
>>>>
>>>> (Sorry to be a bit slow responding.)
>>>>
>>>> You have not supplied a complete example, which would be good
in this
>>>> case because what you are suggesting could be a serious bug in
R or a
>>>> package. Serious journals require reproducibility these days.
For example,
>>>> JSS is very clear on this point.
>>>>
>>>> To your question
>>>>>
>>>>> My question simply is:  should the location of the set.seed
command
>>>>> matter,
>>>>> provided that it is applied before any commands which
involve
>>>>> randomness
>>>>> (such as partitioning)?
>>>>
>>>>
>>>> the answer is no, it should not matter. But the proviso is
important.
>>>>
>>>> You can determine where things are messing up using something
like
>>>>
>>>> set.seed(654321)
>>>> zk <- RNGkind()        # [1] "Mersenne-Twister"
"Inversion"
>>>> zk
>>>> z <- runif(2)
>>>> z
>>>> set.seed(654321)
>>>>
>>>> #  install.packages(c('caret', 'ggplot2',
'e1071'))
>>>> library(caret)
>>>> all(runif(2)  == z)   # should be true but it is not always
>>>>
>>>> set.seed(654321)
>>>> library(ggplot2)
>>>> all(runif(2)  == z)   # should be true
>>>>
>>>> set.seed(654321)
>>>> library(e1071)
>>>> all(runif(2)  == z)   # should be true
>>>>
>>>> all(RNGkind() == zk)  # should be true
>>>>
>>>> On my computer package caret seems to sometimes, but not
always, do
>>>> something that advances or changes the RNG. So you will need to
set the seed
>>>> after that package is loaded if you want reproducibility.
>>>>
>>>> As Bill Dunlap points out, parallel can introduce much more
complicated
>>>> issues. If you are in fact using parallel then we really need a
new thread
>>>> with a better subject line, and the discussion will get much
messier.
>>>>
>>>> The short answer is that, yes you should be able to get
reproducible
>>>> results with parallel computing. If you cannot then you are
almost certainly
>>>> doing something wrong. To publish you really must have
reproducible results.
>>>>
>>>> In the example that Bill gave, I think the problem is that
set.seed()
>>>> only resets the seed in the main thread, the nodes continue to
operate with
>>>> unreset RNG. To demonstrate this to yourself you can do
>>>>
>>>> library(parallel)
>>>> cl <- parallel::makeCluster(3)
>>>> parallel::clusterCall(cl, function()set.seed(100))
>>>> parallel::clusterCall(cl, function()RNGkind())
>>>> parallel::clusterCall(cl, function()runif(2)) # similar result
from all
>>>> nodes
>>>>                                                # [1] 0.3077661
0.2576725
>>>>
>>>> However, do *NOT* do that in real work. You will be getting the
same RNG
>>>> stream from each node. If you are using random numbers and
parallel you need
>>>> to read a lot more, and probably consider a variant of the
"L'Ecuyer"
>>>> generator or something designed for parallel computing.
>>>>
>>>> One special point I will mention because it does not seem to be
widely
>>>> appreciated: the number of nodes affects the random stream, so
recording
>>>> the number of compute nodes along with the RNG and seed
information is
>>>> important for reproducible results. This has the unfortunate
consequence
>>>> that an experiment cannot be reproduced on a larger cluster.
(If anyone
>>>> knows differently I would very much like to hear.)
>>>>
>>>> Paul Gilbert
>>>>
>>>>
>>>>> Hi all,
>>>>>
>>>>> For some odd reason when running na?ve bayes, k-NN, etc., I
get
>>>>> slightly
>>>>> different results (e.g., error rates, classification
probabilities)
>>>>> from
>>>>> run
>>>>> to run even though I am using the same random seed.
>>>>>
>>>>> Nothing else (input-wise) is changing, but my results are
somewhat
>>>>> different
>>>>> from run to run.  The only randomness should be in the
partitioning,
>>>>> and I
>>>>> have set the seed before this point.
>>>>>
>>>>> My question simply is:  should the location of the set.seed
command
>>>>> matter,
>>>>> provided that it is applied before any commands which
involve
>>>>> randomness
>>>>> (such as partitioning)?
>>>>>
>>>>> If you need to see the code, it is below:
>>>>>
>>>>> Thank you,
>>>>> Gary
>>>>>
>>>>>
>>>>> A.      Separate the original (in-sample) data from the new
>>>>> (out-of-sample)
>>>>> data.  Set a random seed.
>>>>>
>>>>>> InvestTech <- as.data.frame(InvestTechRevised)
>>>>>> outOfSample <- InvestTech[5001:nrow(InvestTech), ]
>>>>>> InvestTech <- InvestTech[1:5000, ]
>>>>>> set.seed(654321)
>>>>>
>>>>>
>>>>> B.      Install and load the caret, ggplot2 and e1071
packages.
>>>>>
>>>>>> install.packages(?caret?)
>>>>>> install.packages(?ggplot2?)
>>>>>> install.packages(?e1071?)
>>>>>> library(caret)
>>>>>> library(ggplot2)
>>>>>> library(e1071)
>>>>>
>>>>>
>>>>> C.      Bin the predictor variables with approximately
equal counts
>>>>> using
>>>>> the cut_number function from the ggplot2 package.  We will
use 20 bins.
>>>>>
>>>>>> InvestTech[, 1] <- cut_number(InvestTech[, 1], n =
20)
>>>>>> InvestTech[, 2] <- cut_number(InvestTech[, 2], n =
20)
>>>>>> outOfSample[, 1] <- cut_number(outOfSample[, 1], n =
20)
>>>>>> outOfSample[, 2] <- cut_number(outOfSample[, 2], n =
20)
>>>>>
>>>>>
>>>>> D.      Partition the original (in-sample) data into 60%
training and
>>>>> 40%
>>>>> validation sets.
>>>>>
>>>>>> n <- nrow(InvestTech)
>>>>>> train <- sample(1:n, size = 0.6 * n, replace =
FALSE)
>>>>>> InvestTechTrain <- InvestTech[train, ]
>>>>>> InvestTechVal <- InvestTech[-train, ]
>>>>>
>>>>>
>>>>> E.      Use the naiveBayes function in the e1071 package to
fit the
>>>>> model.
>>>>>
>>>>>> model <- naiveBayes(`Purchase (1=yes, 0=no)` ~ .,
data >>>>>> InvestTechTrain)
>>>>>> prob <- predict(model, newdata = InvestTechVal, type
= ?raw?)
>>>>>> pred <- ifelse(prob[, 2] >= 0.3, 1, 0)
>>>>>
>>>>>
>>>>> F.      Use the confusionMatrix function in the caret
package to output
>>>>> the
>>>>> confusion matrix.
>>>>>
>>>>>> confMtr <-
confusionMatrix(pred,unlist(InvestTechVal[, 3]),mode >>>>>
>>>>> ?everything?, positive = ?1?)
>>>>>>
>>>>>> accuracy <- confMtr$overall[1]
>>>>>> valError <- 1 ? accuracy
>>>>>> confMtr
>>>>>
>>>>>
>>>>> G.      Classify the 18 new (out-of-sample) readers using
the following
>>>>> code.
>>>>>>
>>>>>> prob <- predict(model, newdata = outOfSample, type =
?raw?)
>>>>>> pred <- ifelse(prob[, 2] >= 0.3, 1, 0)
>>>>>> cbind(pred, prob, outOfSample[, -3])
>>>>>
>>>>>
>>>>>
>>>>>
>>>>
>>>>
>>>> If your computations involve the parallel package then
set.seed(seed)
>>>> may not produce repeatable results.  E.g.,
>>>>
>>>>> cl <- parallel::makeCluster(3)  # Create cluster with 3
nodes on local
>>>>
>>>> host
>>>>>
>>>>> set.seed(100); runif(2)
>>>>
>>>> [1] 0.3077661 0.2576725
>>>>>
>>>>> parallel::parSapply(cl, 101:103, function(i)runif(2, i,
i+1))
>>>>
>>>>          [,1]     [,2]     [,3]
>>>> [1,] 101.7779 102.5308 103.3459
>>>> [2,] 101.8128 102.6114 103.9102
>>>>>
>>>>>
>>>>> set.seed(100); runif(2)
>>>>
>>>> [1] 0.3077661 0.2576725
>>>>>
>>>>> parallel::parSapply(cl, 101:103, function(i)runif(2, i,
i+1))
>>>>
>>>>          [,1]     [,2]     [,3]
>>>> [1,] 101.1628 102.9643 103.2684
>>>> [2,] 101.9205 102.6937 103.7907
>>>>
>>>>
>>>> Bill Dunlap
>>>> TIBCO Software
>>>> wdunlap tibco.com
>>>
>>>
>>> ______________________________________________
>>> R-help at r-project.org mailing list -- To UNSUBSCRIBE and more,
see
>>> https://stat.ethz.ch/mailman/listinfo/r-help
>>> PLEASE do read the posting guide
>>> http://www.R-project.org/posting-guide.html
>>> and provide commented, minimal, self-contained, reproducible code.
>>
>>
>> ______________________________________________
>> R-help at r-project.org mailing list -- To UNSUBSCRIBE and more, see
>> https://stat.ethz.ch/mailman/listinfo/r-help
>> PLEASE do read the posting guide
>> http://www.R-project.org/posting-guide.html
>> and provide commented, minimal, self-contained, reproducible code.
>>
>

Reasonably Related Threads

Search for more seemingly similar threads

R help - Mar 2018 - Random Seed Location

[R] Random Seed Location

[R] Random Seed Location

[R] Random Seed Location

Reasonably Related Threads