Hi all,

For some odd reason when running naïve Bayes, k-NN, etc., I get slightly
different results (e.g., error rates, classification probabilities) from run
to run even though I am using the same random seed.

Nothing else (input-wise) is changing, but my results are somewhat different
from run to run. The only randomness should be in the partitioning, and I
have set the seed before this point.

My question simply is: should the location of the set.seed command matter,
provided that it is applied before any commands which involve randomness
(such as partitioning)?

If you need to see the code, it is below:

Thank you,
Gary

A. Separate the original (in-sample) data from the new (out-of-sample)
data. Set a random seed.

> InvestTech <- as.data.frame(InvestTechRevised)
> outOfSample <- InvestTech[5001:nrow(InvestTech), ]
> InvestTech <- InvestTech[1:5000, ]
> set.seed(654321)

B. Install and load the caret, ggplot2 and e1071 packages.

> install.packages("caret")
> install.packages("ggplot2")
> install.packages("e1071")
> library(caret)
> library(ggplot2)
> library(e1071)

C. Bin the predictor variables with approximately equal counts using
the cut_number function from the ggplot2 package. We will use 20 bins.

> InvestTech[, 1] <- cut_number(InvestTech[, 1], n = 20)
> InvestTech[, 2] <- cut_number(InvestTech[, 2], n = 20)
> outOfSample[, 1] <- cut_number(outOfSample[, 1], n = 20)
> outOfSample[, 2] <- cut_number(outOfSample[, 2], n = 20)

D. Partition the original (in-sample) data into 60% training and 40%
validation sets.

> n <- nrow(InvestTech)
> train <- sample(1:n, size = 0.6 * n, replace = FALSE)
> InvestTechTrain <- InvestTech[train, ]
> InvestTechVal <- InvestTech[-train, ]

E. Use the naiveBayes function in the e1071 package to fit the model.

> model <- naiveBayes(`Purchase (1=yes, 0=no)` ~ ., data = InvestTechTrain)
> prob <- predict(model, newdata = InvestTechVal, type = "raw")
> pred <- ifelse(prob[, 2] >= 0.3, 1, 0)

F. Use the confusionMatrix function in the caret package to output the
confusion matrix.

> confMtr <- confusionMatrix(pred, unlist(InvestTechVal[, 3]), mode = "everything", positive = "1")
> accuracy <- confMtr$overall[1]
> valError <- 1 - accuracy
> confMtr

G. Classify the 18 new (out-of-sample) readers using the following code.

> prob <- predict(model, newdata = outOfSample, type = "raw")
> pred <- ifelse(prob[, 2] >= 0.3, 1, 0)
> cbind(pred, prob, outOfSample[, -3])

In case you don't get an answer from someone more knowledgeable:

1. I don't know.
2. But it is possible that other packages that are loaded after set.seed()
   fool with the RNG.
3. So I would call set.seed just before you invoke each random number
   generation to be safe.

Cheers,
Bert

Bert Gunter

"The trouble with having an open mind is that people keep coming along and
sticking things into it."
-- Opus (aka Berkeley Breathed in his "Bloom County" comic strip )

On Mon, Feb 26, 2018 at 3:25 PM, Gary Black <gwblack001 at sbcglobal.net> wrote:
> [...]
> My question simply is: should the location of the set.seed command matter,
> provided that it is applied before any commands which involve randomness
> (such as partitioning)?
> [...]

______________________________________________
R-help at r-project.org mailing list -- To UNSUBSCRIBE and more, see
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.
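
The advice to call set.seed immediately before each random draw can be illustrated with a small sketch. It uses only base R; the value n = 5000 mirrors the in-sample size in the original post, and the runif(1) call is a hypothetical stand-in for any intervening code that consumes random numbers (nothing in the thread confirms which call, if any, does so in the poster's script).

```r
n <- 5000

# Seed set immediately before the draw: the partition repeats exactly.
set.seed(654321)
train1 <- sample(1:n, size = 0.6 * n, replace = FALSE)

set.seed(654321)
train2 <- sample(1:n, size = 0.6 * n, replace = FALSE)

same_when_seeded_just_before <- identical(train1, train2)

# Same seed, but one random number is consumed before sample() is called.
# The stream has advanced, so the partition is no longer the same.
set.seed(654321)
invisible(runif(1))  # hypothetical stand-in for RNG-consuming code
train3 <- sample(1:n, size = 0.6 * n, replace = FALSE)

same_after_intervening_draw <- identical(train1, train3)

print(same_when_seeded_just_before)
print(same_after_intervening_draw)
```

If the two flags differ, the position of set.seed relative to other RNG-consuming code matters for that script, which is the situation point 3 guards against.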

If your computations involve the parallel package then set.seed(seed) may
not produce repeatable results. E.g.,

> cl <- parallel::makeCluster(3) # Create cluster with 3 nodes on localhost
> set.seed(100); runif(2)
[1] 0.3077661 0.2576725
> parallel::parSapply(cl, 101:103, function(i)runif(2, i, i+1))
         [,1]     [,2]     [,3]
[1,] 101.7779 102.5308 103.3459
[2,] 101.8128 102.6114 103.9102
>
> set.seed(100); runif(2)
[1] 0.3077661 0.2576725
> parallel::parSapply(cl, 101:103, function(i)runif(2, i, i+1))
         [,1]     [,2]     [,3]
[1,] 101.1628 102.9643 103.2684
[2,] 101.9205 102.6937 103.7907

Bill Dunlap
TIBCO Software
wdunlap tibco.com

On Mon, Feb 26, 2018 at 4:30 PM, Bert Gunter <bgunter.4567 at gmail.com> wrote:
> [...]
> 2. But it is possible that other packages that are loaded after set.seed()
> fool with the RNG.
> 3. So I would call set.seed just before you invoke each random number
> generation to be safe.
> [...]
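
In the session above, set.seed(100) seeds only the master process; each worker keeps its own generator state, which is why the parSapply results differ. One way to get repeatable worker output is parallel::clusterSetRNGStream, which gives each worker its own L'Ecuyer-CMRG stream derived from a single seed. The following is a minimal sketch, assuming a local cluster with the same number of workers on every run; the two-worker size and the names run1 and run2 are illustrative choices, not taken from the thread.

```r
library(parallel)

cl <- makeCluster(2)  # small local cluster; the size is an arbitrary choice

# Seed the workers' RNG streams, then draw on the workers.
clusterSetRNGStream(cl, iseed = 100)
run1 <- parSapply(cl, 101:102, function(i) runif(2, i, i + 1))

# Re-seed the workers with the same value and repeat the computation.
clusterSetRNGStream(cl, iseed = 100)
run2 <- parSapply(cl, 101:102, function(i) runif(2, i, i + 1))

stopCluster(cl)

workers_repeatable <- identical(run1, run2)
print(workers_repeatable)
```

Repeatability here depends on the work being split across the workers the same way each time, so changing the cluster size or switching to a load-balanced variant may change the results.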

I am willing to go out on that limb and say the answer to the OP's question
is yes, the random number sequence in R should be reproducible. I agree,
though, that it doesn't look like he is actually taking care not to run code
that would disturb the generator.
--
Sent from my phone. Please excuse my brevity.

On February 26, 2018 4:30:47 PM PST, Bert Gunter <bgunter.4567 at gmail.com> wrote:
>In case you don't get an answer from someone more knowledgeable:
>
>1. I don't know.
>2. But it is possible that other packages that are loaded after
>set.seed() fool with the RNG.
>3. So I would call set.seed just before you invoke each random number
>generation to be safe.
>[...]
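
Whether a given piece of code disturbs the generator can be checked directly by comparing snapshots of .Random.seed, the variable in the global environment where R stores the generator state. The sketch below is one way to do that under stated assumptions: rng_state is a hypothetical helper name, and the sort and sample calls are stand-ins for whatever sits between set.seed and the partitioning step in a real script.

```r
# Hypothetical helper: return the current RNG state from the global environment.
rng_state <- function() get(".Random.seed", envir = globalenv())

set.seed(654321)          # also guarantees that .Random.seed exists
before <- rng_state()

x <- sort(c(3, 1, 2))     # deterministic code; draws no random numbers
untouched <- identical(before, rng_state())

invisible(sample(10))     # draws random numbers and advances the generator
disturbed <- !identical(before, rng_state())

print(untouched)
print(disturbed)
```

Wrapping a suspect step, such as a library() call or a binning function, between two snapshots in the same way would show whether it advances the stream before the partitioning is reached.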