David L. Van Brunt, Ph.D.
2007-Apr-29 01:13 UTC
[R] randomForest gives different results for formula call v. x, y methods. Why?
Just out of curiosity, I took the default "iris" example in the RF help file, but seeing the admonition against using the formula interface for large data sets, I wanted to play around a bit to see how the various options affected the output. I found something interesting that I couldn't find documentation for.

Just like the example:

> set.seed(12) # to be sure I have reproducibility
> form.rf<-randomForest(Species ~ ., data=iris)
> form.rf

Call:
 randomForest(formula = Species ~ ., data = iris)
               Type of random forest: classification
                     Number of trees: 500
No. of variables tried at each split: 2

        OOB estimate of error rate: 4.67%
Confusion matrix:
           setosa versicolor virginica class.error
setosa         50          0         0        0.00
versicolor      0         47         3        0.06
virginica       0          4        46        0.08

> long.rf<-randomForest(x=iris[,1:4],y=iris[,5])
> long.rf

Call:
 randomForest(x = iris[, 1:4], y = iris[, 5])
               Type of random forest: classification
                     Number of trees: 500
No. of variables tried at each split: 2

        OOB estimate of error rate: 4%
Confusion matrix:
           setosa versicolor virginica class.error
setosa         50          0         0        0.00
versicolor      0         47         3        0.06
virginica       0          3        47        0.06

(Now, if I had non-contiguous columns for predictors, I'd have to call it this way.)

> long2.rf<-randomForest(x=iris[,c(1:4)],y=iris[,5])
> long2.rf

Call:
 randomForest(x = iris[, c(1:4)], y = iris[, 5])
               Type of random forest: classification
                     Number of trees: 500
No. of variables tried at each split: 2

        OOB estimate of error rate: 5.33%
Confusion matrix:
           setosa versicolor virginica class.error
setosa         50          0         0        0.00
versicolor      0         47         3        0.06
virginica       0          5        45        0.10

Any idea why these two should give different results? I can only figure that the seed, even though it's set, somehow gets altered by the use of a formula.

> long3.rf<-randomForest(x=iris[,c(1,2,3,4)],y=iris[,5])
> long3.rf

Call:
 randomForest(x = iris[, c(1, 2, 3, 4)], y = iris[, 5])
               Type of random forest: classification
                     Number of trees: 500
No. of variables tried at each split: 2

        OOB estimate of error rate: 4.67%
Confusion matrix:
           setosa versicolor virginica class.error
setosa         50          0         0        0.00
versicolor      0         47         3        0.06
virginica       0          4        46        0.08

Either that or I'm calling it wrong in the long example, or else there's a bug.

Not a life-threatening situation, but I am curious as to the mechanics of this. I use that sort of column identification all the time and it seems to work OK, but here I get different results (form.rf v. long.rf or long2.rf) or not (form.rf v. long3.rf) depending on how I call the function.

Any insights?

--
---------------------------------------
David L. Van Brunt, Ph.D.
mailto:dlvanbrunt@gmail.com

"If Tyranny and Oppression come to this land, it will be in the guise of
fighting a foreign enemy."
--James Madison
Gavin Simpson
2007-Apr-29 13:38 UTC
[R] randomForest gives different results for formula call v. x, y methods. Why?
On Sat, 2007-04-28 at 21:13 -0400, David L. Van Brunt, Ph.D. wrote:
> Just out of curiosity, I took the default "iris" example in the RF
> helpfile...
> but seeing the admonition against using the formula interface for large data
> sets, I wanted to play around a bit to see how the various options affected
> the output. Found something interesting I couldn't find documentation for...
>
> Just like the example...
> > set.seed(12) # to be sure I have reproducibility

No differences between runs for me on FC4 using R 2.4.1 and 2.5.0 with:

> require(randomForest)
Loading required package: randomForest
randomForest 4.5-18

*if* I reset the seed before each call to randomForest. Your example code doesn't seem to be resetting the random seed before each run. As such, each run starts from a different state of the random number generator, so it draws different bootstrap samples and tries different candidate variables at each split.

E.g. the runs are all the same with a reset seed:

> set.seed(12)
> randomForest(Species ~ ., data=iris)

Call:
 randomForest(formula = Species ~ ., data = iris)
               Type of random forest: classification
                     Number of trees: 500
No. of variables tried at each split: 2

        OOB estimate of error rate: 4%
Confusion matrix:
           setosa versicolor virginica class.error
setosa         50          0         0        0.00
versicolor      0         47         3        0.06
virginica       0          3        47        0.06

> set.seed(12)
> randomForest(x=iris[,1:4],y=iris[,5])

Call:
 randomForest(x = iris[, 1:4], y = iris[, 5])
               Type of random forest: classification
                     Number of trees: 500
No. of variables tried at each split: 2

        OOB estimate of error rate: 4%
Confusion matrix:
           setosa versicolor virginica class.error
setosa         50          0         0        0.00
versicolor      0         47         3        0.06
virginica       0          3        47        0.06

> set.seed(12)
> randomForest(x=iris[,c(1:4)],y=iris[,5])

Call:
 randomForest(x = iris[, c(1:4)], y = iris[, 5])
               Type of random forest: classification
                     Number of trees: 500
No. of variables tried at each split: 2

        OOB estimate of error rate: 4%
Confusion matrix:
           setosa versicolor virginica class.error
setosa         50          0         0        0.00
versicolor      0         47         3        0.06
virginica       0          3        47        0.06

> set.seed(12)
> randomForest(x=iris[,c(1,2,3,4)],y=iris[,5])

Call:
 randomForest(x = iris[, c(1, 2, 3, 4)], y = iris[, 5])
               Type of random forest: classification
                     Number of trees: 500
No. of variables tried at each split: 2

        OOB estimate of error rate: 4%
Confusion matrix:
           setosa versicolor virginica class.error
setosa         50          0         0        0.00
versicolor      0         47         3        0.06
virginica       0          3        47        0.06

HTH

G

--
%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%
 Gavin Simpson                 [t] +44 (0)20 7679 0522
 ECRC                          [f] +44 (0)20 7679 0565
 UCL Department of Geography
 Pearson Building              [e] gavin.simpsonATNOSPAMucl.ac.uk
 Gower Street
 London, UK                    [w] http://www.ucl.ac.uk/~ucfagls/
 WC1E 6BT                      [w] http://www.freshwaters.org.uk/
%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%
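
The point about resetting the seed can be sketched in a few lines of R. This is only an illustration under stated assumptions: `with_seed` is a hypothetical helper written for this example (it is not part of randomForest), and the randomForest comparison only runs if the package is installed. The base-R part shows that a second random draw continues from the advanced RNG state unless the seed is reset first.

```r
# Hypothetical helper (not part of randomForest): reset the RNG seed and
# only then evaluate the expression, so each call starts from the same state.
with_seed <- function(seed, expr) {
  set.seed(seed)
  force(expr)  # lazy evaluation: expr is evaluated here, after set.seed()
}

# Base-R illustration: one set.seed() followed by two draws.
set.seed(12)
a <- sample(150)   # first draw after seeding
b <- sample(150)   # second draw continues from the advanced RNG state

# Resetting the seed reproduces the first draw.
set.seed(12)
a_again <- sample(150)

same_after_reset   <- identical(a, a_again)
same_without_reset <- identical(a, b)

# The same idea applied to the calls from the thread, if randomForest is available.
if (requireNamespace("randomForest", quietly = TRUE)) {
  form_rf <- with_seed(12, randomForest::randomForest(Species ~ ., data = iris))
  xy_rf   <- with_seed(12, randomForest::randomForest(x = iris[, 1:4], y = iris[, 5]))
  # Compare the out-of-bag confusion matrices of the two fits.
  print(all.equal(form_rf$confusion, xy_rf$confusion))
}
```

With the seed reset before each fit, the formula and x/y calls would be expected to agree, as in the runs above; without the reset, each fit consumes random numbers and the next one starts somewhere else.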