Hi,

I would like to use a random forest model to get an idea about which variables from a dataset may have some prognostic significance in a smallish study. The default for the number of trees seems to be 500. I tried changing the default to ntree=2000 or ntree=200 and the results appear identical. Changing mtry from mtry=5 to mtry=6 does work. I have seen the same problem on both a Windows machine and our Linux system, running R 2.8 and 2.9. Sample code follows.

Thanks in advance for help.

Mary

> m1 <- as.formula(paste("as.factor(EAD)~", paste(names(clin_b)[c(5,7,10:36)], collapse="+")))
> m1
as.factor(EAD) ~ R_AGE + R_BMI + ASCITES...1L. + EOTAXIN + GM.CSF +
    IFNa + IL.10 + IL.12.p40.p70 + IL.13 + IL.15 + IL.17 + IL.2 +
    IL.4 + IL.5 + IL.6 + IL.7 + IL.8 + IL1.RA + IL2.R + IP.10 +
    MCP.1 + MIG + MIP.1a + MIP.1b + RANTES + TNFa + Male + diagnosis +
    race

> set.seed(12345)
> rF.bsl <- randomForest(m1, data=clin_b, na.action=na.omit, mtry=6, n.tree=2000)
> rF.bsl$ntree
[1] 500
> rF.bsl$mtry
[1] 6
> print(rF.bsl)

Call:
 randomForest(formula = m1, data = clin_b, mtry = 6, n.tree = 2000, na.action = na.omit)
               Type of random forest: classification
                     Number of trees: 500
No. of variables tried at each split: 6

        OOB estimate of error rate: 39.66%
Confusion matrix:
   0 1 class.error
0 27 7   0.2058824
1 16 8   0.6666667

> set.seed(12345)
> rF.bsl <- randomForest(m1, data=clin_b, na.action=na.omit, mtry=6, n.tree=100)
> rF.bsl$ntree
[1] 500
> rF.bsl$mtry
[1] 6
> print(rF.bsl)

Call:
 randomForest(formula = m1, data = clin_b, mtry = 6, n.tree = 100, na.action = na.omit)
               Type of random forest: classification
                     Number of trees: 500
No. of variables tried at each split: 6

        OOB estimate of error rate: 39.66%
Confusion matrix:
   0 1 class.error
0 27 7   0.2058824
1 16 8   0.6666667

On Thu, Aug 13, 2009 at 11:11 PM, Mary Putt <mputt at mail.med.upenn.edu> wrote:

Hi Mary,

> I would like to use a random forest model to get an idea about which
> variables from a dataset may have some prognostic significance in a
> smallish study. The default for the number of trees seems to be 500.
> I tried changing the default to ntree=2000 or ntree=200 and the
> results appear identical. Changing mtry from mtry=5 to mtry=6 does
> work. I have seen the same problem on both a Windows machine and our
> Linux system, running R 2.8 and 2.9.

I don't think it's correct to call it a problem; it's more likely a feature! Take a look at Breiman's paper (in the "Machine Learning" journal), where he introduces random forests. I read it recently, and somewhere he explicitly mentions that ntree can often be set quite low without hurting performance. The random forest algorithm is very robust, and apparently 500 trees are usually more than enough. So you should not expect better results from 2000 trees, and using fewer trees (e.g. 200) often does not affect performance either.

Best,
Michael

--
Michael Knudsen
micknudsen at gmail.com
http://lifeofknudsen.blogspot.com/

On Fri, Aug 14, 2009 at 1:43 PM, Mary Putt <mputt at mail.med.upenn.edu> wrote:

> I'm not calling it a problem that the answer converges, i.e. that the
> algorithm is stable. But if you look at the example, even though I've
> asked for 2000 or 200 trees (ntree=2000 or ntree=200), it still gives
> me 500 trees according to the output, and identical results when you
> set the seed before the call. While results are expected to be
> similar, they should not be identical if the number of trees was
> actually changed.

Oops! You have written n.tree instead of ntree. Since n.tree is not an argument the function recognizes, the default of 500 trees was used both times, which is why the two runs are identical.

Best,
Michael

--
Michael Knudsen
micknudsen at gmail.com
http://lifeofknudsen.blogspot.com/
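
For anyone wondering why the misspelled argument raised no error: a likely explanation, assuming randomForest accepts extra arguments through `...` as many R modelling functions do, is that n.tree was simply swallowed and never used. The sketch below reproduces that behaviour in base R. fit_forest and fit_forest_strict are hypothetical stand-ins written for illustration only; they are not the randomForest source.

```r
# Hypothetical stand-in for a fitting function; NOT the randomForest source.
# It has a default for ntree and accepts further arguments through `...`.
fit_forest <- function(x, ntree = 500, ...) {
  # Anything captured by `...` that is never used is silently discarded.
  list(ntree = ntree)
}

misspelled <- fit_forest(1:10, n.tree = 2000)  # n.tree ends up in `...`
correct    <- fit_forest(1:10, ntree = 2000)   # matches the real argument

print(misspelled$ntree)  # the default, because n.tree was ignored
print(correct$ntree)

# A defensive variant (also hypothetical) that rejects unknown arguments
# instead of ignoring them.
fit_forest_strict <- function(x, ntree = 500, ...) {
  extra <- list(...)
  if (length(extra) > 0) {
    stop("unused argument(s): ", paste(names(extra), collapse = ", "))
  }
  list(ntree = ntree)
}

# Capture the error message rather than aborting the script.
strict_result <- tryCatch(
  fit_forest_strict(1:10, n.tree = 2000),
  error = function(e) conditionMessage(e)
)
print(strict_result)
```

A quick way to catch this kind of slip in practice is the check Mary already ran: inspect rF.bsl$ntree (or the "Number of trees" line in the printed output) after fitting and confirm it matches what was requested.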