Hello,

I'm trying to find out the optimal number of splits (the mtry parameter) for a
randomForest classification. The classification is binary and there are 32
explanatory variables (mostly factors with up to 4 levels each, but also some
numeric variables) and 575 cases.

I've seen that although there are only 32 explanatory variables, the best
classification performance is reached when choosing mtry=80. How is it possible
that more variables can be used than there are columns in the data frame?

Thanks for your help and kind regards,

Arne
> From: Arne.Muller at sanofi-aventis.com
>
> I've seen that although there are only 32 explanatory variables, the best
> classification performance is reached when choosing mtry=80. How is it
> possible that more variables can be used than there are columns in the
> data frame?

It's not. The code for randomForest.default() has:

    ## Make sure mtry is in reasonable range.
    mtry <- max(1, min(p, round(mtry)))

so it silently sets mtry to the number of predictors if it's too large. As an
example:

    > library(randomForest)
    randomForest 4.5-12
    Type rfNews() to see new features/changes/bug fixes.
    > iris.rf = randomForest(Species ~ ., iris, mtry=10)
    > iris.rf$mtry
    [1] 4

I should probably add a warning in such cases...
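In the meantime, a user-level wrapper along these lines would warn instead of
clamping silently (a sketch only; rf_checked() is a hypothetical name, not part
of the package):

    ## Hypothetical wrapper: warn when mtry exceeds the number of predictors,
    ## then fall back to mtry = p, mirroring what randomForest.default() does.
    rf_checked <- function(x, y, mtry, ...) {
      p <- ncol(x)
      if (mtry > p) {
        warning("mtry (", mtry, ") is larger than the number of predictors (",
                p, "); using mtry = ", p)
        mtry <- p
      }
      randomForest::randomForest(x = x, y = y, mtry = mtry, ...)
    }

    ## With the iris predictors, mtry = 10 is reduced to 4 (with a warning).
    fit <- rf_checked(iris[, -5], iris$Species, mtry = 10)
    fit$mtry   # 4

Andy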
Arne.Muller at sanofi-aventis.com wrote:

> I've seen that although there are only 32 explanatory variables, the best
> classification performance is reached when choosing mtry=80. How is it
> possible that more variables can be used than there are columns in the
> data frame?

If some of the variables are factors, dummy variables are generated and you
get a larger number of variables later in the process.

Uwe Ligges
See the tuneRF() function in the package for an implementation of the strategy
recommended by Breiman & Cutler; a rough usage sketch is appended at the end of
this message. BTW, "randomForest" refers only to the R package; see Breiman's
web page for the notice on trademarks.

Andy

> From: Weiwei Shi
>
> Hi,
>
> I found the following lines in the documentation for Leo's random forests
> code. I am not sure whether it applies here, but it may help:
>
>   mtry0 = the number of variables to split on at each node. Default is
>   the square root of mdim. ATTENTION! DO NOT USE THE DEFAULT VALUES OF
>   MTRY0 IF YOU WANT TO OPTIMIZE THE PERFORMANCE OF RANDOM FORESTS. TRY
>   DIFFERENT VALUES - GROW 20-30 TREES, AND SELECT THE VALUE OF MTRY THAT
>   GIVES THE SMALLEST OOB ERROR RATE.
>
> mdim is the number of predictors.
>
> HTH,
> weiwei
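To make that concrete, here is a rough sketch of both the tuneRF() search and
the manual grid Breiman describes, run on a small made-up data set standing in
for the real one (the columns here are hypothetical):

    library(randomForest)
    set.seed(42)

    ## Made-up stand-in for the real data: binary response plus a mix of
    ## factor and numeric predictors.
    n <- 575
    d <- data.frame(
      y  = factor(sample(c("pos", "neg"), n, replace = TRUE)),
      f1 = factor(sample(letters[1:4], n, replace = TRUE)),
      f2 = factor(sample(letters[1:3], n, replace = TRUE)),
      x1 = rnorm(n),
      x2 = rnorm(n)
    )
    x <- d[, -1]

    ## tuneRF() starts at floor(sqrt(p)) and multiplies/divides mtry by
    ## stepFactor as long as the OOB error improves by at least `improve`.
    tuned <- tuneRF(x, d$y, ntreeTry = 100, stepFactor = 2,
                    improve = 0.01, doBest = TRUE)
    tuned$mtry

    ## The manual version of Breiman's advice: small forests over a grid of
    ## mtry values, keeping the one with the smallest OOB error rate.
    oob <- sapply(seq_len(ncol(x)), function(m) {
      rf <- randomForest(x, d$y, mtry = m, ntree = 30)
      rf$err.rate[rf$ntree, "OOB"]
    })
    which.min(oob)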
> From: Uwe Ligges
>
> If some of the variables are factors, dummy variables are generated and
> you get a larger number of variables later in the process.

No, unless the OP is using the formula interface with a version of the package
from two years or so ago. We got the first formula interface by copying and
modifying the one for svm() in e1071, and overlooked the fact that SVM needs
that treatment for factors, whereas trees do not (especially not the way the
underlying RF code handles them). This was corrected long ago.
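A quick check on a small made-up data frame (the column names are hypothetical)
shows the point: the fitted forest reports one importance row per original
predictor, with no dummy expansion.

    library(randomForest)
    set.seed(1)
    d <- data.frame(
      y  = factor(sample(c("a", "b"), 100, replace = TRUE)),
      f1 = factor(sample(letters[1:4], 100, replace = TRUE)),
      f2 = factor(sample(letters[1:3], 100, replace = TRUE)),
      x1 = rnorm(100)
    )
    fit <- randomForest(y ~ ., data = d)
    nrow(importance(fit))   # 3: one row per predictor, factors not expanded
    fit$mtry                # floor(sqrt(3)) = 1, based on 3 predictors

Cheers,
Andy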