Jia ZJ Zou
2010-May-25 09:51 UTC
[R] Need Help! Poor performance about randomForest for large data
Hi, dears,

I am processing some data with 60 columns and 286,730 rows. Most columns are numerical values, and some columns are categorical values.

It turns out that: when ntree is set to the default value (500), it says "cannot allocate a vector of 1.1 GB size"; and when I set ntree to a very small number like 10, it runs for hours.
I use the (x, y) interface rather than (formula, data).

My code:

> sdata<-read.csv("D://zSignal Dump//XXXX//XXXX.csv")
> sdata1<-subset(sdata,select=-38)
> sdata2<-subset(sdata,select=38)
> res<-randomForest(x=sdata1,y=sdata2,ntrees=10)

Am I doing anything wrong? Or do you have other suggestions? Are there any other packages that do the same thing?
I would appreciate it if anyone can help me out, thanks!

Thanks and best regards,
------------------------------------------------
Jia, Zou, Ph.D.
IBM Research -- China
Diamond Building, #19 Zhongguancun Software Park, 8 Dongbeiwang West Road, Haidian District, Beijing 100193, P.R. China
Tel: +86 (10) 58748518
E-mail: jiazou@cn.ibm.com
Joris Meys
2010-May-25 11:01 UTC
[R] Need Help! Poor performance about randomForest for large data
Hi Jia,
Without seeing the actual data, it's difficult to give solid advice. But it's quite normal that this runs for hours: the algorithm has to make a great many split decisions, and with that amount of data it can grow tremendously large trees.
The error is also to be expected: you simply can't store all those huge trees in memory.
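If you want an idea of how big those trees actually get, a quick sketch (assuming the sdata1/sdata2 objects from your code below, with the response extracted as a plain vector) is to grow a single tree and look at its size:

library(randomForest)
rf1 <- randomForest(x = sdata1, y = sdata2[[1]], ntree = 1)  # y as a vector, not a one-column data frame
print(object.size(rf1), units = "Mb")                        # rough footprint; scales roughly with ntree

That makes it easy to see why 500 unrestricted trees blow past your memory.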
Try setting the following options in randomForest:

mtry : the number of variables tried at each split. A smaller number speeds
things up, but the effect is not very big.

nodesize : the minimum size of terminal nodes. By default it is 1 for
classification, meaning the tree is grown until every observation ends up in
a separate leaf. In your case this should be set much higher.

maxnodes : the maximum number of terminal nodes. Again, with the amount of
data you have this number can skyrocket and thus produce huge trees (you
can end up with more than 200,000 nodes...). There is no need for that, so set
it to a reasonably low value.
Try this, for example:

res <- randomForest(x = sdata1, y = sdata2[[1]],  # y must be a vector/factor, not a data frame
                    ntree = 500, mtry = 5, nodesize = 100, maxnodes = 60)
With nodesize = 100, the trees assume that the smallest group of similar
observations contains 100 cases. That sounds reasonable, and it still gives you
over 2,800 terminal groups for a fully grown tree (286,730 / 100 ≈ 2,867). I chose
the maximum number of nodes so that every variable can occur roughly once in the
tree, although it doesn't have to be that way.
If you still get errors, play a bit more with those numbers.
Actually, you should do that anyway, regardless of memory and computation
time. Random forests carry a risk of overfitting when the trees are grown too
deep; restricting the tree size avoids this and gives you a more general fit.
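As a minimal sketch of that tuning (assuming the response sdata2[[1]] is a factor, so this is classification; for a numeric response use the mse component instead of err.rate), you could compare the out-of-bag error over a small grid of settings:

library(randomForest)
y <- sdata2[[1]]                              # response as a vector/factor
settings <- expand.grid(nodesize = c(50, 100, 200),
                        maxnodes = c(30, 60, 120))
settings$oob <- NA
for (i in seq_len(nrow(settings))) {
  rf <- randomForest(x = sdata1, y = y, ntree = 100, mtry = 5,
                     nodesize = settings$nodesize[i],
                     maxnodes = settings$maxnodes[i])
  settings$oob[i] <- tail(rf$err.rate[, "OOB"], 1)  # OOB error of the full forest
}
settings[order(settings$oob), ]               # lowest OOB error first

The grid values are only examples; pick whatever keeps memory use and run time acceptable.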
Cheers
Joris
On Tue, May 25, 2010 at 11:51 AM, Jia ZJ Zou <jiazou@cn.ibm.com> wrote:
> Hi, dears,
>
> I am processing some data with 60 columns, and 286,730 rows.
> Most columns are numerical value, and some columns are categorical value.
>
> It turns out that: when ntree sets to the default value (500), it says
> "can not allocate a vector of 1.1 GB size"; And when I set ntree to be a
> very small number like 10, it will run for hours.
> I use the (x,y) rather than the (formula,data).
>
> My code:
>
> > sdata<-read.csv("D://zSignal Dump//XXXX//XXXX.csv")
> > sdata1<-subset(sdata,select=-38)
> > sdata2<-subset(sdata,select=38)
> > res<-randomForest(x=sdata1,y=sdata2,ntrees=10)
>
>
> Am I doing anything wrong? Or do you have other suggestions? Are there any
> other packages to do the same thing?
> I will appreciate if anyone can help me out, thanks!
>
>
> Thanks and Best regards,
> ------------------------------------------------
> Jia, Zou, Ph.D.
> IBM Research -- China
> Diamond Building, #19 Zhongguancun Software Park, 8 Dongbeiwang West Road,
> Haidian District, Beijing 100193, P.R. China
> Tel: +86 (10) 58748518
> E-mail: jiazou@cn.ibm.com
>
>
> ______________________________________________
> R-help@r-project.org mailing list
> https://stat.ethz.ch/mailman/listinfo/r-help
> PLEASE do read the posting guide
> http://www.R-project.org/posting-guide.html
> and provide commented, minimal, self-contained, reproducible code.
>
>
--
Joris Meys
Statistical Consultant
Ghent University
Faculty of Bioscience Engineering
Department of Applied mathematics, biometrics and process control
Coupure Links 653
B-9000 Gent
tel : +32 9 264 59 87
Joris.Meys@Ugent.be
-------------------------------
Disclaimer : http://helpdesk.ugent.be/e-maildisclaimer.php