Jia ZJ Zou
2010-May-25 09:51 UTC
[R] Need Help! Poor performance about randomForest for large data
Hi, dears,

I am processing some data with 60 columns and 286,730 rows. Most columns are numerical, and some are categorical.

It turns out that when ntree is set to the default value (500), it says "can not allocate a vector of 1.1 GB size"; and when I set ntree to a very small number like 10, it runs for hours. I use the (x,y) interface rather than (formula,data).

My code:

> sdata<-read.csv("D://zSignal Dump//XXXX//XXXX.csv")
> sdata1<-subset(sdata,select=-38)
> sdata2<-subset(sdata,select=38)
> res<-randomForest(x=sdata1,y=sdata2,ntrees=10)

Am I doing anything wrong? Or do you have other suggestions? Are there any other packages that do the same thing? I would appreciate it if anyone could help me out, thanks!

Thanks and best regards,
------------------------------------------------
Jia, Zou (邹嘉), Ph.D.
IBM Research -- China
Diamond Building, #19 Zhongguancun Software Park, 8 Dongbeiwang West Road,
Haidian District, Beijing 100193, P.R. China
Tel: +86 (10) 58748518
E-mail: jiazou@cn.ibm.com
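Two details of the call in the question may be worth checking, offered here as a sketch rather than a diagnosis. The randomForest() argument for the number of trees is spelled ntree, so ntrees=10 is not matched to it and, as far as R's argument matching goes, ends up in "..." while the default of 500 stays in effect. Also, subset(sdata, select=38) returns a one-column data frame, whereas the response is normally passed as a vector or factor. The snippet below uses only base R; grow() and toy are hypothetical stand-ins for illustration, not part of the randomForest package or the poster's data.

```r
# Hypothetical stand-in for a function that has an `ntree` argument
# followed by `...`. This is NOT randomForest(); it only demonstrates
# how R matches (or fails to match) argument names.
grow <- function(x, ntree = 500, ...) {
  extra <- names(list(...))          # names of anything swallowed by `...`
  list(ntree = ntree, ignored = extra)
}

res_wrong <- grow(1:3, ntrees = 10)  # misspelled name: falls into `...`
res_right <- grow(1:3, ntree = 10)   # exact match on the formal argument

print(res_wrong$ntree)    # default value is still in effect
print(res_wrong$ignored)  # the misspelled argument name
print(res_right$ntree)

# subset() with select= keeps the data frame structure, even for one column.
# `toy` is made-up data standing in for the poster's `sdata`.
toy   <- data.frame(a = 1:4, b = c("u", "v", "u", "v"))
y_df  <- subset(toy, select = 2)     # one-column data frame
y_vec <- factor(toy[[2]])            # plain factor, usable as a response
print(class(y_df))
print(class(y_vec))
```

If this applies to the real data, extracting the response with sdata[[38]] (wrapped in factor() for a classification target) and spelling the argument ntree would be the corresponding changes to try.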
Joris Meys
2010-May-25 11:01 UTC
[R] Need Help! Poor performance about randomForest for large data
Hi Jia,

without seeing the actual data, it's difficult to give solid advice. But it's quite normal that this runs for hours: it has to make a whole lot of decisions, and it can grow tremendously large trees with that amount of data. The error is also quite logical: you just can't store all those huge trees. Try setting the following options in randomForest:

mtry : the number of variables tried at each split. A smaller number speeds things up, but the effect will not be too big.

nodesize : the minimum size of a terminal node. By default it is 1 for classification, meaning that a tree can keep growing until every observation is in a separate leaf. In your case, this should be set way higher.

maxnodes : the maximum number of terminal nodes. Left unset, with the amount of data you have, the number of nodes skyrockets and produces huge trees (you can have more than 200,000 nodes...). There is no need for that, so you should set it to a reasonably low number.

Try this for example:

res <- randomForest(x=sdata1, y=sdata2, ntree=500, mtry=5, nodesize=100, maxnodes=60)

These trees assume that the minimum size of a group of similar observations is 100. That sounds reasonable; it still allows over 2,800 groups in a full tree. I chose the maximum number of nodes so that every variable can occur once in the tree, although it doesn't have to be this way.

If you still get errors, play a bit more with those numbers. Actually, you should do that anyway, regardless of memory and computation time. randomForest is known to have the danger of overfitting. Restricting the tree size helps avoid this and gives you a more general fit.

Cheers
Joris

--
Joris Meys
Statistical Consultant

Ghent University
Faculty of Bioscience Engineering
Department of Applied mathematics, biometrics and process control

Coupure Links 653
B-9000 Gent

tel : +32 9 264 59 87
Joris.Meys@Ugent.be
-------------------------------
Disclaimer : http://helpdesk.ugent.be/e-maildisclaimer.php
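The numbers in the reply can be checked with a little arithmetic. The sketch below uses the row and column counts from the question and takes the reply's reading that every terminal node holds at least nodesize observations; the variable names are made up for illustration, and the bounds ignore bootstrap sampling, so they are loose upper limits rather than what randomForest would actually grow.

```r
n_rows   <- 286730   # rows reported in the question
n_cols   <- 60       # columns, one of which is the response
nodesize <- 100      # value suggested in the reply
maxnodes <- 60       # value suggested in the reply

# If every leaf held exactly `nodesize` rows, this many leaves would fit.
max_leaves_by_nodesize <- floor(n_rows / nodesize)

# A binary tree with k terminal nodes has k - 1 internal splits.
splits_with_maxnodes <- maxnodes - 1
n_predictors         <- n_cols - 1

# Loose upper bound on total nodes with nodesize = 1: one leaf per row,
# plus one internal node fewer than the number of leaves.
loose_total_nodes <- 2 * n_rows - 1

print(max_leaves_by_nodesize)
print(c(splits = splits_with_maxnodes, predictors = n_predictors))
print(loose_total_nodes)
```

With maxnodes=60 there are 59 splits, which matches the 59 predictors left after removing the response column, so each variable could in principle be used once per tree.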