Hui Han
2004-Apr-12 18:11 UTC
[R] Random Forest:how to do an automatic rerun using only the important variables
Hi,

I am using the Random Forest Package, and want to do an automatic rerun using only those variables that were most important in the original run. Is there anybody who has experience with this and can give me helpful suggestions?

Best regards,

Hui Han
Department of Computer Science and Engineering,
The Pennsylvania State University
University Park, PA, 16802
email: hhan at cse.psu.edu
homepage: http://www.cse.psu.edu/~hhan
Liaw, Andy
2004-Apr-12 20:33 UTC
[R] Random Forest:how to do an automatic rerun using only the important variables
That's the advantage of having an R interface to RF: you can do such automation rather easily. E.g.,

    twoStageRF <- function(x, y, nVar = round(0.1 * ncol(x)), ...) {
        # First pass: fit on all variables and pull one importance measure.
        # (Which measure column 3 holds depends on the package version.)
        imp <- randomForest(x, y, importance = TRUE, ...)$importance[, 3]
        # The importance of the nVar-th ranked variable is the cutoff.
        cutoff <- sort(imp, decreasing = TRUE)[nVar]
        # Second pass: refit using only the variables at or above the cutoff.
        # drop = FALSE keeps x two-dimensional if only one variable survives.
        randomForest(x[, imp >= cutoff, drop = FALSE], y, ...)
    }

[Disclaimer: I just wrote the function on the spot, so it is completely untested. This is just to demonstrate how simple it would be. You can embellish it as much as you'd like.]

I have written a function that uses CV to choose the `optimal' number of variables to keep (rather than blindly selecting one up front). I might toss it in the next version of the package...

HTH,
Andy
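For anyone wanting to try the two-stage idea, here is a minimal sketch along the same lines. It is not the code from the post above: it assumes a version of the randomForest package that provides the `importance()` extractor with a `type` argument, and it assumes `x` has column names. The names `top_vars` and `two_stage_rf` are made up for illustration. The selection step is kept in its own small function so it can be checked without fitting anything.

```r
# Pick the names of the n_var highest-ranked variables from a named
# importance vector. Ties at the cutoff are kept, so more than n_var
# names can come back.
top_vars <- function(imp, n_var) {
  n_var <- min(max(1, n_var), length(imp))   # keep at least one, at most all
  cutoff <- sort(imp, decreasing = TRUE)[n_var]
  names(imp)[imp >= cutoff]
}

# Two-stage fit: rank variables on a full forest, then refit on the top ones.
# Assumption: x is a data frame or matrix with column names.
two_stage_rf <- function(x, y, n_var = round(0.1 * ncol(x)), ...) {
  first <- randomForest::randomForest(x, y, importance = TRUE, ...)
  # type = 1 asks for the permutation-based measure
  # (mean decrease in accuracy, or %IncMSE for regression).
  imp <- randomForest::importance(first, type = 1)[, 1]
  keep <- top_vars(imp, n_var)
  list(vars = keep,
       fit = randomForest::randomForest(x[, keep, drop = FALSE], y, ...))
}

# Usage on a built-in data set, run only if the package is installed.
if (requireNamespace("randomForest", quietly = TRUE)) {
  res <- two_stage_rf(iris[, 1:4], iris$Species, n_var = 2)
  print(res$vars)
}
```

Returning the selected variable names alongside the second fit makes it easy to see what was kept. Note that ranking and refitting on the same data makes the second forest's error estimate optimistic, which is presumably part of why a CV-based choice of the number of variables is mentioned above.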
David L. Van Brunt, Ph.D.
2004-May-07 16:29 UTC
[R] " cannot allocate vector of length 1072693248"
Well, I've done everything I can think of to make the code more efficient, but it seems (if I read this error message correctly) that I'm running out of memory on what should be a small data set.

Here's the sequence:

1. Start with a vector of 30 possible identifiers.
2. For each value, query a MySQL database to pull in 80 or so records.
3. Run a few different random forests on those data.
4. Print the results as I go.
5. "rm()" all the objects created in the loop, including all the old data.
6. Move to the next value, repeating for each identifier.

Around the third time through the loop, I get kicked out with:

"Error in as.vector(data) : cannot allocate vector of length 1072693248"

This is on Mac OS X, which is UNIX-based (from FreeBSD, as I understand it), so I haven't set anything special with the memory at startup. But then, I only have 80 observations running through at a time, so it doesn't seem like there should be a memory issue.

Any ideas?
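One way to structure the sequence described above is to do each iteration's work inside a function, so that the data and models go out of scope on their own rather than relying on rm(), and to check the size of what the query returned before fitting anything. The sketch below is only an illustration of that structure and does not diagnose the error. The original code is not shown, so `fetch_records` and `fit_models` are hypothetical stand-ins: the first fakes the MySQL query with random data, and the second uses lm() in place of the random forest calls so the example runs with base R alone.

```r
# Hypothetical stand-in for the MySQL query: 80 rows for one identifier.
fetch_records <- function(id) {
  data.frame(id = id, x1 = rnorm(80), x2 = rnorm(80), y = rnorm(80))
}

# Hypothetical stand-in for "run a few different random forests".
# Replace the lm() call with the real randomForest() calls.
fit_models <- function(dat) {
  lm(y ~ x1 + x2, data = dat)
}

# All per-identifier work happens here; dat and fit become unreachable
# once the function returns, so no explicit rm() is needed.
process_one <- function(id, max_rows = 1000) {
  dat <- fetch_records(id)
  # Fail early with a readable error if the query returned something odd.
  stopifnot(is.data.frame(dat), nrow(dat) > 0, nrow(dat) <= max_rows)
  fit <- fit_models(dat)
  # Return only a small summary, not the data or the model.
  c(n = nrow(dat), r2 = summary(fit)$r.squared)
}

ids <- sprintf("id%02d", 1:30)
results <- vector("list", length(ids))
names(results) <- ids
for (i in seq_along(ids)) {
  results[[i]] <- process_one(ids[i])
  cat(ids[i], ": n =", results[[i]]["n"], "\n")
  invisible(gc())   # optional: trigger a garbage collection between iterations
}
```

A request for a vector of about a billion elements is far out of proportion to 80 rows, so a size check like the one in `process_one` (or running traceback() right after the error) may help show which object has unexpected dimensions at the point of failure.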