Dear all,

Can anybody give me a hint about the following error message I got when using randomForest?

I have a two-class classification problem. The data file "sample" is:
----------------------------------------------------------
  udomain.edu udomain.hcs hpclass
1      1.0000           1     not
2          NA           2     not
3          NA         0.8     not
4          NA         0.2      hp
5          NA         0.9      hp
------------------------------------------------------------
The steps I used to call the function are:

(1) Read data
    hp <- read.table("sample")

(2) Call randomForest
    hp.rf <- randomForest(hpclass ~., yy, data=hp, importance=TRUE, proximity=TRUE)

But the error message I got is:
    Error in randomForest.default(m, y, ...) :
      Need at least two classes to do classification.

I learned the usage of randomForest from:
http://www.maths.lth.se/help/R/.R/library/randomForest/html/randomForest.html

Thanks a lot in advance for any comments!

Hui Han
Department of Computer Science and Engineering,
The Pennsylvania State University
University Park, PA, 16802
email: hhan at cse.psu.edu
homepage: http://www.cse.psu.edu/~hhan
On Wed, 31 Mar 2004, Hui Han wrote:

> Dear all,
>
> Can anybody give me some hint on the following error msg I got with using
> randomForest?
>
> I have two-class classification problem. The data file "sample" is:
> ----------------------------------------------------------
>   udomain.edu udomain.hcs hpclass
> 1      1.0000           1     not
> 2          NA           2     not
> 3          NA         0.8     not
> 4          NA         0.2      hp
> 5          NA         0.9      hp
> ------------------------------------------------------------
> The steps I called the function are:
> (1) Read data
> hp <- read.table("sample")

Most probably the problem is here. Say

R> summary(hp)

and check whether the factor `hpclass' has two levels.

Torsten

> (2) Call randomForest
> hp.rf <- randomForest(hpclass ~., yy, data=hp, importance=TRUE,
> proximity=TRUE)
>
> But the error msg I got is:
> Error in randomForest.default(m, y, ...) :
> Need at least two classes to do classification.
>
> I learned the usage of randomForest from:
> http://www.maths.lth.se/help/R/.R/library/randomForest/html/randomForest.html
>
> Thanks a lot for any of your comments in advance!
>
> Hui Han
> Department of Computer Science and Engineering,
> The Pennsylvania State University
> University Park, PA,16802
> email: hhan at cse.psu.edu
> homepage: http://www.cse.psu.edu/~hhan
>
> ______________________________________________
> R-help at stat.math.ethz.ch mailing list
> https://www.stat.math.ethz.ch/mailman/listinfo/r-help
> PLEASE do read the posting guide! http://www.R-project.org/posting-guide.html
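Torsten's check can be sketched roughly as follows. This is only an illustration under a few assumptions: the five rows from the post are inlined with `text =` instead of being read from the file "sample", and `stringsAsFactors = TRUE` is set explicitly because R 4.0 and later no longer turn strings into factors by default.

```r
# Inline copy of the "sample" file from the original post (assumption: the
# file is whitespace-separated, with the row number as the first field).
sample_txt <- "udomain.edu udomain.hcs hpclass
1 1.0000 1 not
2 NA 2 not
3 NA 0.8 not
4 NA 0.2 hp
5 NA 0.9 hp"

# stringsAsFactors = TRUE makes hpclass a factor on R >= 4.0 as well.
hp <- read.table(text = sample_txt, stringsAsFactors = TRUE)

summary(hp)          # per-column summary, including NA counts
levels(hp$hpclass)   # the class labels read.table found
table(hp$hpclass)    # rows per class before any NA handling

# Rows that survive NA removal, and the classes they still cover.
hp_complete <- na.omit(hp)
table(hp_complete$hpclass)
```

In this data the factor itself has both levels; it is only after the rows containing NA are dropped that a single row, and therefore a single class, remains.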
Thinking that the following suggestions from Matt may be helpful to others, I am forwarding his notes to the R list.

Regards,
Hui

On Wed, Mar 31, 2004 at 08:57:13AM -0800, Austin, Matt wrote:
> Use na.action=na.omit in your function call to delete those rows, but this
> can give you problems if you want to use follow-up methods such as
> partial.plot(). This is what I usually do:
>
>   naRows <- apply(data2, 1, function(x) any(is.na(x)))
>   sum(!(naRows))
>   data2.noNAs <- data2[!naRows,]
>   chg.rf <- randomForest(ch13 ~ ., data=data2.noNAs, importance=TRUE,
>                          keep.forest=TRUE)
>
> That way when I call partial.plot() like in the following example I don't
> run into trouble with NAs in the original dataset not matching with what was
> used in the random forest fit.
>
>   postscript("temp.ps", horizontal=TRUE)
>   par(mfrow=c(4,4))
>   for(i in 1:length(varNames)){
>     partial.plot(chg.rf, data2.noNAs, varNames[i], ylim=c(.95, 1.7))
>   }
>   dev.off()
>
> -----Original Message-----
> From: Hui Han [mailto:hhan at cse.psu.edu]
> Sent: Wednesday, March 31, 2004 8:12 AM
> To: Austin, Matt
> Subject: Re: [R] help with the usage of "randomForest"
>
> Matt,
>
> I appreciate your help so much!! Yes, I changed all NAs to real values, and
> the error msg. disappeared.
> However my real dataset contains many NAs. Can you give me more suggestions
> on how to define na.action not to be na.fail?
>
> Thank you so much again,
>
> Hui
>
> On Wed, Mar 31, 2004 at 08:02:47AM -0800, Austin, Matt wrote:
> > What is yy? Is this your subset index? If so make sure that you are not
> > removing all of one class. Note that the default na.action in randomForest
> > is na.fail, so even if your subsetting isn't removing all of the rows with
> > an NA the method should still fail.
> >
> > --Matt
> >
> > -----Original Message-----
> > From: r-help-bounces at stat.math.ethz.ch
> > [mailto:r-help-bounces at stat.math.ethz.ch]On Behalf Of Hui Han
> > Sent: Wednesday, March 31, 2004 6:11 AM
> > To: r-help at stat.math.ethz.ch
> > Subject: [R] help with the usage of "randomForest"
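Matt's row filtering can also be written with base R's `complete.cases()`, sketched below. The data frame here is a hypothetical stand-in for his `data2` (the real one is not shown in the thread), and the column names `x1` and `x2` are invented for the illustration; only `ch13`, `data2.noNAs` and `chg.rf` come from his message. The model fit is guarded so the filtering part runs even where the randomForest package is not installed.

```r
# Hypothetical toy data standing in for Matt's data2; ch13 is the response.
set.seed(1)
data2 <- data.frame(
  ch13 = rnorm(8),
  x1   = c(1, NA, 3, 4, 5, NA, 7, 8),
  x2   = c(2, 4, NA, 8, 10, 12, 14, 16)
)

# complete.cases() is TRUE for rows without any NA; it gives the same rows
# as !naRows but skips the matrix coercion that apply() does on a data frame.
keep <- complete.cases(data2)
sum(keep)                      # number of usable rows
data2.noNAs <- data2[keep, ]

# Fit on the NA-free frame so later plots can reuse exactly the same rows.
if (requireNamespace("randomForest", quietly = TRUE)) {
  chg.rf <- randomForest::randomForest(ch13 ~ ., data = data2.noNAs,
                                       importance = TRUE, keep.forest = TRUE)
  print(chg.rf)
}
```

Keeping `data2.noNAs` around, rather than passing `na.action = na.omit` to the fit, is what lets follow-up calls see the same rows the forest was trained on.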
As you've learned, using the formula interface, the NAs are handled by na.action. (BTW, the R default is na.omit, so the NAs are silently omitted. If it were na.fail, you would have gotten an error message.)

There are several options for handling NAs, na.omit being one of them. If you have too many NAs, omitting them would leave you with too little data, as you experienced. One possibility is to use na.roughfix (in the randomForest package) as na.action, which replaces the NAs with the median of the variable (or the mode for a factor variable). If you want to, you can use rfImpute to have randomForest itself impute the NAs (assuming your training data isn't terribly big).

HTH,
Andy

> From: Hui Han
>
> Thanks to Matt and Torsten for the very helpful suggestions!
> As Matt pointed out, the problem is that na.action has the
> default value of na.fail, which deleted the samples of one class.
> I changed all NAs to real values, and the error msg. disappeared.
>
> However my real dataset contains many NAs. I wonder if
> anybody can point me to any documentation on
> how to define na.action not to be na.fail?
>
> Best regards,
> Hui
>
> On Wed, Mar 31, 2004 at 06:26:36PM +0200, Torsten Hothorn wrote:
> > On Wed, 31 Mar 2004, Hui Han wrote:
> > > (1) Read data
> > > hp <- read.table("sample")
> >
> > most probably a problem here. say
> >
> > R> summary(hp)
> >
> > and check if the factor `hpclass' has two levels.
> >
> > Torsten
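The two options Andy mentions might look roughly like this on the five-row sample from the original post. The helper `rough_fill()` is hypothetical, written only to illustrate the median/mode rule he describes; it is not the package's implementation. The calls to `na.roughfix` and `rfImpute` are guarded because they need the randomForest package, and `stringsAsFactors = TRUE` is an assumption for current R versions.

```r
# The sample data from the original post, inlined for a self-contained run.
hp <- read.table(text = "udomain.edu udomain.hcs hpclass
1 1.0000 1 not
2 NA 2 not
3 NA 0.8 not
4 NA 0.2 hp
5 NA 0.9 hp", stringsAsFactors = TRUE)

# Hypothetical helper illustrating the rule described above: median for
# numeric columns, most frequent level for factor columns.
rough_fill <- function(df) {
  for (nm in names(df)) {
    col <- df[[nm]]
    if (!anyNA(col)) next
    if (is.numeric(col)) {
      col[is.na(col)] <- median(col, na.rm = TRUE)
    } else if (is.factor(col)) {
      col[is.na(col)] <- names(which.max(table(col)))
    }
    df[[nm]] <- col
  }
  df
}

hp_filled <- rough_fill(hp)
table(hp_filled$hpclass)   # no rows were dropped, so both classes remain

if (requireNamespace("randomForest", quietly = TRUE)) {
  library(randomForest)
  # Let the formula interface fill the NAs instead of dropping the rows.
  hp.rf <- randomForest(hpclass ~ ., data = hp, na.action = na.roughfix,
                        importance = TRUE, proximity = TRUE)
  # rfImpute() refines the rough fix using the forest's proximities.
  hp_imputed <- rfImpute(hpclass ~ ., data = hp)
}
```

With only one observed value in `udomain.edu`, any imputation of that column is uninformative here; the point of the sketch is just that no rows, and hence no classes, are lost.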