Martin Lam
2005-Aug-26 15:52 UTC
[R] problem with certain data sets when using randomForest
Hi,

Since I've had no replies to my previous post about this problem, I am posting it again in the hope that someone notices it. The problem is that the randomForest function doesn't accept data sets whose instances cover only a subset of the class levels. For example, a data set whose instances all belong to class "a" or "b", taken from the levels "a", "b" and "c", doesn't work because no instance has class "c". Is there any way to solve this problem?

library("randomForest")

# load the iris plant data set
dataset <- iris

numberarray <- array(1:nrow(dataset), nrow(dataset), 1)

# include only instances with Species = setosa or virginica
indices <- t(numberarray[(dataset$Species == "setosa" |
                          dataset$Species == "virginica") == TRUE])

finaldataset <- dataset[indices,]

# just to let you see the 3 classes
levels(finaldataset$Species)

# create the random forest
randomForest(formula = Species ~ ., data = finaldataset, ntree = 5)

# The error message I get
Error in randomForest.default(m, y, ...) :
        Can't have empty classes in y.

# The problem is that finaldataset doesn't contain any instances of
# "versicolor", so I think the only way to solve this is to change the
# levels of "Species" to only "setosa" and "virginica"; correct me if
# I'm wrong.

# So I tried to change the levels, but I got stuck:

# get the possible unique classes
uniqueItems <- unique(levels(finaldataset$Species))

# the problem!
newlevels <- list(uniqueItems[1] = c(uniqueItems[1], uniqueItems[2]),
                  uniqueItems[3] = uniqueItems[3])

# Error message
Error: syntax error

# In the help they use constant names to rename the levels, so this
# works (but it's not what I want, because I don't want to change the
# code every time I use another data set):
newlevels <- list("setosa" = c(uniqueItems[1], uniqueItems[2]),
                  "virginica" = uniqueItems[3])

levels(finaldataset$Species) <- newlevels

levels(finaldataset$Species)

finaldataset$Species

---------------------------

Thanks in advance,

Martin
Prof Brian Ripley
2005-Aug-26 16:19 UTC
[R] problem with certain data sets when using randomForest
Look at ?"[.factor":

finaldataset$Species <- finaldataset$Species[, drop = TRUE]

solves this.

--
Brian D. Ripley, ripley at stats.ox.ac.uk
Professor of Applied Statistics, http://www.stats.ox.ac.uk/~ripley/
University of Oxford, Tel: +44 1865 272861 (self)
1 South Parks Road, +44 1865 272866 (PA)
Oxford OX1 3TG, UK Fax: +44 1865 272595
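The one-line fix above can be tried without randomForest installed. The following is a minimal sketch using only base R and the built-in iris data; the variable name `sub` is made up for illustration and does not come from the thread. It shows that subsetting keeps the unused level and that re-indexing with `drop = TRUE` removes it.

```r
# Subset iris to two species; the factor still carries all three levels.
sub <- iris[iris$Species %in% c("setosa", "virginica"), ]
levels(sub$Species)   # still lists "versicolor"
table(sub$Species)    # "versicolor" has a count of 0

# Re-index the factor with drop = TRUE to discard unused levels.
sub$Species <- sub$Species[, drop = TRUE]
levels(sub$Species)   # only the levels that actually occur
```

After this step the response has no empty classes, so a call such as `randomForest(Species ~ ., data = sub, ntree = 5)` should no longer stop with the error quoted in the question. More recent versions of R also provide `droplevels()`, which does the same thing for a factor or for every factor in a data frame.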
Liaw, Andy
2005-Aug-31 13:47 UTC
[R] problem with certain data sets when using randomForest
I've been trying to play catch-up on R-help since DSC2005. This one must have slipped through...

This is what I'd do:

iris.sub <- subset(iris, Species %in% c("setosa", "virginica"))
iris.sub$Species <- factor(iris.sub$Species)

That last line drops the empty level in the factor. You can then run randomForest with that data.

HTH,
Andy
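The original question also asked how to build the list for `levels<-` without hard-coding the level names. The syntax error arises because, in a call like `list(name = value)`, the name must be a symbol or a string literal, not an expression such as `uniqueItems[1]`. One way around this, sketched below under the assumption that the merge-by-position scheme from the question is what is wanted, is to build the list unnamed and assign its names afterwards. The variable names `sub` and `old` are illustrative only.

```r
sub <- subset(iris, Species %in% c("setosa", "virginica"))
old <- levels(sub$Species)   # "setosa" "versicolor" "virginica"

# Build the list first, then attach names computed at run time.
newlevels <- list(c(old[1], old[2]), old[3])
names(newlevels) <- c(old[1], old[3])

# Rename/merge the levels as in the question's working example.
levels(sub$Species) <- newlevels
levels(sub$Species)
```

`setNames(list(...), c(old[1], old[3]))` is an equivalent one-liner. For the specific problem in this thread, though, simply dropping the unused level as shown in the replies is shorter and does not depend on the order of the levels.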