Martin Lam
2005-Aug-26 15:52 UTC
[R] problem with certain data sets when using randomForest
Hi,
Since I've had no replies to my previous post about this
problem, I am posting it again in the hope that someone
notices it. The problem is that the randomForest
function doesn't accept data sets whose instances
cover only a subset of all the classes. So a
data set with instances that belong to either class "a"
or "b", from the levels "a", "b" and "c", doesn't work
because there is no instance that has class "c". Is
there any way to solve this problem?
library("randomForest")
# load the iris plant data set
dataset <- iris
# one-column array of row numbers (dim must be passed as a single vector)
numberarray <- array(1:nrow(dataset), c(nrow(dataset), 1))
# include only instances with Species = setosa or virginica
indices <- t(numberarray[(dataset$Species == "setosa" |
                          dataset$Species == "virginica") == TRUE])
finaldataset <- dataset[indices,]
# just to let you see the 3 classes
levels(finaldataset$Species)
# create the random forest
randomForest(formula = Species ~ ., data = finaldataset, ntree = 5)
# The error message I get
Error in randomForest.default(m, y, ...) :
Can't have empty classes in y.
# The problem is that the finaldataset doesn't contain
# any instances of "versicolor", so I think the only
# way to solve this problem is by changing the levels
# that "Species" has to only "setosa" and "virginica";
# correct me if I'm wrong.
# So I tried to change the levels but I got stuck:
# get the possible unique classes
uniqueItems <- unique(levels(finaldataset$Species))
# the problem!
newlevels <- list(uniqueItems[1] = c(uniqueItems[1],
uniqueItems[2]), uniqueItems[3] = uniqueItems[3])
# Error message
Error: syntax error
# In the help they use constant names to rename the
#levels, so this works (but that's not what I want
#because I don't want to change the code every time I
#use another data set):
newlevels <- list("setosa" = c(uniqueItems[1],
uniqueItems[2]), "virginica" = uniqueItems[3])
levels(finaldataset$Species) <- newlevels
levels(finaldataset$Species)
finaldataset$Species
---------------------------
Thanks in advance,
Martin
Prof Brian Ripley
2005-Aug-26 16:19 UTC
[R] problem with certain data sets when using randomForest
Look at ?"[.factor":

finaldataset$Species <- finaldataset$Species[, drop=TRUE]

solves this.

-- 
Brian D. Ripley, ripley at stats.ox.ac.uk
Professor of Applied Statistics, http://www.stats.ox.ac.uk/~ripley/
University of Oxford, Tel: +44 1865 272861 (self)
1 South Parks Road, +44 1865 272866 (PA)
Oxford OX1 3TG, UK Fax: +44 1865 272595
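For readers following along, the suggestion above can be tried with base R alone. This is a minimal sketch rather than a definitive recipe: the subsetting with %in% and the variable name dropped are illustrative choices, and droplevels() is not mentioned anywhere in this thread; it is shown only as a possible alternative in current versions of R.

```r
# Sketch: dropping unused factor levels after subsetting (base R only).
dataset <- iris
finaldataset <- dataset[dataset$Species %in% c("setosa", "virginica"), ]

# The subset still carries all three levels; "versicolor" is now empty.
print(levels(finaldataset$Species))
print(table(finaldataset$Species))

# The idiom from ?"[.factor": re-index the factor with drop = TRUE.
# 'dropped' is a hypothetical name used only for this illustration.
dropped <- finaldataset$Species[, drop = TRUE]
print(levels(dropped))

# Alternative not mentioned in the thread: droplevels() from base R.
print(levels(droplevels(finaldataset$Species)))
```

Assigning the result back to finaldataset$Species, as in the reply, leaves the data frame with a factor that has no empty classes.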
Liaw, Andy
2005-Aug-31 13:47 UTC
[R] problem with certain data sets when using randomForest
I've been trying to play catch-up on R-help since DSC2005. This one must
have slipped through...
This is what I'd do:
iris.sub <- subset(iris, Species %in% c("setosa", "virginica"))
iris.sub$Species <- factor(iris.sub$Species)
That last line drops the empty level in the factor. You can then run
randomForest with that data.
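Put together, the whole thing might look like the sketch below. It assumes the randomForest package is installed (the requireNamespace() guard, the seed and the name fit are my additions for illustration), and it reuses ntree = 5 from the original post.

```r
# Sketch: subset, drop the empty level, then fit the forest.
iris.sub <- subset(iris, Species %in% c("setosa", "virginica"))
iris.sub$Species <- factor(iris.sub$Species)  # drops the empty "versicolor" level

# Guard is an assumption for illustration: skip the fit if the package is absent.
if (requireNamespace("randomForest", quietly = TRUE)) {
  set.seed(1)  # arbitrary seed, only so repeated runs match
  fit <- randomForest::randomForest(Species ~ ., data = iris.sub, ntree = 5)
  print(fit)
} else {
  message("randomForest is not installed; skipping the fit")
}
```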
HTH,
Andy
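On the narrower question in the original post (building the list for levels<- without hard-coding the level names): argument names inside list(...) have to be literal, which is why uniqueItems[1] = ... is a syntax error. One way around it, sketched here under the assumption that the same merge as in the original post is wanted, is to build the list unnamed and attach the names afterwards with setNames(). The name species is hypothetical. Note that this mapping, like the one in the post, merges the second level into the first, which may not be what is intended.

```r
# Sketch: naming the list elements from variables instead of constants.
uniqueItems <- levels(iris$Species)

# Build the list unnamed, then attach names computed at run time.
newlevels <- setNames(
  list(c(uniqueItems[1], uniqueItems[2]), uniqueItems[3]),
  c(uniqueItems[1], uniqueItems[3])
)

species <- iris$Species      # hypothetical copy, so iris itself is untouched
levels(species) <- newlevels # second level is merged into the first
print(levels(species))
print(table(species))
```

For simply removing empty levels, the factor() call above is shorter; this form is only needed when levels really have to be renamed or merged.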