Mikkel Grum
2005-Jul-08 14:16 UTC
[R] removing factor level represented by less than x rows
In a number of different situations I'm trying to
remove factor levels that are represented by less than
a certain number of rows, e.g. if I had the dataset aa
below and wanted to remove the species that are
represented in less than 2 rows:
data(iris)
aa <- iris[1:101,]
In this case, since I can see that the species
virginica only has one row, I can write:
table(aa$Species)
    setosa versicolor  virginica
        50         50          1
aa[aa$Species != "virginica", ]
but:
aa[aa$Species == names(table(aa$Species)> 2),]
does not work.
This must be a fairly common task with a
straightforward solution that I can't see. Any ideas?
Best wishes,
Mikkel
Sundar Dorai-Raj
2005-Jul-08 14:39 UTC
[R] removing factor level represented by less than x rows
Mikkel Grum wrote:
> In a number of different situations I'm trying to
> remove factor levels that are represented by less than
> a certain number of rows, e.g. if I had the dataset aa
> below and wanted to remove the species that are
> represented in less than 2 rows:
>
> data(iris)
> aa <- iris[1:101,]
>
> In this case, since I can see that the species
> virginica only has one row, I can write:
>
> table(aa$Species)
>     setosa versicolor  virginica
>         50         50          1
>
> aa[aa$Species != "virginica", ]
>
> but:
>
> aa[aa$Species == names(table(aa$Species)> 2),]
>
> does not work.
>

If you take a look at "table(aa$Species) > 2" you'll see your first mistake. Namely, the result is a logical vector and the names of all three levels are still present, so names() returns every level. Your second mistake is to use "==" to match against a set of names. "==" compares element by element (recycling the shorter vector); it does not test membership. What you want is "%in%" instead. I think you want the following:

keep <- levels(aa$Species)[table(aa$Species) > 2]
aa <- aa[aa$Species %in% keep, ]

However, the level for "virginica" is still present in the Species variable. If you would like to drop this completely, then try

aa$Species <- aa$Species[drop = TRUE]

HTH,

--sundar
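For readers who want to reuse the approach above, here is a minimal sketch that wraps it in a function. The name drop_rare_levels and its arguments are hypothetical (they do not appear in the thread), and the sketch assumes the column is a factor. It uses ">= min_rows" so that min_rows = 2 matches the original question ("fewer than 2 rows"), whereas the reply above uses "> 2".

```r
# Hypothetical helper, not from the thread: keep only rows whose factor
# level occurs at least `min_rows` times, then drop the unused levels.
# Assumes data[[column]] is a factor.
drop_rare_levels <- function(data, column, min_rows) {
  counts <- table(data[[column]])
  keep <- names(counts)[counts >= min_rows]        # levels frequent enough
  out <- data[data[[column]] %in% keep, , drop = FALSE]
  out[[column]] <- out[[column]][drop = TRUE]      # drop now-empty levels
  out
}

aa <- iris[1:101, ]

# Why the original attempt fails: the comparison keeps every name,
# so names() still returns all three levels.
names(table(aa$Species) > 2)
# "setosa" "versicolor" "virginica"

bb <- drop_rare_levels(aa, "Species", min_rows = 2)
levels(bb$Species)
# "setosa" "versicolor"
```

Rebuilding the factor at the end matters if the result is later passed to table(), split() or a model formula, since empty levels would otherwise still show up there.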
Frank E Harrell Jr
2005-Jul-08 15:16 UTC
[R] removing factor level represented by less than x rows
Mikkel Grum wrote:
> In a number of different situations I'm trying to
> remove factor levels that are represented by less than
> a certain number of rows, e.g. if I had the dataset aa
> below and wanted to remove the species that are
> represented in less than 2 rows:
>
> data(iris)
> aa <- iris[1:101,]
>
> In this case, since I can see that the species
> virginica only has one row, I can write:
>
> table(aa$Species)
>     setosa versicolor  virginica
>         50         50          1
>
> aa[aa$Species != "virginica", ]
>
> but:
>
> aa[aa$Species == names(table(aa$Species)> 2),]
>
> does not work.
>
> This must be a fairly common task with a
> straightforward solution that I can't see. Any ideas?
>
> Best wishes,
> Mikkel

library(Hmisc)
?combine.levels

This doesn't remove levels but combines infrequent ones though.

Frank

--
Frank E Harrell Jr
Professor and Chair
School of Medicine
Department of Biostatistics
Vanderbilt University
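To illustrate the idea of combining rather than removing infrequent levels, here is a rough base R sketch. It is not Hmisc's implementation and does not call combine.levels; the function name pool_rare_levels, its arguments, and the "OTHER" label are all assumptions made for illustration. It relies on the fact that assigning duplicate labels to levels() merges the corresponding levels.

```r
# Hypothetical helper, not Hmisc's combine.levels: merge every level with
# fewer than `min_rows` rows into a single pooled level.
pool_rare_levels <- function(f, min_rows, other = "OTHER") {
  counts <- table(f)
  rare <- names(counts)[counts < min_rows]   # levels that are too infrequent
  lev <- levels(f)
  lev[lev %in% rare] <- other                # relabel the rare levels
  levels(f) <- lev                           # duplicate labels are merged
  f
}

aa <- iris[1:101, ]
aa$Species2 <- pool_rare_levels(aa$Species, min_rows = 2)
table(aa$Species2)
```

With this data only "virginica" falls below the threshold, so it is simply relabelled; with several rare levels they would all end up in the one pooled level.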