Mikkel Grum
2005-Jul-08 14:16 UTC
[R] removing factor level represented by less than x rows
In a number of different situations I'm trying to remove factor levels that are represented by less than a certain number of rows, e.g. if I had the dataset aa below and wanted to remove the species that are represented in less than 2 rows:

    data(iris)
    aa <- iris[1:101,]

In this case, since I can see that the species virginica only has one row, I can write:

    table(aa$Species)

        setosa versicolor  virginica
            50         50          1

    aa[aa$Species != "virginica", ]

but:

    aa[aa$Species == names(table(aa$Species) > 2), ]

does not work.

This must be a fairly common task with a straightforward solution that I can't see. Any ideas?

Best wishes,
Mikkel
Sundar Dorai-Raj
2005-Jul-08 14:39 UTC
[R] removing factor level represented by less than x rows
Mikkel Grum wrote:
> In a number of different situations I'm trying to
> remove factor levels that are represented by less than
> a certain number of rows, e.g. if I had the dataset aa
> below and wanted to remove the species that are
> represented in less than 2 rows:
>
> data(iris)
> aa <- iris[1:101,]
>
> In this case, since I can see that the species
> virginica only has one row, I can write:
>
> table(aa$Species)
>     setosa versicolor  virginica
>         50         50          1
>
> aa[aa$Species != "virginica", ]
>
> but:
>
> aa[aa$Species == names(table(aa$Species) > 2), ]
>
> does not work.
>

If you take a look at "table(aa$Species) > 2" you'll see your first mistake. Namely, the names are all still present. Your second mistake is to use "==" to match two names. "==" does not work like that. What you want is "%in%" instead. I think you want the following:

    keep <- levels(aa$Species)[table(aa$Species) > 2]
    aa <- aa[aa$Species %in% keep, ]

However, the level for "virginica" is still present in the Species variable. If you would like to drop this completely, then try

    aa$Species <- aa$Species[drop = TRUE]

HTH,

--sundar
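The two steps in the reply above (filter the rows with "%in%", then drop the emptied level) can be wrapped into one reusable function. What follows is only a minimal sketch: the helper name drop_rare_levels and its arguments are hypothetical and do not come from the thread, and it assumes the column is a factor and that droplevels() is available (it is in current versions of R; the [drop = TRUE] idiom from the reply does the same job). The reply's "> 2" would also discard a level with exactly 2 rows, so the sketch uses ">=" to match the "less than 2 rows" wording of the question.

```r
# The comparison keeps every name, which is why names(...) cannot be used
# to pick out the frequent levels.
aa <- iris[1:101, ]
names(table(aa$Species) > 2)
# [1] "setosa"     "versicolor" "virginica"

# Hypothetical helper (not from the thread): keep only the rows whose level
# in `column` occurs at least `min_rows` times, then drop unused levels.
# Assumes df[[column]] is a factor.
drop_rare_levels <- function(df, column, min_rows) {
  counts <- table(df[[column]])
  keep <- names(counts)[counts >= min_rows]          # levels that are frequent enough
  out <- df[df[[column]] %in% keep, , drop = FALSE]  # filter the rows
  out[[column]] <- droplevels(out[[column]])         # remove the emptied levels
  out
}

bb <- drop_rare_levels(aa, "Species", min_rows = 2)
table(bb$Species)
```

With the example data this should leave 100 rows and only the levels setosa and versicolor. Rows with a missing value in the column are dropped as well, because NA never matches a level name.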
Frank E Harrell Jr
2005-Jul-08 15:16 UTC
[R] removing factor level represented by less than x rows
Mikkel Grum wrote:
> In a number of different situations I'm trying to
> remove factor levels that are represented by less than
> a certain number of rows, e.g. if I had the dataset aa
> below and wanted to remove the species that are
> represented in less than 2 rows:
>
> data(iris)
> aa <- iris[1:101,]
>
> In this case, since I can see that the species
> virginica only has one row, I can write:
>
> table(aa$Species)
>     setosa versicolor  virginica
>         50         50          1
>
> aa[aa$Species != "virginica", ]
>
> but:
>
> aa[aa$Species == names(table(aa$Species) > 2), ]
>
> does not work.
>
> This must be a fairly common task with a straightforward
> solution that I can't see. Any ideas?
>
> Best wishes,
> Mikkel

    library(Hmisc)
    ?combine.levels

Note that this combines infrequent levels rather than removing them.

Frank

--
Frank E Harrell Jr
Professor and Chair
School of Medicine
Department of Biostatistics
Vanderbilt University
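For readers without Hmisc installed, the idea of combining infrequent levels instead of removing them can be illustrated in base R. This is a rough sketch of the general technique only, not Hmisc's implementation of combine.levels. The helper name lump_rare_levels, the count-based threshold min_rows, and the catch-all label "OTHER" are all assumptions made for the example.

```r
# Hypothetical helper (not Hmisc's combine.levels): merge every level that has
# fewer than `min_rows` rows into a single catch-all level named `other`.
lump_rare_levels <- function(f, min_rows, other = "OTHER") {
  f <- factor(f)                             # also drops levels with no rows
  counts <- table(f)
  rare <- names(counts)[counts < min_rows]   # levels that are too infrequent
  lev <- levels(f)
  lev[lev %in% rare] <- other                # rename all rare levels alike
  levels(f) <- lev                           # duplicated names are merged
  f
}

aa <- iris[1:101, ]
cc <- lump_rare_levels(aa$Species, min_rows = 2)
table(cc)
```

No rows are lost here: the single virginica row is kept and relabelled, so the result still has 101 values. Whether dropping or lumping is preferable depends on what the data are used for afterwards.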