Here's one way, worked out step by step so you can see
what each step does:
> mydata <- data.frame(MyFactor = factor(rep(LETTERS[1:4],
+     times=c(1000, 2000, 30, 4))), something = runif(3034))
> str(mydata)
'data.frame': 3034 obs. of  2 variables:
 $ MyFactor : Factor w/ 4 levels "A","B","C","D": 1 1 1 1 1 1 1 1 1 1 ...
 $ something: num  0.725 0.222 0.347 0.614 0.968 ...
> table(mydata$MyFactor)
   A    B    C    D
1000 2000   30    4
>
> important.levels <- table(mydata$MyFactor) / nrow(mydata)
> important.levels <- names(important.levels)[important.levels > .01]
> important.levels
[1] "A" "B"
> newdata <- mydata[mydata$MyFactor %in% important.levels, ]
> table(newdata$MyFactor)
   A    B    C    D
1000 2000    0    0
>
> newdata$MyFactor <- factor(newdata$MyFactor, levels=important.levels)
> table(newdata$MyFactor)
   A    B
1000 2000
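
If you need this for more than one column, the same steps can be wrapped
in a small function. This is only a sketch: drop.rare.levels and its
arguments are names made up for illustration, and it uses droplevels()
in place of the explicit factor(..., levels=) call. The strict ">"
comparison against 0.01 matches the steps above.

```r
# Hypothetical helper (not from the original reply): keep only the rows
# whose factor level accounts for more than `min.prop` of all rows, then
# drop the levels that no longer occur.
drop.rare.levels <- function(df, column, min.prop = 0.01) {
  # Proportion of rows taken by each level
  props <- prop.table(table(df[[column]]))
  # Names of the levels that are common enough to keep
  keep <- names(props)[props > min.prop]
  # Subset the rows; drop = FALSE keeps a data frame even with one column
  out <- df[df[[column]] %in% keep, , drop = FALSE]
  # Remove the now-unused levels from the factor
  out[[column]] <- droplevels(out[[column]])
  out
}

# Same example data as above
mydata <- data.frame(
  MyFactor = factor(rep(LETTERS[1:4], times = c(1000, 2000, 30, 4))),
  something = runif(3034)
)
newdata <- drop.rare.levels(mydata, "MyFactor")
table(newdata$MyFactor)
```

Rows where the factor is NA are dropped as well, because NA never
matches any of the kept level names.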
On Wed, Jan 18, 2012 at 5:25 PM, Sam Steingold <sds at gnu.org> wrote:
> I have a data frame with some factor columns.
> I want to drop the rows with rare factor values
> (and remove the factor values from the factors).
> E.g., frame$MyFactor takes values
> A 1,000 times,
> B 2,000 times,
> C 30 times and
> D 4 times.
> I want to remove all rows which assume rare values (<1%), i.e., C and D.
> i.e.,
> frame <- frame[[! (frame$MyFactor %in% c("A","B"))]]
> except that I probably got the syntax wrong
> and I want c("A","B") to be generated automatically
> from frame$MyFactor
> and the number 0.01 (1%).
>
> Thanks!
--
Sarah Goslee
http://www.functionaldiversity.org