Sam Albers
2012-Jan-27 19:26 UTC
[R] Subsetting for the ten highest values by group in a dataframe
Hello, I am looking for a way to subset a data frame by choosing the top ten maximum values from that dataframe. As well this occurs within some factor levels. ## I've used plyr here but I'm not married to this approach require(plyr) ## I've created a data.frame with two groups and then a id variable (y) df <- data.frame(x=rnorm(400, mean=20), y=1:400, z=c("A","B")) ## So using ddply I can find the highest value of x df.max1 <- ddply(df, c("z"), subset, x==sort(x, TRUE)[1]) ## Or the 2nd highest value df.max2 <- ddply(df, c("z"), subset, x==sort(x, TRUE)[2]) ## And so on.... but when I try to make a series of numbers like so ## to get the top ten values, I don't get a warning message but ## two values that don't really make sense to me df.max <- ddply(df, c("z"), subset, x==sort(x, TRUE)[1:10]) ## So no error message when I use the method above, which is clearly wrong. ## But I really am not sure how to diagnose the problem. ## Can anyone suggest a way to subset a data.frame with groups to select the top ten max values in that data.frame for each group? ## Thanks so much in advance? Sam
Phil Spector
2012-Jan-27 22:08 UTC
[R] Subsetting for the ten highest values by group in a dataframe
Sam - I think that subset is what's throwing you off here -- you need a function that will simply return the 10 rows of each group with the highest values of x: function(dat)dat[order(dat$x,decreasing=TRUE)[1:10],] Then ddply(df,'z',function(dat)dat[order(dat$x,decreasing=TRUE)[1:10],]) should give you what you want. In this simple case, you could also use do.call(rbind,by(df,df$z,function(dat)dat[order(dat$x,decreasing=TRUE)[1:10],])) from base R to get the same result. Hope this helps. - Phil Spector Statistical Computing Facility Department of Statistics UC Berkeley spector at stat.berkeley.edu On Fri, 27 Jan 2012, Sam Albers wrote:> Hello, > > I am looking for a way to subset a data frame by choosing the top ten > maximum values from that dataframe. As well this occurs within some > factor levels. > > ## I've used plyr here but I'm not married to this approach > require(plyr) > > ## I've created a data.frame with two groups and then a id variable (y) > df <- data.frame(x=rnorm(400, mean=20), y=1:400, z=c("A","B")) > > ## So using ddply I can find the highest value of x > df.max1 <- ddply(df, c("z"), subset, x==sort(x, TRUE)[1]) > > ## Or the 2nd highest value > df.max2 <- ddply(df, c("z"), subset, x==sort(x, TRUE)[2]) > > ## And so on.... but when I try to make a series of numbers like so > ## to get the top ten values, I don't get a warning message but > ## two values that don't really make sense to me > df.max <- ddply(df, c("z"), subset, x==sort(x, TRUE)[1:10]) > > ## So no error message when I use the method above, which is clearly wrong. > ## But I really am not sure how to diagnose the problem. > > ## Can anyone suggest a way to subset a data.frame with groups to > select the top ten max values in that data.frame for each group? > > ## Thanks so much in advance? > > Sam > > ______________________________________________ > R-help at r-project.org mailing list > https://stat.ethz.ch/mailman/listinfo/r-help > PLEASE do read the posting guide http://www.R-project.org/posting-guide.html > and provide commented, minimal, self-contained, reproducible code. >
Hadley Wickham
2012-Jan-28 13:55 UTC
[R] Subsetting for the ten highest values by group in a dataframe
On Fri, Jan 27, 2012 at 1:26 PM, Sam Albers <tonightsthenight at gmail.com> wrote:> Hello, > > I am looking for a way to subset a data frame by choosing the top ten > maximum values from that dataframe. As well this occurs within some > factor levels. > > ## I've used plyr here but I'm not married to this approach > require(plyr) > > ## I've created a data.frame with two groups and then a id variable (y) > df <- data.frame(x=rnorm(400, mean=20), y=1:400, z=c("A","B")) > > ## So using ddply I can find the highest value of x > df.max1 <- ddply(df, c("z"), subset, x==sort(x, TRUE)[1]) > > ## Or the 2nd highest value > df.max2 <- ddply(df, c("z"), subset, x==sort(x, TRUE)[2]) > > ## And so on.... but when I try to make a series of numbers like so > ## to get the top ten values, I don't get a warning message but > ## two values that don't really make sense to me > df.max <- ddply(df, c("z"), subset, x==sort(x, TRUE)[1:10])Well, sort returns a vector, so you probably want df.max <- ddply(df, c("z"), subset, x %in% sort(x, TRUE)[1:10]) but it would be better to do (e.g.) df.max <- ddply(df, c("z"), subset, rank(x) <= 10) which will also make it possible to deal with ties in a principled way. Hadley -- Assistant Professor / Dobelman Family Junior Chair Department of Statistics / Rice University http://had.co.nz/
Gabor Grothendieck
2012-Jan-28 16:26 UTC
[R] Subsetting for the ten highest values by group in a dataframe
On Fri, Jan 27, 2012 at 2:26 PM, Sam Albers <tonightsthenight at gmail.com> wrote:> Hello, > > I am looking for a way to subset a data frame by choosing the top ten > maximum values from that dataframe. As well this occurs within some > factor levels. > > ## I've used plyr here but I'm not married to this approach > require(plyr) > > ## I've created a data.frame with two groups and then a id variable (y) > df <- data.frame(x=rnorm(400, mean=20), y=1:400, z=c("A","B")) > > ## So using ddply I can find the highest value of x > df.max1 <- ddply(df, c("z"), subset, x==sort(x, TRUE)[1]) > > ## Or the 2nd highest value > df.max2 <- ddply(df, c("z"), subset, x==sort(x, TRUE)[2]) > > ## And so on.... but when I try to make a series of numbers like so > ## to get the top ten values, I don't get a warning message but > ## two values that don't really make sense to me > df.max <- ddply(df, c("z"), subset, x==sort(x, TRUE)[1:10]) > > ## So no error message when I use the method above, which is clearly wrong. > ## But I really am not sure how to diagnose the problem. > > ## Can anyone suggest a way to subset a data.frame with groups to > select the top ten max values in that data.frame for each group? > > ## Thanks so much in advance? >(1) Here is a pure R solution. The result of ave is a vector with the rank within group of each row so we just take the ones that are 10 or less: df.top <- df[ave(-df$x, df$z, FUN = rank) <= 10, ] We may or may not wish to use the ties.method= argument to rank depending on how ties are to be handled. The question did not indicate that the result needed to be sorted but if it does then df.top can be sorted like this: df.top[with(df.top, order(z, -x)), ] (2) Here is an sqldf solution that uses the default SQLite database: library(sqldf) sqldf("select * from df i where rowid in (select rowid from df where z = i.z order by x desc limit 10) order by i.z, i.x desc") This always gives 10 in each group even if there are ties. We can omit the last order by clause if the result need not be sorted. (3) and here is an sqldf using the PostgreSQL database (which has additional simplifying features): library(sqldf) library(RPostgreSQL) sqldf("select * from (select *, row_number() over (partition by z order by x desc) rank from df) s where rank <= 10 order by z, rank") Depending on how we wish to handle ties we may need to replace row_number() with rank(). Also, we can omit the last order by clause if the result need not be sorted. -- Statistics & Software Consulting GKX Group, GKX Associates Inc. tel: 1-877-GKX-GROUP email: ggrothendieck at gmail.com