thr3ads.net - R help - [R] Subsetting for the ten highest values by group in a dataframe [Jan 2012]

If this information is useful, please help other people find it:
Share via:

Sam Albers

2012-Jan-27 19:26 UTC

[R] Subsetting for the ten highest values by group in a dataframe

Hello,

I am looking for a way to subset a data frame by choosing the top ten
maximum values from that dataframe. As well this occurs within some
factor levels.

## I've used plyr here but I'm not married to this approach
require(plyr)

## I've created a data.frame with two groups and then a id variable (y)
df <- data.frame(x=rnorm(400, mean=20), y=1:400,
z=c("A","B"))

## So using ddply I can find the highest value of x
df.max1 <- ddply(df, c("z"), subset, x==sort(x, TRUE)[1])

## Or the 2nd highest value
df.max2 <- ddply(df, c("z"), subset, x==sort(x, TRUE)[2])

## And so on.... but when I try to make a series of numbers like so
## to get the top ten values, I don't get a warning message but
## two values that don't really make sense to me
df.max <- ddply(df, c("z"), subset, x==sort(x, TRUE)[1:10])

## So no error message when I use the method above, which is clearly wrong.
## But I really am not sure how to diagnose the problem.

## Can anyone suggest a way to subset a data.frame with groups to
select the top ten max values in that data.frame for each group?

## Thanks so much in advance?

Sam

Phil Spector

2012-Jan-27 22:08 UTC

head link

[R] Subsetting for the ten highest values by group in a dataframe

Sam -
    I think that subset is what's throwing you off here --
you need a function that will simply return the 10 rows of
each group with the highest values of x:

function(dat)dat[order(dat$x,decreasing=TRUE)[1:10],]

Then

ddply(df,'z',function(dat)dat[order(dat$x,decreasing=TRUE)[1:10],])

should give you what you want.  In this simple case, you could
also use

do.call(rbind,by(df,df$z,function(dat)dat[order(dat$x,decreasing=TRUE)[1:10],]))

from base R to get the same result.

Hope this helps.
 					- Phil Spector
 					 Statistical Computing Facility
 					 Department of Statistics
 					 UC Berkeley
 					 spector at stat.berkeley.edu

On Fri, 27 Jan 2012, Sam Albers wrote:
> Hello,
>
> I am looking for a way to subset a data frame by choosing the top ten
> maximum values from that dataframe. As well this occurs within some
> factor levels.
>
> ## I've used plyr here but I'm not married to this approach
> require(plyr)
>
> ## I've created a data.frame with two groups and then a id variable (y)
> df <- data.frame(x=rnorm(400, mean=20), y=1:400,
z=c("A","B"))
>
> ## So using ddply I can find the highest value of x
> df.max1 <- ddply(df, c("z"), subset, x==sort(x, TRUE)[1])
>
> ## Or the 2nd highest value
> df.max2 <- ddply(df, c("z"), subset, x==sort(x, TRUE)[2])
>
> ## And so on.... but when I try to make a series of numbers like so
> ## to get the top ten values, I don't get a warning message but
> ## two values that don't really make sense to me
> df.max <- ddply(df, c("z"), subset, x==sort(x, TRUE)[1:10])
>
> ## So no error message when I use the method above, which is clearly wrong.
> ## But I really am not sure how to diagnose the problem.
>
> ## Can anyone suggest a way to subset a data.frame with groups to
> select the top ten max values in that data.frame for each group?
>
> ## Thanks so much in advance?
>
> Sam
>
> ______________________________________________
> R-help at r-project.org mailing list
> https://stat.ethz.ch/mailman/listinfo/r-help
> PLEASE do read the posting guide
http://www.R-project.org/posting-guide.html
> and provide commented, minimal, self-contained, reproducible code.
>

Hadley Wickham

2012-Jan-28 13:55 UTC

head link

[R] Subsetting for the ten highest values by group in a dataframe

On Fri, Jan 27, 2012 at 1:26 PM, Sam Albers <tonightsthenight at
gmail.com> wrote:> Hello,
>
> I am looking for a way to subset a data frame by choosing the top ten
> maximum values from that dataframe. As well this occurs within some
> factor levels.
>
> ## I've used plyr here but I'm not married to this approach
> require(plyr)
>
> ## I've created a data.frame with two groups and then a id variable (y)
> df <- data.frame(x=rnorm(400, mean=20), y=1:400,
z=c("A","B"))
>
> ## So using ddply I can find the highest value of x
> df.max1 <- ddply(df, c("z"), subset, x==sort(x, TRUE)[1])
>
> ## Or the 2nd highest value
> df.max2 <- ddply(df, c("z"), subset, x==sort(x, TRUE)[2])
>
> ## And so on.... but when I try to make a series of numbers like so
> ## to get the top ten values, I don't get a warning message but
> ## two values that don't really make sense to me
> df.max <- ddply(df, c("z"), subset, x==sort(x, TRUE)[1:10])
Well, sort returns a vector, so you probably want

df.max <- ddply(df, c("z"), subset, x %in% sort(x, TRUE)[1:10])

but it would be better to do (e.g.)

df.max <- ddply(df, c("z"), subset, rank(x) <= 10)

which will also make it possible to deal with ties in a principled way.

Hadley

-- 
Assistant Professor / Dobelman Family Junior Chair
Department of Statistics / Rice University
http://had.co.nz/

Gabor Grothendieck

2012-Jan-28 16:26 UTC

head link

[R] Subsetting for the ten highest values by group in a dataframe

On Fri, Jan 27, 2012 at 2:26 PM, Sam Albers <tonightsthenight at
gmail.com> wrote:> Hello,
>
> I am looking for a way to subset a data frame by choosing the top ten
> maximum values from that dataframe. As well this occurs within some
> factor levels.
>
> ## I've used plyr here but I'm not married to this approach
> require(plyr)
>
> ## I've created a data.frame with two groups and then a id variable (y)
> df <- data.frame(x=rnorm(400, mean=20), y=1:400,
z=c("A","B"))
>
> ## So using ddply I can find the highest value of x
> df.max1 <- ddply(df, c("z"), subset, x==sort(x, TRUE)[1])
>
> ## Or the 2nd highest value
> df.max2 <- ddply(df, c("z"), subset, x==sort(x, TRUE)[2])
>
> ## And so on.... but when I try to make a series of numbers like so
> ## to get the top ten values, I don't get a warning message but
> ## two values that don't really make sense to me
> df.max <- ddply(df, c("z"), subset, x==sort(x, TRUE)[1:10])
>
> ## So no error message when I use the method above, which is clearly wrong.
> ## But I really am not sure how to diagnose the problem.
>
> ## Can anyone suggest a way to subset a data.frame with groups to
> select the top ten max values in that data.frame for each group?
>
> ## Thanks so much in advance?
>
(1) Here is a pure R solution.  The result of ave is a vector with the
rank within group of each row so we just take the ones that are 10 or
less:

df.top <- df[ave(-df$x, df$z, FUN = rank) <= 10, ]

We may or may not wish to use the ties.method= argument to rank
depending on how ties are to be handled.  The question did not
indicate that the result needed to be sorted but if it does then
df.top can be sorted like this:

df.top[with(df.top, order(z, -x)), ]

(2) Here is an sqldf solution that uses the default SQLite database:

library(sqldf)
sqldf("select * from df i
   where rowid in
    (select rowid from df where z = i.z order by x desc limit 10)
   order by i.z, i.x desc")

This always gives 10 in each group even if there are ties.  We can
omit the last order by clause if the result need not be sorted.

(3) and here is an sqldf using the PostgreSQL database (which has
additional simplifying features):

library(sqldf)
library(RPostgreSQL)

sqldf("select * from
	(select *, row_number() over  (partition by z order by x desc) rank from df) s
	where rank <= 10
	order by z, rank")

Depending on how we wish to handle ties we may need to replace
row_number() with rank().  Also, we can omit the last order by clause
if the result need not be sorted.

-- 
Statistics & Software Consulting
GKX Group, GKX Associates Inc.
tel: 1-877-GKX-GROUP
email: ggrothendieck at gmail.com

Reasonably Related Threads

Search for more possibly parallel threads

R help - Jan 2012 - Subsetting for the ten highest values by group in a dataframe

[R] Subsetting for the ten highest values by group in a dataframe

[R] Subsetting for the ten highest values by group in a dataframe

[R] Subsetting for the ten highest values by group in a dataframe

[R] Subsetting for the ten highest values by group in a dataframe

Reasonably Related Threads