James Kebinger
2010-Apr-13 01:33 UTC
[R] efficiently picking one row from a data frame per unique key
Hello all,

I'm trying to transform data frames by grouping the rows by the values in a particular column, ordered by another column, then picking the first row in each group.

I'd like to convert a data frame like this:

  x  y  z
  1 10 20
  1 11 19
  2 12 18
  4 13 17

into one with three rows, like this, where I've discarded one row:

    x  y  z
  1 1 11 19
  2 2 12 18
  4 4 13 17

I've got a solution using aggregate, but it gets very slow with any volume of data - the performance seems mostly IO bound, and it never finishes with a data set of ~6MB.

Here's how I'm currently trying to do this:

  d = data.frame(x=c(1,1,2,4),y=c(10,11,12,13),z=c(20,19,18,17))
  d.ordered = d[order(-d$y),]
  aggregate(d.ordered,by=list(key=d.ordered$x),FUN=function(x){x[1]})

I've tried to use split and unsplit, but unsplit complained about duplicate row names when reassembling the sub frames.

Thanks for your suggestions

-james
Phil Spector
2010-Apr-13 01:49 UTC
[R] efficiently picking one row from a data frame per unique key
James -

If I understand you correctly:

  getone = function(df)df[order(df$x,df$y),][1,]

describes what you want from each data frame corresponding to a unique value of x. Then, supposing that your data frame is called df:

  sdf = split(df,df$x)

will create a list of data frames for the unique values of x, and

  do.call(rbind,lapply(sdf,getone))

will return a data frame with one row for each unique value of x.

- Phil Spector
  Statistical Computing Facility
  Department of Statistics
  UC Berkeley
  spector at stat.berkeley.edu
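
The split / lapply / rbind recipe above can be tried end to end on the sample data from the original post. What follows is only a minimal sketch, not Phil's exact code. It assumes the row to keep in each group is the one with the largest y, as the order(-d$y) call in the question implies. The getone here therefore sorts by descending y, whereas the getone in the reply sorts ascending and would keep y = 10 for x = 1. The names d and result are just illustrative.

```r
# Sample data from the original post
d <- data.frame(x = c(1, 1, 2, 4),
                y = c(10, 11, 12, 13),
                z = c(20, 19, 18, 17))

# Return the single row with the largest y from one group.
# Assumption: "first row" means largest y, matching order(-d$y) in the post.
getone <- function(g) g[order(-g$y), ][1, ]

# One sub-frame per unique value of x
sdf <- split(d, d$x)

# Take one row from each sub-frame and stack the rows back together
result <- do.call(rbind, lapply(sdf, getone))
print(result)
```

Because getone is called once per unique key and rbind has to reassemble many one-row data frames, this approach may still be slow when the number of distinct x values is large.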
Peter Alspach
2010-Apr-13 01:52 UTC
[R] efficiently picking one row from a data frame per unique key
Tena koe James

You might try duplicated(), or more to the point !duplicated():

  orderedData[!duplicated(orderedData$x),]

HTH ....

Peter Alspach
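
For completeness, here is a small sketch of the !duplicated() idea applied to the sample data from the original post. It assumes orderedData is the data frame sorted so that the row to keep comes first within each x (here, largest y first, as in the question). The sort key order(d$x, -d$y) is that assumption, not something stated in the reply.

```r
# Sample data from the original post
d <- data.frame(x = c(1, 1, 2, 4),
                y = c(10, 11, 12, 13),
                z = c(20, 19, 18, 17))

# Sort by x, and by descending y within each x, so the wanted row leads its group
orderedData <- d[order(d$x, -d$y), ]

# duplicated() marks every repeat of an x value after its first occurrence;
# negating it keeps exactly the first row seen for each x
result <- orderedData[!duplicated(orderedData$x), ]
print(result)
```

Both order() and duplicated() work on whole vectors at once, with no per-group function call, so this version is likely to scale better than aggregate or split on larger data, though that has not been measured here.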