Given a dataframe like > data ID Type N 1 45900 A 1 2 45900 B 2 3 45900 C 3 4 45900 D 4 5 45900 E 5 6 45900 F 6 7 45900 I 7 8 49270 A 1 9 49270 B 2 10 49270 E 3 18 46550 A 1 19 46550 B 2 20 46550 C 3 21 46550 D 4 22 46550 E 5 23 46550 F 6 24 46550 I 7 > containing an identifier (ID), a variable type code (Type), and a running count of the number of records per ID (N), how can I return a dataframe of only those records with the maximum value of N for each ID? For instance, > data ID Type N 7 45900 I 7 10 49270 E 3 24 46550 I 7 I know that I can use > tapply ( data $ N , data $ ID , max ) 45900 46550 49270 7 7 3 > to get the values of the maximum N for each ID, but how is it that I can find the index of these values to subsequently use to subscript data? -- maxine-webber
Dear Max, By using "d" instead of "data" for your data set, here is one way: # First order the data by ID d <- with(d, d[order(ID),] ) # Then use tapply to get the indexes for the maximum values d[cumsum(with(d, tapply(N, ID, which.max))),] # ID Type N # 7 45900 I 7 # 24 46550 I 7 # 10 49270 E 3 HTH, Jorge On Wed, May 6, 2009 at 6:09 PM, Max Webber <ubbermax@gmail.com> wrote:> Given a dataframe like > > > data > ID Type N > 1 45900 A 1 > 2 45900 B 2 > 3 45900 C 3 > 4 45900 D 4 > 5 45900 E 5 > 6 45900 F 6 > 7 45900 I 7 > 8 49270 A 1 > 9 49270 B 2 > 10 49270 E 3 > 18 46550 A 1 > 19 46550 B 2 > 20 46550 C 3 > 21 46550 D 4 > 22 46550 E 5 > 23 46550 F 6 > 24 46550 I 7 > > > > containing an identifier (ID), a variable type code (Type), and > a running count of the number of records per ID (N), how can I > return a dataframe of only those records with the maximum value > of N for each ID? For instance, > > > data > ID Type N > 7 45900 I 7 > 10 49270 E 3 > 24 46550 I 7 > > I know that I can use > > > tapply ( data $ N , data $ ID , max ) > 45900 46550 49270 > 7 7 3 > > > > to get the values of the maximum N for each ID, but how is it > that I can find the index of these values to subsequently use to > subscript data? > > > -- > maxine-webber > > ______________________________________________ > R-help@r-project.org mailing list > https://stat.ethz.ch/mailman/listinfo/r-help > PLEASE do read the posting guide > http://www.R-project.org/posting-guide.html > and provide commented, minimal, self-contained, reproducible code. >[[alternative HTML version deleted]]
Ths should do it:> do.call(rbind, lapply(split(x, x$ID), tail, 1))ID Type N 45900 45900 I 7 46550 46550 I 7 49270 49270 E 3 On Wed, May 6, 2009 at 6:09 PM, Max Webber <ubbermax@gmail.com> wrote:> Given a dataframe like > > > data > ID Type N > 1 45900 A 1 > 2 45900 B 2 > 3 45900 C 3 > 4 45900 D 4 > 5 45900 E 5 > 6 45900 F 6 > 7 45900 I 7 > 8 49270 A 1 > 9 49270 B 2 > 10 49270 E 3 > 18 46550 A 1 > 19 46550 B 2 > 20 46550 C 3 > 21 46550 D 4 > 22 46550 E 5 > 23 46550 F 6 > 24 46550 I 7 > > > > containing an identifier (ID), a variable type code (Type), and > a running count of the number of records per ID (N), how can I > return a dataframe of only those records with the maximum value > of N for each ID? For instance, > > > data > ID Type N > 7 45900 I 7 > 10 49270 E 3 > 24 46550 I 7 > > I know that I can use > > > tapply ( data $ N , data $ ID , max ) > 45900 46550 49270 > 7 7 3 > > > > to get the values of the maximum N for each ID, but how is it > that I can find the index of these values to subsequently use to > subscript data? > > > -- > maxine-webber > > ______________________________________________ > R-help@r-project.org mailing list > https://stat.ethz.ch/mailman/listinfo/r-help > PLEASE do read the posting guide > http://www.R-project.org/posting-guide.html<http://www.r-project.org/posting-guide.html> > and provide commented, minimal, self-contained, reproducible code. >-- Jim Holtman Cincinnati, OH +1 513 646 9390 What is the problem that you are trying to solve? [[alternative HTML version deleted]]
Max, Since the dataset is sorted by ID, with ties broken by N, the following should do it and do it quickly. It grabs the rows just before ID changes.> with(data, data[ c(ID[-1] != ID[-length(ID)], TRUE),, drop=FALSE])ID Type N 7 45900 I 7 10 49270 E 3 24 46550 I 7 Bill Dunlap TIBCO Software Inc - Spotfire Division wdunlap tibco.com> -----Original Message----- > From: r-help-bounces at r-project.org > [mailto:r-help-bounces at r-project.org] On Behalf Of Max Webber > Sent: Wednesday, May 06, 2009 3:09 PM > To: r-help at r-project.org > Subject: [R] by-group processing > > Given a dataframe like > > > data > ID Type N > 1 45900 A 1 > 2 45900 B 2 > 3 45900 C 3 > 4 45900 D 4 > 5 45900 E 5 > 6 45900 F 6 > 7 45900 I 7 > 8 49270 A 1 > 9 49270 B 2 > 10 49270 E 3 > 18 46550 A 1 > 19 46550 B 2 > 20 46550 C 3 > 21 46550 D 4 > 22 46550 E 5 > 23 46550 F 6 > 24 46550 I 7 > > > > containing an identifier (ID), a variable type code (Type), and > a running count of the number of records per ID (N), how can I > return a dataframe of only those records with the maximum value > of N for each ID? For instance, > > > data > ID Type N > 7 45900 I 7 > 10 49270 E 3 > 24 46550 I 7 > > I know that I can use > > > tapply ( data $ N , data $ ID , max ) > 45900 46550 49270 > 7 7 3 > > > > to get the values of the maximum N for each ID, but how is it > that I can find the index of these values to subsequently use to > subscript data? > > > -- > maxine-webber > > ______________________________________________ > R-help at r-project.org mailing list > https://stat.ethz.ch/mailman/listinfo/r-help > PLEASE do read the posting guide > http://www.R-project.org/posting-guide.html > and provide commented, minimal, self-contained, reproducible code. >
how about: d=data[order(data$ID,-data$Type),] d[!duplicated(d$ID),] Max Webber wrote:> > Given a dataframe like > > > data > ID Type N > 1 45900 A 1 > 2 45900 B 2 > 3 45900 C 3 > 4 45900 D 4 > 5 45900 E 5 > 6 45900 F 6 > 7 45900 I 7 > 8 49270 A 1 > 9 49270 B 2 > 10 49270 E 3 > 18 46550 A 1 > 19 46550 B 2 > 20 46550 C 3 > 21 46550 D 4 > 22 46550 E 5 > 23 46550 F 6 > 24 46550 I 7 > > > > containing an identifier (ID), a variable type code (Type), and > a running count of the number of records per ID (N), how can I > return a dataframe of only those records with the maximum value > of N for each ID? For instance, > > > data > ID Type N > 7 45900 I 7 > 10 49270 E 3 > 24 46550 I 7 > > I know that I can use > > > tapply ( data $ N , data $ ID , max ) > 45900 46550 49270 > 7 7 3 > > > > to get the values of the maximum N for each ID, but how is it > that I can find the index of these values to subsequently use to > subscript data? > > > -- > maxine-webber > > ______________________________________________ > R-help at r-project.org mailing list > https://stat.ethz.ch/mailman/listinfo/r-help > PLEASE do read the posting guide > http://www.R-project.org/posting-guide.html > and provide commented, minimal, self-contained, reproducible code. > >-- View this message in context: http://www.nabble.com/by-group-processing-tp23417208p23437592.html Sent from the R help mailing list archive at Nabble.com.
On Thursday, May 07, 2009 7:45 PM, David Freedman wrote: > ...how about: > d=data[order(data$ID,-data$Type),] > d[!duplicated(d$ID),] Does the "-data$Type" argument to the order function work? -- David ? ----------------------------------------------------- David Huffer, Ph.D. Senior Statistician CSOSA/Washington, DC david.huffer at csosa.gov
On Wed, May 6, 2009 at 8:12 PM, jim holtman <jholtman at gmail.com> wrote:> Ths should do it: > >> do.call(rbind, lapply(split(x, x$ID), tail, 1)) > ? ? ? ? ID Type N > 45900 45900 ? ?I 7 > 46550 46550 ? ?I 7 > 49270 49270 ? ?E 3Or with plyr: library(plyr) ddply(x, "id", tail, 1) plyr encapsulates the common split-apply-combine strategy and takes cares of the details for you. Read more about it on http://had.co.nz/plyr Hadley -- http://had.co.nz/