thr3ads.net - R help - [R] by-group processing [May 2009]

If this information is useful, please help other people find it:
Share via:

Max Webber

2009-May-06 22:09 UTC

[R] by-group processing

Given a dataframe like

  > data
        ID Type N
  1  45900    A 1
  2  45900    B 2
  3  45900    C 3
  4  45900    D 4
  5  45900    E 5
  6  45900    F 6
  7  45900    I 7
  8  49270    A 1
  9  49270    B 2
  10 49270    E 3
  18 46550    A 1
  19 46550    B 2
  20 46550    C 3
  21 46550    D 4
  22 46550    E 5
  23 46550    F 6
  24 46550    I 7
  >

containing an identifier (ID), a variable type code (Type), and
a running count of the number of records per ID (N), how can I
return a dataframe of only those records with the maximum value
of N for each ID? For instance,

  > data
        ID Type N
  7  45900    I 7
  10 49270    E 3
  24 46550    I 7

I know that I can use

   > tapply ( data $ N , data $ ID , max )
   45900 46550 49270
       7     7     3
   >

to get the values of the maximum N for each ID, but how is it
that I can find the index of these values to subsequently use to
subscript data?


--
maxine-webber

Jorge Ivan Velez

2009-May-07 00:17 UTC

head link

[R] by-group processing

Dear Max,
By using "d" instead of "data" for your data set, here is
one way:

# First order the data by ID
d <- with(d, d[order(ID),] )

# Then use tapply to get the indexes for the maximum values
d[cumsum(with(d, tapply(N, ID, which.max))),]
# ID Type N
# 7  45900    I 7
# 24 46550    I 7
# 10 49270    E 3

HTH,

Jorge


On Wed, May 6, 2009 at 6:09 PM, Max Webber <ubbermax@gmail.com> wrote:
> Given a dataframe like
>
>  > data
>        ID Type N
>  1  45900    A 1
>  2  45900    B 2
>  3  45900    C 3
>  4  45900    D 4
>  5  45900    E 5
>  6  45900    F 6
>  7  45900    I 7
>  8  49270    A 1
>  9  49270    B 2
>  10 49270    E 3
>  18 46550    A 1
>  19 46550    B 2
>  20 46550    C 3
>  21 46550    D 4
>  22 46550    E 5
>  23 46550    F 6
>  24 46550    I 7
>  >
>
> containing an identifier (ID), a variable type code (Type), and
> a running count of the number of records per ID (N), how can I
> return a dataframe of only those records with the maximum value
> of N for each ID? For instance,
>
>  > data
>        ID Type N
>  7  45900    I 7
>  10 49270    E 3
>  24 46550    I 7
>
> I know that I can use
>
>   > tapply ( data $ N , data $ ID , max )
>   45900 46550 49270
>       7     7     3
>   >
>
> to get the values of the maximum N for each ID, but how is it
> that I can find the index of these values to subsequently use to
> subscript data?
>
>
> --
> maxine-webber
>
> ______________________________________________
> R-help@r-project.org mailing list
> https://stat.ethz.ch/mailman/listinfo/r-help
> PLEASE do read the posting guide
> http://www.R-project.org/posting-guide.html
> and provide commented, minimal, self-contained, reproducible code.
>
	[[alternative HTML version deleted]]

jim holtman

2009-May-07 01:12 UTC

head link

[R] by-group processing

Ths should do it:
> do.call(rbind, lapply(split(x, x$ID), tail, 1))         ID Type N
45900 45900    I 7
46550 46550    I 7
49270 49270    E 3


On Wed, May 6, 2009 at 6:09 PM, Max Webber <ubbermax@gmail.com> wrote:
> Given a dataframe like
>
>  > data
>        ID Type N
>  1  45900    A 1
>  2  45900    B 2
>  3  45900    C 3
>  4  45900    D 4
>  5  45900    E 5
>  6  45900    F 6
>  7  45900    I 7
>  8  49270    A 1
>  9  49270    B 2
>  10 49270    E 3
>  18 46550    A 1
>  19 46550    B 2
>  20 46550    C 3
>  21 46550    D 4
>  22 46550    E 5
>  23 46550    F 6
>  24 46550    I 7
>  >
>
> containing an identifier (ID), a variable type code (Type), and
> a running count of the number of records per ID (N), how can I
> return a dataframe of only those records with the maximum value
> of N for each ID? For instance,
>
>  > data
>        ID Type N
>  7  45900    I 7
>  10 49270    E 3
>  24 46550    I 7
>
> I know that I can use
>
>   > tapply ( data $ N , data $ ID , max )
>   45900 46550 49270
>       7     7     3
>   >
>
> to get the values of the maximum N for each ID, but how is it
> that I can find the index of these values to subsequently use to
> subscript data?
>
>
> --
> maxine-webber
>
> ______________________________________________
> R-help@r-project.org mailing list
> https://stat.ethz.ch/mailman/listinfo/r-help
> PLEASE do read the posting guide
>
http://www.R-project.org/posting-guide.html<http://www.r-project.org/posting-guide.html>
> and provide commented, minimal, self-contained, reproducible code.
>


-- 
Jim Holtman
Cincinnati, OH
+1 513 646 9390

What is the problem that you are trying to solve?

	[[alternative HTML version deleted]]

William Dunlap

2009-May-07 19:42 UTC

head link

[R] by-group processing

Max,

Since the dataset is sorted by ID, with ties broken by N, the following
should do it and do it quickly.  It grabs the rows just before ID
changes.
> with(data, data[ c(ID[-1] != ID[-length(ID)], TRUE),, drop=FALSE])      ID Type N
7  45900    I 7
10 49270    E 3
24 46550    I 7

Bill Dunlap
TIBCO Software Inc - Spotfire Division
wdunlap tibco.com  
> -----Original Message-----
> From: r-help-bounces at r-project.org 
> [mailto:r-help-bounces at r-project.org] On Behalf Of Max Webber
> Sent: Wednesday, May 06, 2009 3:09 PM
> To: r-help at r-project.org
> Subject: [R] by-group processing
> 
> Given a dataframe like
> 
>   > data
>         ID Type N
>   1  45900    A 1
>   2  45900    B 2
>   3  45900    C 3
>   4  45900    D 4
>   5  45900    E 5
>   6  45900    F 6
>   7  45900    I 7
>   8  49270    A 1
>   9  49270    B 2
>   10 49270    E 3
>   18 46550    A 1
>   19 46550    B 2
>   20 46550    C 3
>   21 46550    D 4
>   22 46550    E 5
>   23 46550    F 6
>   24 46550    I 7
>   >
> 
> containing an identifier (ID), a variable type code (Type), and
> a running count of the number of records per ID (N), how can I
> return a dataframe of only those records with the maximum value
> of N for each ID? For instance,
> 
>   > data
>         ID Type N
>   7  45900    I 7
>   10 49270    E 3
>   24 46550    I 7
> 
> I know that I can use
> 
>    > tapply ( data $ N , data $ ID , max )
>    45900 46550 49270
>        7     7     3
>    >
> 
> to get the values of the maximum N for each ID, but how is it
> that I can find the index of these values to subsequently use to
> subscript data?
> 
> 
> --
> maxine-webber
> 
> ______________________________________________
> R-help at r-project.org mailing list
> https://stat.ethz.ch/mailman/listinfo/r-help
> PLEASE do read the posting guide 
> http://www.R-project.org/posting-guide.html
> and provide commented, minimal, self-contained, reproducible code.
>

David Freedman

2009-May-07 23:44 UTC

head link

[R] by-group processing

how about:

d=data[order(data$ID,-data$Type),]
d[!duplicated(d$ID),]


Max Webber wrote:> 
> Given a dataframe like
> 
>   > data
>         ID Type N
>   1  45900    A 1
>   2  45900    B 2
>   3  45900    C 3
>   4  45900    D 4
>   5  45900    E 5
>   6  45900    F 6
>   7  45900    I 7
>   8  49270    A 1
>   9  49270    B 2
>   10 49270    E 3
>   18 46550    A 1
>   19 46550    B 2
>   20 46550    C 3
>   21 46550    D 4
>   22 46550    E 5
>   23 46550    F 6
>   24 46550    I 7
>   >
> 
> containing an identifier (ID), a variable type code (Type), and
> a running count of the number of records per ID (N), how can I
> return a dataframe of only those records with the maximum value
> of N for each ID? For instance,
> 
>   > data
>         ID Type N
>   7  45900    I 7
>   10 49270    E 3
>   24 46550    I 7
> 
> I know that I can use
> 
>    > tapply ( data $ N , data $ ID , max )
>    45900 46550 49270
>        7     7     3
>    >
> 
> to get the values of the maximum N for each ID, but how is it
> that I can find the index of these values to subsequently use to
> subscript data?
> 
> 
> --
> maxine-webber
> 
> ______________________________________________
> R-help at r-project.org mailing list
> https://stat.ethz.ch/mailman/listinfo/r-help
> PLEASE do read the posting guide
> http://www.R-project.org/posting-guide.html
> and provide commented, minimal, self-contained, reproducible code.
> 
> 
-- 
View this message in context:
http://www.nabble.com/by-group-processing-tp23417208p23437592.html
Sent from the R help mailing list archive at Nabble.com.

David Huffer

2009-May-08 14:57 UTC

head link

[R] by-group processing

On Thursday, May 07, 2009 7:45 PM, David Freedman wrote:

 > ...how about:
 >   d=data[order(data$ID,-data$Type),]
 >   d[!duplicated(d$ID),]

Does the "-data$Type" argument to the order function work?

--
 David
?
 -----------------------------------------------------
 David Huffer, Ph.D.               Senior Statistician
 CSOSA/Washington, DC           david.huffer at csosa.gov

hadley wickham

2009-May-09 01:11 UTC

head link

[R] by-group processing

On Wed, May 6, 2009 at 8:12 PM, jim holtman <jholtman at gmail.com>
wrote:> Ths should do it:
>
>> do.call(rbind, lapply(split(x, x$ID), tail, 1))
> ? ? ? ? ID Type N
> 45900 45900 ? ?I 7
> 46550 46550 ? ?I 7
> 49270 49270 ? ?E 3

Or with plyr:

library(plyr)
ddply(x, "id", tail, 1)

plyr encapsulates the common split-apply-combine strategy and takes
cares of the details for you.  Read more about it on
http://had.co.nz/plyr

Hadley

-- 
http://had.co.nz/

Seemingly Similar Threads

Search for more seemingly similar threads

R help - May 2009 - by-group processing

[R] by-group processing

[R] by-group processing

[R] by-group processing

[R] by-group processing

[R] by-group processing

[R] by-group processing

[R] by-group processing

Seemingly Similar Threads