thr3ads.net - R help - [R] Calculation of group summaries [Jul 2005]

Seeliger.Curt@epamail.epa.gov

2005-Jul-12 17:51 UTC

[R] Calculation of group summaries

I know R has a steep learning curve, but from where I stand the slope
looks like a sheer cliff.  I'm pawing through the available docs and
have come across examples which come close to what I want but are
proving difficult for me to modify for my use.

Calculating simple group means is fairly straightforward:
  data(PlantGrowth)
  attach(PlantGrowth)
  # mean of each unstacked group column
  stack(sapply(unstack(PlantGrowth), mean))

I'd like to do something slightly more complex, using a data frame and
groups identified by unique combinations of three id variables.  There
may be thousands of such combinations in the data.  This is easy in SQL:

  select year,
         site_id,
         visit_no,
         mean(undercut) AS meanUndercut,
         count(undercut) AS nUndercut,
         std(undercut) AS stdUndercut
  from channelMorphology
  group by year, site_id, visit_no
      ;

Reading a CSV written by SAS and selecting only records expected to have
values is also straightforward in R, but getting those summary values
for each site visit is currently beyond me:

  sub<-read.csv('c:/data/channelMorphology.csv'
               ,header=TRUE
               ,na.strings='.'
               ,sep=','
               ,strip.white=TRUE
               )

  undercut<-subset(sub
                  ,TRANSDIR %in% c('LF','RT')
                  ,select=c('YEAR','SITE_ID','VISIT_NO','TRANSECT','TRANSDIR'
                           ,'UNDERCUT'
                           )
                  ,drop=TRUE
                  )


Thanks all for your help.
cur
--
Curt Seeliger, Data Ranger
CSC, EPA/WED contractor
541/754-4638
seeliger.curt at epa.gov

Francisco J. Zagmutt

2005-Jul-12 18:34 UTC

[R] Calculation of group summaries

Take a look at ?aggregate ?ave and ?tapply
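
A minimal sketch of how the three functions differ, assuming a small made-up data frame. The name `chan` and its values are hypothetical; the column names are borrowed from the SQL in the question.

```r
# Hypothetical toy data; column names follow the SQL in the question.
chan <- data.frame(
  year     = c(1999, 1999, 1999, 2000, 2000),
  site_id  = c("site1", "site1", "site2", "site2", "site2"),
  visit_no = c(1, 1, 1, 1, 1),
  undercut = c(10, 12, 8, 9, 11)
)

# aggregate(): one row per observed id combination, one statistic per call.
means <- aggregate(undercut ~ year + site_id + visit_no, data = chan, FUN = mean)

# tapply(): same grouping, but the result is an array indexed by the id
# levels, with NA where a combination does not occur.
sds <- tapply(chan$undercut, chan[c("year", "site_id", "visit_no")], sd)

# ave(): the group statistic (mean by default) repeated on every input row.
chan$groupMean <- ave(chan$undercut, chan$year, chan$site_id, chan$visit_no)

print(means)
```

aggregate() is probably the closest match to the SQL, since it returns the id columns alongside the statistic; tapply() and ave() are handier when the result has to line up with an array or with the original rows.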

Cheers

Francisco

Frank E Harrell Jr

2005-Jul-12 19:57 UTC

[R] Calculation of group summaries

See http://biostat.mc.vanderbilt.edu/twiki/bin/view/Main/SasByMeansExample
for one example.

Frank



-- 
Frank E Harrell Jr   Professor and Chair           School of Medicine
                      Department of Biostatistics   Vanderbilt University

Seeliger.Curt@epamail.epa.gov

2005-Jul-15 00:31 UTC

[R] Calculation of group summaries

Several people suggested specific functions (by, tapply, sapply and
others); thanks for not blowing off a simple question regarding how to
do the following SQL in R:

>   select year,
>          site_id,
>          visit_no,
>          mean(undercut) AS meanUndercut,
>          count(undercut) AS nUndercut,
>          std(undercut) AS stdUndercut
>   from channelMorphology
>   group by year, site_id, visit_no
>   ;
I'd spent quite a bit of time with the suggested functions earlier but
had no luck as I'd misread the docs and put the entire dataframe where
it only wants the columns to be processed.  Sometimes it's the simplest
of things.

This has led to another confoundment: sd() acts differently than
mean(), at least with R 1.9.0.  For some reason, mean() generates NA
results and a warning message for each group:

  argument is not numeric or logical: returning NA in: mean.default(data[x, ], ...)

Of course, the argument is numeric, or there'd be no sd value.  Or more
likely, I'm still missing something really basic. If I wrap the value in
as.numeric() things work fine.  Why should I have to do this for mean
and median, but not sd? The code below should reproduce this error:

  # Fake data for demo:
  nsites<-6
  yearList<-1999:2001
  fakesub<-as.data.frame(cbind(
                 year     =rep(yearList,nsites/length(yearList),each=11)
                ,site_id  =rep(c('site1','site2'),each=11*nsites)
                ,visit_no =rep(1,11*2*nsites)
                ,transect =rep(LETTERS[1:11],nsites,each=2)
                ,transdir =rep(c('LF','RT'),11*nsites)
                ,undercut =abs(rnorm(11*2*nsites,10))
                ,angle    =runif(11*2*nsites,0,180)
                ))

  # Create group summaries:
  sdmets<-by(fakesub$undercut
            ,list(fakesub$year,fakesub$site_id,fakesub$visit_no)
            ,sd
            )
  nmets<-by(fakesub$undercut
           ,list(fakesub$year,fakesub$site_id,fakesub$visit_no)
           ,length
           )
  xmets<-by(fakesub$undercut
           ,list(fakesub$year,fakesub$site_id,fakesub$visit_no)
           ,mean
           )
  # Same call with the column wrapped in as.numeric(), which avoids the warning:
  xmets<-by(as.numeric(fakesub$undercut)
           ,list(fakesub$year,fakesub$site_id,fakesub$visit_no)
           ,mean
           )

  # Put site id values (year, site_id and visit_no) into results:
  # List unique id combinations as a list of lists.  Then
  # reorganize that into 3 vectors for final results.
  # Certainly, there MUST be a better way...
  foo<-strsplit(unique(paste(fakesub$year
                            ,fakesub$site_id
                            ,fakesub$visit_no
                            ,sep='#'))
               ,split='#'
               )
  year<-list()
  for(i in 1:length(foo)) {year<-rbind(year,foo[[i]][1])}
  site_id<-list()
  for(i in 1:length(foo)) {site_id<-rbind(site_id,foo[[i]][2])}
  visit_no<-list()
  for(i in 1:length(foo)) {visit_no<-rbind(visit_no,foo[[i]][3])}

  # Final result, more or less
  data.frame(cbind(a=year,b=site_id,c=visit_no,sdmets,nmets,xmets))
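
One possible "better way" to carry the id values along, sketched under the assumption that the data are held in an ordinary data frame with a numeric undercut column. The data frame `fake` and its values are made up for illustration; aggregate() and merge() are base R.

```r
set.seed(1)  # reproducible fake data (hypothetical values)
fake <- data.frame(
  year     = rep(1999:2001, each = 4),
  site_id  = rep(c("site1", "site2"), times = 6),
  visit_no = 1,
  undercut = abs(rnorm(12, 10))
)

ids <- fake[c("year", "site_id", "visit_no")]

# One aggregate() call per statistic; each returns the id columns plus the value.
xmets  <- aggregate(fake["undercut"], by = ids, FUN = mean)
sdmets <- aggregate(fake["undercut"], by = ids, FUN = sd)
nmets  <- aggregate(fake["undercut"], by = ids, FUN = length)

# Rename the value columns, then join on the shared id columns.
names(xmets)[4]  <- "meanUndercut"
names(sdmets)[4] <- "stdUndercut"
names(nmets)[4]  <- "nUndercut"
result <- merge(merge(xmets, nmets), sdmets)
print(result)
```

Because aggregate() keeps the grouping columns in its output, no string splitting or loops are needed to recover year, site_id and visit_no.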


cur

--
Curt Seeliger, Data Ranger
CSC, EPA/WED contractor
541/754-4638
seeliger.curt at epa.gov

Søren Højsgaard

2005-Jul-15 09:59 UTC

[R] Calculation of group summaries

Perhaps I lost track of what the original question was, but on my homepage
http://genetics.agrsci.dk/~sorenh/misc/ there is a package called doBy in which
there is a function called summaryBy (which mimics PROC SUMMARY from SAS). For
example:

  summaryBy(cbind(Weight,Feed)~Evit+Cu, data=dietox12, FUN=c(mean,myvar))

Søren


-----Original message-----
From: r-help-bounces at stat.math.ethz.ch [mailto:r-help-bounces at stat.math.ethz.ch] On behalf of Gabor Grothendieck
Sent: 15 July 2005 04:43
To: Seeliger.Curt at epamail.epa.gov
Cc: R-Help
Subject: Re: [R] Calculation of group summaries

1. Try using more spaces so your code is easier to read.

2. Use data.frame to define your data frame (since the method in your post
creates data frames of factors rather than the desired classes).

3. Given an appropriate function, f, a single 'by' call whose per-group
results are rbind'ed together, as shown, will create the result.

nsites <- 6
yearList <- 1999:2001
fakesub <- data.frame(
	year = rep(yearList, nsites/length(yearList), each = 11),
	site_id  = rep(c('site1','site2'), each = 11*nsites),
	visit_no = rep(1, 11*2*nsites),
	transect = rep(LETTERS[1:11], nsites, each = 2),
	transdir = rep(c('LF','RT'), 11*nsites),
	undercut = abs(rnorm(11*2*nsites, 10)),
	angle    = runif(11*2*nsites, 0, 180)
)


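# data.frame() rather than cbind() here, so the id columns keep their own
# classes and the statistics stay numeric
f <- function(x) data.frame(year = x[1,1], site_id = x[1,2], visit_no = x[1,3],
	mean = mean(x[,6]), sd = sd(x[,6]), length = length(x[,6]))
do.call("rbind", by(fakesub, fakesub[,1:3], f))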
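
Point 2 above can be checked directly. The sketch below uses two made-up rows (hypothetical values) to show why as.data.frame(cbind(...)) leads to the "argument is not numeric or logical" warning from mean(), while data.frame() does not.

```r
# cbind() builds a matrix, and a matrix holds one type only: mixing numbers
# with strings turns every column into text.
m <- cbind(site_id = c("site1", "site2"), undercut = c(10.5, 12.25))
bad <- as.data.frame(m)
is.numeric(bad$undercut)   # FALSE: character or factor, depending on R version

# mean() refuses non-numeric input and returns NA with a warning.
bad_mean <- suppressWarnings(mean(bad$undercut))

# Converting via as.character() first is safe for both characters and factors;
# as.numeric() applied directly to a factor would return level codes instead.
fixed_mean <- mean(as.numeric(as.character(bad$undercut)))

# data.frame() keeps each column's own class, so no conversion is needed.
good <- data.frame(site_id = c("site1", "site2"), undercut = c(10.5, 12.25))
good_mean <- mean(good$undercut)

print(c(bad = bad_mean, fixed = fixed_mean, good = good_mean))
```

If the columns were stored as factors, a bare as.numeric() would silently average the level codes rather than the measurements, so the as.character() step is worth keeping whenever the column class is in doubt.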





______________________________________________
R-help at stat.math.ethz.ch mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide! http://www.R-project.org/posting-guide.html
