thr3ads.net - R help - [R] Group averages [Jun 2006]

David Kling

2006-Jun-12 21:19 UTC

[R] Group averages

Hello:

I hope none of you will mind helping a newbie.  I'm a student research 
assistant working with a large data set in which observations are 
categorized according to two factors. For each observation, I'm trying 
to calculate the group mean and variance of a variable (called 'hsgpa' 
in the example data presented below), excluding that observation.  For 
example, if there are 20 observations with the same value of the two 
factors, for each of the 20 I'd like to generate the mean and variance 
of the 'hsgpa' values of the other 19 group members.  This must be done 
for every observation in the data set.

I've searched the R mail archives, read the manuals, and read 
documentation for tapply() and by() as well as summaryBy() in the 'doBy' 
package and with() from 'Hmisc.'  It may be that since I'm new to 
writing functions and R is the first language I've ever worked with I'm 
less able to come up with a solution than some other new R users.  None 
of the functions I have tried have been successful, and it doesn't seem 
worth it to reproduce and explain my best effort.  I hope someone has 
some ideas!  Looking at what an experienced user would try should help 
me with my present task as well as future problems.

Below I've included some lines that will generate a sample data set 
similar to the one I'm working with:

#
#Example data:
#
case <- sample(seq(1,10000,1),5000,replace=FALSE)
hsgpa <- rbeta(5000,7,1.5)*4.25
yr <- sample(seq(1993,2005,1),5000,replace=TRUE)
conf <- sample(letters[1:5],5000,replace=TRUE)
data <- data.frame(case=case,hsgpa=hsgpa,yr=yr,conf=conf)
data$conf <- as.character(data$conf)
s1 <- sample(seq(1,5000,1),500,replace=FALSE)
k <- data$hsgpa
k[row.names(data) %in% s1] <- NA
data$hsgpa <- k
s2 <- sample(seq(1,5000,1),100,replace=FALSE)
k <- data$yr
k[row.names(data) %in% s2] <- NA
data$yr <- k
k <- data$conf
k[row.names(data) %in% s2] <- NA
data$conf <- k
remove(case,hsgpa,yr,conf,s1,s2,k)
#

jim holtman

2006-Jun-12 22:25 UTC

Not exactly sure what you mean, but here is something that might be close.
I used only a subset of your data to see if this is what you want.  This
computes the mean of all hsgpa, excluding that row:
> data[x[['2005.e']],]  # subset of your data for yr=2005, conf='e'
     case    hsgpa   yr conf
73   3442 3.406104 2005    e
216  3017 4.071830 2005    e
284  3626 3.418870 2005    e
797  2184 3.459729 2005    e
881  3030 3.147831 2005    e
1030 9600 4.140025 2005    e
1071 1972 3.423202 2005    e
1100 8293 3.880199 2005    e
1219 5162 3.470179 2005    e
1276 5905 3.533801 2005    e
1312 3785 3.521670 2005    e
1363 8880 2.975047 2005    e
1426  123 3.070349 2005    e
1427  947       NA 2005    e
1475 3592 3.955794 2005    e
1635  366 3.172360 2005    e
1708 5257 3.612822 2005    e
1736 6256       NA 2005    e
1831 2112 3.719371 2005    e
1943 6528 3.322816 2005    e
1997  553       NA 2005    e
2208 2849 3.657016 2005    e
2240 6543       NA 2005    e
2360 9360       NA 2005    e
2611 4354 3.123671 2005    e
2659 1444 4.080455 2005    e
2704 9502       NA 2005    e
2714 8594 3.657861 2005    e
2732 4453 2.251620 2005    e
2778  875 3.913294 2005    e
2802 4022 3.970620 2005    e
2884 4473 3.650706 2005    e
2945  181 3.777851 2005    e
3059 6755 3.809683 2005    e
3327 8153       NA 2005    e
3380 3737 3.676996 2005    e
3404 4419 2.306697 2005    e
3577 3577 4.196025 2005    e
3608  457 4.150389 2005    e
3857 8642 3.220720 2005    e
3967  482 2.147233 2005    e
4122 4363       NA 2005    e
4185  651 4.087515 2005    e
4226  544 4.153056 2005    e
4362 1496 3.835143 2005    e
4475 1614 3.978524 2005    e
4680 6883 3.633342 2005    e
4739 5212       NA 2005    e
4843 3515 3.020855 2005    e
4867 2580 3.814048 2005    e
4887 7937 3.797753 2005    e
> y <- data[x[['2005.e']],]
> str(y)
`data.frame':   51 obs. of  4 variables:
 $ case : num  3442 3017 3626 2184 3030 ...
 $ hsgpa: num  3.41 4.07 3.42 3.46 3.15 ...
 $ yr   : num  2005 2005 2005 2005 2005 ...
 $ conf : chr  "e" "e" "e" "e" ...
> # compute the mean of all except the given row
> sapply(seq(nrow(y)), function(x) mean(y$hsgpa[-x],na.rm=TRUE))
 [1] 3.556268 3.540030 3.555956 3.554960 3.562567 3.538367 3.555851 3.544704 3.554705 3.553153
[11] 3.553449 3.566781 3.564457 3.552692 3.542861 3.561969 3.551226 3.552692 3.548627 3.558299
[21] 3.552692 3.550148 3.552692 3.552692 3.563156 3.539820 3.552692 3.550127 3.584426 3.543897
[31] 3.542499 3.550302 3.547201 3.546424 3.552692 3.549660 3.583082 3.537001 3.538114 3.560789
[41] 3.586972 3.552692 3.539648 3.538049 3.545803 3.542306 3.550725 3.552692 3.565664 3.546318
[51] 3.546715
> y$mean <- sapply(seq(nrow(y)), function(x) mean(y$hsgpa[-x],na.rm=TRUE))
> y
     case    hsgpa   yr conf     mean
73   3442 3.406104 2005    e 3.556268
216  3017 4.071830 2005    e 3.540030
284  3626 3.418870 2005    e 3.555956
797  2184 3.459729 2005    e 3.554960
881  3030 3.147831 2005    e 3.562567
1030 9600 4.140025 2005    e 3.538367
1071 1972 3.423202 2005    e 3.555851
1100 8293 3.880199 2005    e 3.544704
1219 5162 3.470179 2005    e 3.554705
1276 5905 3.533801 2005    e 3.553153
1312 3785 3.521670 2005    e 3.553449
1363 8880 2.975047 2005    e 3.566781
1426  123 3.070349 2005    e 3.564457
1427  947       NA 2005    e 3.552692
1475 3592 3.955794 2005    e 3.542861
1635  366 3.172360 2005    e 3.561969
1708 5257 3.612822 2005    e 3.551226
1736 6256       NA 2005    e 3.552692
1831 2112 3.719371 2005    e 3.548627
1943 6528 3.322816 2005    e 3.558299
1997  553       NA 2005    e 3.552692
2208 2849 3.657016 2005    e 3.550148
2240 6543       NA 2005    e 3.552692
2360 9360       NA 2005    e 3.552692
2611 4354 3.123671 2005    e 3.563156
2659 1444 4.080455 2005    e 3.539820
2704 9502       NA 2005    e 3.552692
2714 8594 3.657861 2005    e 3.550127
2732 4453 2.251620 2005    e 3.584426
2778  875 3.913294 2005    e 3.543897
2802 4022 3.970620 2005    e 3.542499
2884 4473 3.650706 2005    e 3.550302
2945  181 3.777851 2005    e 3.547201
3059 6755 3.809683 2005    e 3.546424
3327 8153       NA 2005    e 3.552692
3380 3737 3.676996 2005    e 3.549660
3404 4419 2.306697 2005    e 3.583082
3577 3577 4.196025 2005    e 3.537001
3608  457 4.150389 2005    e 3.538114
3857 8642 3.220720 2005    e 3.560789
3967  482 2.147233 2005    e 3.586972
4122 4363       NA 2005    e 3.552692
4185  651 4.087515 2005    e 3.539648
4226  544 4.153056 2005    e 3.538049
4362 1496 3.835143 2005    e 3.545803
4475 1614 3.978524 2005    e 3.542306
4680 6883 3.633342 2005    e 3.550725
4739 5212       NA 2005    e 3.552692
4843 3515 3.020855 2005    e 3.565664
4867 2580 3.814048 2005    e 3.546318
4887 7937 3.797753 2005    e 3.546715
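
The sapply() call above handles a single yr/conf subset. As a rough sketch of how the same leave-one-out mean might be applied to every group at once, one option is base R's ave(). The toy data frame `d` and the helper name `loo_mean` are made up for illustration and are not from the thread, and the sketch assumes the grouping columns contain no NAs (rows with NA in yr or conf fall outside every group and would need separate handling).

```r
# Toy data standing in for the thread's `data` (values are made up).
d <- data.frame(
  hsgpa = c(3.0, 3.5, NA, 4.0, 2.0, 3.0),
  yr    = c(2005, 2005, 2005, 2005, 2004, 2004),
  conf  = c("e", "e", "e", "e", "a", "a")
)

# Hypothetical helper: for each position i, the mean of all other values in v.
# Same idea as the sapply() call above; vapply() keeps the result numeric
# even when v is empty, which matters for empty yr/conf combinations.
loo_mean <- function(v) {
  vapply(seq_along(v), function(i) mean(v[-i], na.rm = TRUE), numeric(1))
}

# ave() applies loo_mean within each yr/conf combination and returns the
# results in the original row order.
d$loo_mean <- ave(d$hsgpa, d$yr, d$conf, FUN = loo_mean)
print(d)
```

A group with only one non-missing value has nothing left to average once that row is excluded, so mean() returns NaN for that row.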


-- 
Jim Holtman
Cincinnati, OH

What is the problem you are trying to solve?

Gabor Grothendieck

2006-Jun-12 22:25 UTC

Assuming that yr and conf are the two factors referred to in the
description, create a function f which calculates the ith row
of the output and use sapply like this:

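attach(data)
# f(i): mean and variance of hsgpa over the other rows that share
# row i's conf and yr; NA values are dropped
f <- function(i) {
	hsgpa <- na.omit(hsgpa[-i][conf[-i] == conf[i] & yr[-i] == yr[i]])
	if (length(hsgpa)) c(mean = mean(hsgpa), var = var(hsgpa))
	else c(mean = NA, var = NA)
}
# one row per observation, with columns "mean" and "var"
out <- t(sapply(1:nrow(data), f))
detach(data)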
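
The row-by-row function above rescans the whole data set for every observation. A possible alternative, sketched here and not taken from the thread, is to compute each group's count, sum, and sum of squares once and then subtract each row's own contribution. The toy data frame `d` and all variable names below are assumptions for illustration; the grouping columns are assumed to contain no NAs, and the sum-of-squares form of the variance can lose precision when the variance is tiny relative to the mean.

```r
# Toy data (made-up values); yr and conf are assumed to have no NAs.
d <- data.frame(
  hsgpa = c(3.0, 3.5, NA, 4.0, 2.0, 3.0, 2.5),
  yr    = c(2005, 2005, 2005, 2005, 2004, 2004, 2004),
  conf  = c("e", "e", "e", "e", "a", "a", "a")
)

x  <- d$hsgpa
ok <- !is.na(x)
x0 <- ifelse(ok, x, 0)   # NAs contribute nothing to the sums

# Per-row copies of each group's count, sum, and sum of squares (non-NA only).
n <- ave(as.numeric(ok), d$yr, d$conf, FUN = sum)
s <- ave(x0,             d$yr, d$conf, FUN = sum)
q <- ave(x0^2,           d$yr, d$conf, FUN = sum)

# Remove row i's own contribution (nothing to remove if its hsgpa is NA).
n1 <- n - ok
s1 <- s - x0
q1 <- q - x0^2

# Mean and sample variance of the remaining group members;
# NA when too few values remain.
d$loo_mean <- ifelse(n1 > 0, s1 / n1, NA)
d$loo_var  <- ifelse(n1 > 1, (q1 - s1^2 / n1) / (n1 - 1), NA)
print(d)
```

On a few thousand rows either approach should be fast enough; the group-sum version mainly helps when the data set is much larger.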

David Kling

2006-Jun-12 23:48 UTC

Thanks!  Both responders understood what I was after despite my poor 
explanation and came up with very helpful responses.  If anyone else has 
an idea, please share!

David Kling
