Adam D. I. Kramer
2008-May-02 22:43 UTC
[R] Coercing by/tapply to data.frame for more than two indices?
Dear Colleagues,

Apologies for a long email to ask what I feel may be a very simple question; I figure it's better to overspecify my situation.

I was asked a question, recently, by a colleague in my department about pre-aggregating variables, i.e., computing the mean of defined subsets of a data frame. Naturally, I thought of the 'by' and 'tapply' functions, as they have always been the solution for me. However, my colleague had three indices, and as such needs to pay attention to the indices of the output...this is to say, the "create an array" behavior of tapply doesn't quite work because an array is not quite what we want.

Consider this data set:

    df <- data.frame(var1=  factor(rep(rep(1:5,25*5),10)),
                     var2=  factor(rep(rep(1:5,each=25*5),10)),
                     trial= rep(rep(1:25,25),10),
                     id=    factor(rep(1:10,each=5*5*25)),
                     score= rnorm(n=5*5*25*10) )

...this is to say, each of 10 ids has scores for 5 different levels of var1 and 5 different levels of var2...across 25 trials. Basically, a three-way crossed repeated measures design...where tapply does what I want for a two-way design, it does not quite suit my purposes for a 3-way or n-way for n > 2.

The goal is to predict score from var1 and var2. The straightforward guess of what to do would be to simply have the AOV function aggregate across trials:

    aov(score ~ var1*var2 + Error(id/(var1*var2)), data=df)

(or lm with defined contrasts)

...however, there are missing data on some trials for some people, which makes this design unbalanced (i.e., it introduces a correlation between var1 and var2). Because my colleague knows (from a theoretical standpoint) that he wants to analyze the mean, his ANOVA on the aggregated trial means WOULD be balanced, which is to say, the analysis he wants to run would produce different output from the above.

So, what he needs is a data frame with four variables instead of five: var1, var2, id, and mscore (mean score), which has been averaged across trials.
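To make the balance point concrete, here is a minimal sketch, not part of the original message. It rebuilds the example data, removes some scores at random (the seed and the count of 300 missing values are arbitrary assumptions for illustration), and compares the number of usable trials per cell with the number of cell means.

```r
# Sketch only: the seed and the 300 missing values are illustrative assumptions.
set.seed(1)
df <- data.frame(
  var1  = factor(rep(rep(1:5, 25 * 5), 10)),
  var2  = factor(rep(rep(1:5, each = 25 * 5), 10)),
  trial = rep(rep(1:25, 25), 10),
  id    = factor(rep(1:10, each = 5 * 5 * 25)),
  score = rnorm(n = 5 * 5 * 25 * 10)
)
df$score[sample(nrow(df), 300)] <- NA  # hypothetical missing trials

# Count the non-missing scores in each var1 x var2 x id cell.
n_obs <- tapply(!is.na(df$score), list(df$var1, df$var2, df$id), sum)
range(n_obs)  # the counts differ across cells: unbalanced at the trial level

# One mean per cell: every cell contributes exactly one value.
cell_means <- tapply(df$score, list(df$var1, df$var2, df$id), mean, na.rm = TRUE)
length(cell_means)
```

The trial-level data have unequal cell sizes, while the aggregated means give one value per var1 x var2 x id combination, which is the balanced layout described above.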
Clearly (to me, it seems), the way to do this is with tapply:

    x <- tapply(df$score, list(df$var1,df$var2,df$id), mean, na.rm=TRUE)

...which returns a var1*var2 matrix for each ID, when what I want is an observation-per-row data frame.

So, my question: How do I end up with what I'm looking for?

My current process involves setting df2 <- data.frame(mscore=c(x), ...) where ... is a bunch of factor(rep) columns that would specify the var1, var2, and id levels. My problem with this approach is that it seems like a hack; it is not a general solution because I must use knowledge of the process by which x was generated in order to "get it right," and there's a decent amount of room for unnoticed error on my part.

I suppose what I'm looking for is either a way to take by or tapply and have it return a set of index variable columns based on the list of indices I provide to it...or a way to collapse an n-way table into a single data frame with index variables. Any suggestions?

Cordially,

Adam D. I. Kramer
Ph.D. Candidate, Social Psychology
University of Oregon
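One possible way to collapse the n-way array without hand-built rep() columns is sketched below. It is not from the original message; it assumes the tapply result carries dimnames (which it does when the indices are factors) and relies on base R's as.table() and the data-frame method for tables. Naming the index list is an addition made here so the output columns get readable names, and the seed is arbitrary.

```r
# Sketch only: collapse a 3-way tapply() result into one row per cell.
set.seed(2)  # arbitrary seed so the sketch is repeatable
df <- data.frame(
  var1  = factor(rep(rep(1:5, 25 * 5), 10)),
  var2  = factor(rep(rep(1:5, each = 25 * 5), 10)),
  trial = rep(rep(1:25, 25), 10),
  id    = factor(rep(1:10, each = 5 * 5 * 25)),
  score = rnorm(n = 5 * 5 * 25 * 10)
)

# Naming the list elements names the array dimensions.
x <- tapply(df$score,
            list(var1 = df$var1, var2 = df$var2, id = df$id),
            mean, na.rm = TRUE)

# The table method builds the index columns from the dimnames,
# so no knowledge of how x was laid out is needed.
df2 <- as.data.frame(as.table(x), responseName = "mscore")
str(df2)
```

Because the index columns come from the array's own dimnames, the same two lines should work for any number of indices.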
jim holtman
2008-May-03 05:20 UTC
[R] Coercing by/tapply to data.frame for more than two indices?
?aggregate

> aggregate(df$score, list(df$var1, df$var2, df$id), mean, na.rm=TRUE)
   Group.1 Group.2 Group.3             x
1        1       1       1  0.1053576980
2        2       1       1  0.1514888520
3        3       1       1  0.1270477403
4        4       1       1 -0.0193129404
5        5       1       1  0.2574346931
6        1       2       1  0.0185013523
7        2       2       1 -0.0886420632
8        3       2       1 -0.1304342272
9        4       2       1 -0.0972963702
10       5       2       1 -0.1463502593

--
Jim Holtman
Cincinnati, OH
+1 513 646 9390

What is the problem you are trying to solve?
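As a follow-on sketch, not part of the reply itself: naming the elements of the grouping list replaces the default Group.1, Group.2, Group.3 column names, and the result can then be passed to the aov call from the original question. The seed, the three missing scores, and the rename of the x column to mscore are assumptions made here for illustration.

```r
# Sketch only: aggregate() with named groups, then the ANOVA on cell means.
set.seed(3)  # arbitrary seed so the sketch is repeatable
df <- data.frame(
  var1  = factor(rep(rep(1:5, 25 * 5), 10)),
  var2  = factor(rep(rep(1:5, each = 25 * 5), 10)),
  trial = rep(rep(1:25, 25), 10),
  id    = factor(rep(1:10, each = 5 * 5 * 25)),
  score = rnorm(n = 5 * 5 * 25 * 10)
)
df$score[c(3, 40, 700)] <- NA  # hypothetical missing trials

# Named list elements become the grouping column names.
agg <- aggregate(df$score,
                 list(var1 = df$var1, var2 = df$var2, id = df$id),
                 mean, na.rm = TRUE)
names(agg)[names(agg) == "x"] <- "mscore"  # the value column defaults to "x"
head(agg)

# Same model as in the question, now fitted to one mean per cell.
fit <- aov(mscore ~ var1 * var2 + Error(id / (var1 * var2)), data = agg)
summary(fit)
```

With one row per var1 x var2 x id combination, each id contributes exactly one mean to every cell, so the aggregated design stays balanced even though some trials are missing.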
Adam D. I. Kramer
2008-May-03 20:35 UTC
[R] Coercing by/tapply to data.frame for more than two indices?