thr3ads.net - R help - [R] Coercing by/tapply to data.frame for more than two indices? [May 2008]

Adam D. I. Kramer

2008-May-02 22:43 UTC

[R] Coercing by/tapply to data.frame for more than two indices?

Dear Colleagues,

Apologies for a long email to ask what I feel may be a very simple
question; I figure it's better to overspecify my situation.

I was asked a question, recently, by a colleague in my department
about pre-aggregating variables, i.e., computing the mean of defined subsets
of a data frame. Naturally, I thought of the 'by' and 'tapply' functions, as
they have always been the solution for me. However, my colleague had three
indices, and as such needs to pay attention to the indices of the
output...this is to say, the "create an array" function of tapply doesn't
quite work because an array is not quite what we want.

         Consider this data set:

df <- data.frame(var1= factor(rep(rep(1:5,25*5),10)),
                  var2= factor(rep(rep(1:5,each=25*5),10)),
                 trial= rep(rep(1:25,25),10),
                    id= factor(rep(1:10,each=5*5*25)),
                 score= rnorm(n=5*5*25*10) )

...this is to say, each of 10 ids has scores for 5 different levels of
var1 and 5 different levels of var2...across 25 trials. Basically, a
three-way crossed repeated measures design...where tapply does what I want
for a two-way design, it does not quite suit my purposes for a 3-way or
n-way for n > 2.

The goal is to predict score from var1 and var2. The straightforward guess
of what to do would be to simply have the AOV function aggregate across
trials:

aov(score ~ var1*var2 + Error(id/(var1*var2)), data=df)

(or lm with defined contrasts)

...however, there are missing data on some trials for some people, which
makes this design unbalanced (i.e., it introduces a correlation between var1
and var2). Because my colleague knows (from a theoretical standpoint) that
he wants to analyze the mean, his ANOVA on the aggregated trial means WOULD
be balanced, which is to say, the analysis he wants to run would produce
different output from the above.

So, what he needs is a data frame with four variables instead of five: var1,
var2, id, and mscore (mean score), which has been averaged across trials.

Clearly (to me, it seems), the way to do this is with tapply:

x <- tapply(df$score, list(df$var1,df$var2,df$id), mean, na.rm=TRUE)

...which returns a var1*var2 matrix for each ID, when what I want is an
observation-per-row data frame.

So, my question: How do I end up with what I'm looking for?

My current process involves setting df2 <- data.frame(mscore=c(x), ...)
where ... is a bunch of factor(rep) columns that would specify the var1 var2
and id levels. My problem with this approach is that it seems like a hack;
it is not a general solution because I must use knowledge of the process by
which x was generated in order to "get it right," and there's a decent
amount of room for unnoticed error on my part.

I suppose what I'm looking for is either a way to take by or tapply and have
it return a set of index variable columns based on the list of indices I
provide to it...or a way to collapse an n-way table into a single data frame
with index variables. Any suggestions?
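
One possible way to collapse an n-way tapply result into a data frame is sketched below. It relies on base R's as.table() and as.data.frame(), which build the index columns from the array's dimnames rather than from hand-written rep() calls. The small data set, the injected NA positions, and the names x and df2 are assumptions made for illustration; they stand in for the larger data frame described above.

```r
# Small stand-in for the data in the post (sizes reduced for illustration).
set.seed(1)
df <- expand.grid(trial = 1:4, var1 = factor(1:3), var2 = factor(1:2),
                  id = factor(1:5))
df$score <- rnorm(nrow(df))
df$score[c(3, 17, 40)] <- NA   # assumed: a few missing trials

# Naming the list elements gives the resulting array named dimnames.
x <- tapply(df$score, list(var1 = df$var1, var2 = df$var2, id = df$id),
            mean, na.rm = TRUE)

# Collapse the 3-way array to one row per cell. The index columns are
# taken from the dimnames, so they cannot drift out of step with the values.
df2 <- as.data.frame(as.table(x), responseName = "mscore")
str(df2)
```

Because the index columns are derived from the array itself, the same two lines should work unchanged for any number of grouping factors.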

Cordially,

Adam D. I. Kramer
Ph.D. Candidate, Social Psychology
University of Oregon

jim holtman

2008-May-03 05:20 UTC

[R] Coercing by/tapply to data.frame for more than two indices?

?aggregate
> aggregate(df$score, list(df$var1, df$var2, df$id), mean, na.rm=TRUE)
    Group.1 Group.2 Group.3             x
1         1       1       1  0.1053576980
2         2       1       1  0.1514888520
3         3       1       1  0.1270477403
4         4       1       1 -0.0193129404
5         5       1       1  0.2574346931
6         1       2       1  0.0185013523
7         2       2       1 -0.0886420632
8         3       2       1 -0.1304342272
9         4       2       1 -0.0972963702
10        5       2       1 -0.1463502593
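
A variant of the same idea, sketched under assumptions: the formula interface of aggregate() keeps the original column names instead of Group.1, Group.2, and so on, and the result can be passed to the aov() call from the question. The reduced data set, the NA positions, and the names agg and fit are illustrative, not taken from the thread.

```r
# Small stand-in for the data in the question (sizes reduced for illustration).
set.seed(1)
df <- expand.grid(trial = 1:4, var1 = factor(1:3), var2 = factor(1:2),
                  id = factor(1:5))
df$score <- rnorm(nrow(df))
df$score[c(3, 17, 40)] <- NA   # assumed: a few missing trials

# Formula interface: grouping columns keep their names, and the default
# na.action (na.omit) drops rows with a missing score before averaging.
agg <- aggregate(score ~ var1 + var2 + id, data = df, FUN = mean)
names(agg)[names(agg) == "score"] <- "mscore"

# Each id contributes exactly one mean per var1 x var2 cell.
table(agg$var1, agg$var2)

# The repeated-measures model from the question, fitted to the cell means.
fit <- aov(mscore ~ var1 * var2 + Error(id / (var1 * var2)), data = agg)
summary(fit)
```

One caveat: if every trial in a cell is missing, the formula interface drops that cell entirely, so the aggregated design would no longer be balanced. With the list interface shown above, naming the list elements (for example list(var1 = df$var1, ...)) also yields readable column names.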





-- 
Jim Holtman
Cincinnati, OH
+1 513 646 9390

What is the problem you are trying to solve?

