Nantachai Kantanantha
2006-Jul-30 04:33 UTC
[R] Question about data used to fit the mixed model
Hi everyone,

I would like to ask a question regarding the data used to fit a mixed model.

I wonder whether, for the response variable data used to fit the mixed model (either via "spm" or "lme"), we must have several observations per subject (i.e. Yij, i = 1,...,M, j = 1,...,ni) or whether it can be just one observation per subject (i.e. Yi, i = 1,...,M). Since we have to specify the groups for the random effect components, if we have only one observation per subject, then each group will have only one observation.

Thank you very much for your help.

Sincerely yours,
Nantachai
You can have one observation per subject with multiple subjects nested in a group. If you only have 1 observation per group, then there is no multilevel structure to your data. For example, 30 students in a classroom or 20 employees in an office division are appropriate data structures. On the other hand, 1 observation per school in each of 30 schools has no grouping structure.

If you look at some of the data in the mlmRev package or other data files in the nlme package and examine their structure, it might help you see exactly how the data should be laid out. Look at the egsingle or the star data in the mlmRev package to see examples of longitudinal models where each student has multiple test scores. In egsingle, each student is properly nested in a single school, whereas in the star data, students are crossed with teachers and schools.

Use str(star) to see the data structure. Or you can do something like head(star) to see the first 6 rows and how the data are laid out.

I hope this helps,
Harold
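To make the layout distinction concrete without installing anything, here is a minimal sketch on a made-up data frame. The data frame `d` and its columns `classroom`, `student`, and `score` are hypothetical and are not taken from mlmRev or nlme. It only illustrates how to check, with base R, whether a candidate grouping factor has more than one observation per level.

```r
# Hypothetical toy data: 3 classrooms, 4 students each, one score per student.
set.seed(42)
d <- data.frame(
  classroom = factor(rep(c("A", "B", "C"), each = 4)),
  student   = factor(seq_len(12)),
  score     = rnorm(12, mean = 70, sd = 10)
)

str(d)    # column types and factor levels
head(d)   # first 6 rows

# Observations per level of each candidate grouping factor
obs_per_classroom <- table(d$classroom)  # several rows per classroom
obs_per_student   <- table(d$student)    # exactly one row per student

obs_per_classroom
obs_per_student
```

In this toy layout, classroom is a usable grouping factor for a random effect, while student is not, because every student contributes a single row.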
On 7/29/06, Nantachai Kantanantha <kantanantha at hotmail.com> wrote:
> Hi everyone,
>
> I would like to ask a question regarding to the data used to fit the mixed
> model.
>
> I wonder that, for the response variable data used to fit the mixed model
> (either via "spm" or "lme"), we must have several observations per subject
> (i.e. Yij, i = 1,..,M, j = 1,.., ni) or it can be just one observation per
> subject (i.e. Yi, i = 1,...,M). Since we have to specify the groups for
> random effect components, if we have only one observation per subject, then
> each group will have only one observation.

As Harold Doran mentioned in his earlier reply, if you only have one observation in each group you can't estimate the parameters in a mixed model, because the random effect for a group is completely confounded with the per-observation noise term for the observation.

The model would be of the form y = X\beta + Z b + \epsilon, for which you would estimate the variance of the components of b and the variance of the components of \epsilon. However, with only one observation per group the number of components in b and in \epsilon would be the same and, by suitably reordering the observations, the matrix Z could be made to be an identity matrix. Thus the model reduces to y = X\beta + (b + \epsilon), and the elements of b are confounded with those of \epsilon.

A different version of this question is to ask whether some of the groups can have only a single observation while others have more than one observation. The answer to that is a qualified "yes".

An example of data with different numbers of observations per group is the star data that Harold mentioned. The "student" identifier in this data set is named "id".
If we table the number of observations per student, then table that result, we get a table of the number of students with 1, 2, 3 or 4 observations.

> data("star", package = 'mlmRev')
> table(table(star$id))

   1    2    3    4
4314 2455 1744 3085
> length(unique(star$id))
[1] 11598
> 4314/11598
[1] 0.3719607

This shows that more than a third of the students have data from only a single year. It is possible to include such students in a mixed model with a random effect for student. It is even possible to include such students in a mixed model with a random intercept and a random slope with respect to time for student. However, such students contribute very little information to the model fit, and the "estimates" (actually "predictors") of the random effects for such students are artificially small because they are confounded with the per-observation noise term.

So while it can be attractive, when designing an experiment or planning an observational study, to have many groups and few observations per group, such experiments or studies provide very sparse information. Using a mixed model on such data doesn't magically add information to the data. Mixed models are statistical models, not magic.
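The confounding argument for the case of exactly one observation per group can be checked numerically with a small simulation. This is only a sketch under assumed values: the number of groups `M`, the standard deviations `sigma_b` and `sigma_e`, and the intercept of 10 are arbitrary illustrative choices, not anything from the star data. It builds Z from the grouping factor with base R's model.matrix(), confirms that Z is an identity matrix, and shows that the data only carry information about the sum of the two variances.

```r
# Simulated data with ONE observation per group (all values are illustrative).
set.seed(1)
M       <- 500   # number of groups = number of observations
sigma_b <- 2     # assumed SD of the random effects b
sigma_e <- 1     # assumed SD of the per-observation noise epsilon

g <- factor(seq_len(M))

# Every group has exactly one observation
groups_by_size <- table(table(g))
groups_by_size

# Indicator matrix for the random intercepts: one column per group
Z <- model.matrix(~ 0 + g)
z_is_identity <- all(Z == diag(M))
z_is_identity

b <- rnorm(M, sd = sigma_b)
e <- rnorm(M, sd = sigma_e)
y <- 10 + as.vector(Z %*% b) + e   # y = X beta + Z b + epsilon, intercept only

# Because Z is the identity, y = 10 + (b + e): the sample variance of y
# estimates sigma_b^2 + sigma_e^2, and nothing in y separates the two terms.
total_var <- var(y)
total_var
sigma_b^2 + sigma_e^2
```

Any other pair of variances with the same sum would give data with the same distribution, which is why the two components cannot be estimated separately in this layout.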