Mark Seeto
2010-Aug-10 01:24 UTC
[R] Multiple imputation, especially in rms/Hmisc packages
Hello, I have a general question about combining imputations as well as a question specific to the rms and Hmisc packages. The situation is multiple regression on a data set where multiple imputation has been used to give M imputed data sets. I know how to get the combined estimate of the covariance matrix of the estimated coefficients (average the M covariance matrices from the individual imputations and add (M+1)/M times the between-imputation covariance matrix), and I know how to use this to get p-values and confidence intervals for the individual coefficients. What I don't know is how to combine imputations to obtain the multiple degree-of-freedom tests to test whether a set of coefficients are all 0. If there were no missing values, this would be done using the "general linear test" which involves fitting the full model and the reduced model and getting an F statistic based on the residual sums of squares and degrees of freedom. How does one correctly do this when there are multiple imputations? In the language of Hmisc, I'm asking how the values in anova(fmi) are calculated if fmi is the result of fit.mult.impute (I've looked at the anova.rms code but couldn't understand it). The second question I have relates to rms/Hmisc specifically. When I use fit.mult.impute, the fitted values don't appear to match the values given by the predict function on the original data. Instead, they match the fitted values of the last imputation. For example, library(rms) set.seed(1) x1 <- rnorm(40) x2 <- 0.5*x1 + rnorm(40,0,3) y <- x1^2 + x2 + rnorm(40,0,3) x2[40] <- NA # create 1 missing value in x2 a <- aregImpute(~ y + x1 + x2, n.impute=2, nk=3) # 2 imputations x2.imp1 <- x2; x2.imp1[40] <- a$imputed$x2[,1] # first imputed x2 vector x2.imp2 <- x2; x2.imp2[40] <- a$imputed$x2[,2] # second imputed x2 vector ols.imp1 <- ols(y ~ rcs(x1,3) + x2.imp1) # model on imputation 1 ols.imp2 <- ols(y ~ rcs(x1,3) + x2.imp2) # model on imputation 2 d <- data.frame(y, x1, x2) fmi <- fit.mult.impute(y ~ rcs(x1,3) + x2, ols, a, data=d) fmi$fitted predict(fmi, d) I would have expected the results of fmi$fitted and predict(fmi,d) to be the same except for the final NA, but they are not. Instead, fmi$fitted is the same as ols.imp2$fitted. Also, anova(ols.imp2) anova(fmi) unexpectedly give the same result in the ERROR line. I would be most grateful if anyone could explain this to me. Thanks, Mark -- Mark Seeto Statistician National Acoustic Laboratories <http://www.nal.gov.au/> A Division of Australian Hearing
Frank Harrell
2010-Aug-10 01:37 UTC
[R] Multiple imputation, especially in rms/Hmisc packages
On Mon, 9 Aug 2010, Mark Seeto wrote:> Hello, I have a general question about combining imputations as well as a > question specific to the rms and Hmisc packages. > > The situation is multiple regression on a data set where multiple > imputation has been used to give M imputed data sets. I know how to get > the combined estimate of the covariance matrix of the estimated > coefficients (average the M covariance matrices from the individual > imputations and add (M+1)/M times the between-imputation covariance > matrix), and I know how to use this to get p-values and confidence > intervals for the individual coefficients. > > What I don't know is how to combine imputations to obtain the multiple > degree-of-freedom tests to test whether a set of coefficients are all 0. > If there were no missing values, this would be done using the "general > linear test" which involves fitting the full model and the reduced > model and getting an F statistic based on the residual sums of squares and > degrees of freedom. How does one correctly do this when there are multiple > imputations? In the language of Hmisc, I'm asking how the values in > anova(fmi) are calculated if fmi is the result of fit.mult.impute (I've > looked at the anova.rms code but couldn't understand it).In ordinary regression, the "leave some variables out and compare R^2"-based F-test and the quicker contrast tests are equivalent. In general we use the latter, which is expedient since it is easy to estimate the imputation-corrected covariance matrix. In the rms and Hmisc packages, fit.mult.impute creates the needed fit object, then anova, contrast, summary, etc. give the correct Wald tests. >> The second question I have relates to rms/Hmisc specifically. When I use > fit.mult.impute, the fitted values don't appear to match the values given > by the predict function on the original data. Instead, they match the > fitted values of the last imputation. For example, > > library(rms) > set.seed(1) > x1 <- rnorm(40) > x2 <- 0.5*x1 + rnorm(40,0,3) > > y <- x1^2 + x2 + rnorm(40,0,3) > x2[40] <- NA # create 1 missing value in x2 > a <- aregImpute(~ y + x1 + x2, n.impute=2, nk=3) # 2 imputations > > x2.imp1 <- x2; x2.imp1[40] <- a$imputed$x2[,1] # first imputed x2 vector > x2.imp2 <- x2; x2.imp2[40] <- a$imputed$x2[,2] # second imputed x2 vector > > ols.imp1 <- ols(y ~ rcs(x1,3) + x2.imp1) # model on imputation 1 > ols.imp2 <- ols(y ~ rcs(x1,3) + x2.imp2) # model on imputation 2 > > d <- data.frame(y, x1, x2) > fmi <- fit.mult.impute(y ~ rcs(x1,3) + x2, ols, a, data=d) > > fmi$fitted > predict(fmi, d) > > I would have expected the results of fmi$fitted and predict(fmi,d) to be > the same except for the final NA, but they are not. Instead, fmi$fitted is > the same as ols.imp2$fitted. Also, > > anova(ols.imp2) > anova(fmi) > > unexpectedly give the same result in the ERROR line. I would be most > grateful if anyone could explain this to me.I need to add some more warnings. For many of the calculations, the last imputation is used. Frank> > Thanks, > Mark > -- > Mark Seeto > Statistician > > National Acoustic Laboratories <http://www.nal.gov.au/> > A Division of Australian Hearing > > ______________________________________________ > R-help at r-project.org mailing list > https://stat.ethz.ch/mailman/listinfo/r-help > PLEASE do read the posting guide http://www.R-project.org/posting-guide.html > and provide commented, minimal, self-contained, reproducible code. >