lifty.gere at gmx.de
2011-Aug-01  07:03 UTC
[R] Impact of multiple imputation on correlations
Dear all,

I have been attempting to use multiple imputation (MI) to handle missing data in my study. I use the mice package in R for this. The deeper I get into this process, the more I realize I first need to understand some basic concepts which I hope you can help me with.

For example, let us consider two arbitrary variables in my study that have the following missingness pattern:

Variable 1 available, Variable 2 available: 51 (of 118 observations, 43%)
Variable 1 available, Variable 2 missing: 37 (31.3%)
Variable 1 missing, Variable 2 available: 10 (8.4%)
Variable 1 missing, Variable 2 missing: 20 (16.9%)

I am interested in the correlation between Variable 1 and Variable 2.

Q1. Does it even make sense for me to use MI (or anything else, really) to replace my missing data when such large fractions are not available?

Plot 1 (http://imgur.com/KFV9y&CmV1sl) provides a scatter plot of these example variables in the original data. The correlation coefficient r = -0.34 and p = 0.016.

Q2. I notice that correlations between variables in imputed data (pooled estimates over all imputations) are much lower and less significant than the correlations in the original data. For this example, the pooled estimates for the imputed data show r = -0.11 and p = 0.22.

Since this seems to happen in all the variable combinations that I have looked at, I would like to know if MI is known to have this behavior, or whether this is specific to my imputation.

Q3. When going through the imputations, the distribution of the individual variables (min, max, mean, etc.) matches the original data. However, correlations and least-square line fits vary quite a bit from imputation to imputation (see Plot 2, http://imgur.com/KFV9yl&CmV1s). Is this normal?

Q4. Since my results differ (quite significantly) between the original and imputed data, which one should I trust?

Thank you for your help in advance.

Tina
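The missingness pattern in the question can be turned into per-variable missingness with a little arithmetic. The following is only a sketch in Python (standard library only); the counts are copied from the message, while the dictionary layout and the variable names are illustrative assumptions.

```python
# Counts copied from the missingness pattern in the question.
# Keys are (status of Variable 1, status of Variable 2).
pattern = {
    ("available", "available"): 51,
    ("available", "missing"): 37,
    ("missing", "available"): 10,
    ("missing", "missing"): 20,
}

n_total = sum(pattern.values())

# Cases usable for a correlation without any imputation (both observed).
n_complete = pattern[("available", "available")]

# Marginal missingness: add up every pattern in which the variable is missing.
missing_v1 = sum(c for (v1, _), c in pattern.items() if v1 == "missing")
missing_v2 = sum(c for (_, v2), c in pattern.items() if v2 == "missing")

print(f"total observations: {n_total}")
print(f"complete pairs:     {n_complete} ({n_complete / n_total:.1%})")
print(f"Variable 1 missing: {missing_v1} ({missing_v1 / n_total:.1%})")
print(f"Variable 2 missing: {missing_v2} ({missing_v2 / n_total:.1%})")
```

Seen this way, the correlation in the original data rests on the 51 complete pairs only, and Variable 2 is missing for 57 of the 118 observations.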
Hi Tina,

That is quite a bit of missingness, especially considering the sample size is not large to begin with. This would make me treat *any* result cautiously. That said, if you have a reasonable idea of the mechanism causing the missingness, or if you can model the missing data mechanism from additional variables in your study well enough that you are confident (for some definition of confident) that the missingness is random after accounting for your model (conditional independence, which Rubin calls MAR, missing at random), you are in a reasonable place to use MI and draw inferences from the results.

Even if you are uncertain about this, it is *not* any better to just say, "well, there was too much missing data for me to feel safe using MI, so here is the correlation based just on the observed data". That _will be biased_ unless the missing data mechanism is completely random (MCAR, that is, random even without conditioning on anything else in your study; for example, if participants flipped coins to decide which questions to respond to).

When averaging correlations, it is conventional to average the inverse hyperbolic tangent of the correlations and then use the hyperbolic tangent to transform the averaged value back to the original units (also known as Fisher's Z transformation). The mice package may do this automatically if there is a function to compute pooled correlations.

How results from simply deleting cases with any unobserved value compare with results from MI varies. There may be no difference, a small difference, or a large one.

Looking at the scatter plot matrix from the different imputations, I do not know that I would actually classify that as varying quite a bit. I realize the sign of the slope changes some, but that is not too surprising because all of them are somewhat close to flat. You can compare the between-imputation variance to the within-imputation variance (I think mice gives you this information).
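To make the averaging and the variance comparison concrete, here is a minimal sketch in Python (standard library only; in R the same transformations are atanh() and tanh()). It is not what mice does internally. It assumes the usual large-sample variance 1/(n - 3) for a Fisher-transformed correlation and combines the imputations with Rubin's rules. The function name pool_correlations and the five correlation values are hypothetical, not taken from Tina's data.

```python
import math

def pool_correlations(rs, n):
    """Pool correlations from m imputed data sets on Fisher's z scale.

    rs: one correlation per imputed data set (needs at least two)
    n:  number of rows in each completed data set
    """
    m = len(rs)
    zs = [math.atanh(r) for r in rs]       # Fisher z of each correlation
    z_bar = sum(zs) / m                    # pooled estimate on the z scale

    # Within-imputation variance: approximate sampling variance of z,
    # assumed identical for every completed data set.
    within = 1.0 / (n - 3)
    # Between-imputation variance: spread of the z values across imputations.
    between = sum((z - z_bar) ** 2 for z in zs) / (m - 1)
    # Rubin's rules: total variance of the pooled estimate.
    total = within + (1 + 1 / m) * between

    return {
        "r_pooled": math.tanh(z_bar),      # back-transform to a correlation
        "within": within,
        "between": between,
        "total": total,
        "between_to_within": between / within,
    }

# Hypothetical per-imputation correlations, for illustration only.
rs = [-0.25, -0.05, 0.02, -0.18, -0.10]
result = pool_correlations(rs, n=118)
for name, value in result.items():
    print(f"{name}: {value:.4f}")
```

A between-imputation variance that is large relative to the within-imputation variance indicates that much of the uncertainty in the pooled estimate comes from the missing data rather than from ordinary sampling error.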
I partly addressed your last question at the beginning: I would certainly not trust the correlation obtained simply by deleting cases with missing values, but I also would not trust the result obtained using MI unless it was well set up. Although you have shown us some of the data, you have not mentioned how you modelled the missingness. This can have a substantial impact on your results (and also their trustworthiness). mice provides a number of different models, and you have a choice of which variables to use if you collected a lot in your study.

Given all of this, I would suggest finding a local statistician or consultant to talk with about this. Your questions are more statistical than they are R related. Also, in addition to learning more about MI (there are several good books and articles on it that you can look up, or email me offlist and I can provide references if you want), someone who is there can be more helpful because they will have access to your whole dataset and can work with you to find the best variables/model to model the missing data mechanism.

I hope this helps and good luck,

Josh

--
Joshua Wiley
Ph.D. Student, Health Psychology
Programmer Analyst II, ATS Statistical Consulting Group
University of California, Los Angeles
https://joshuawiley.com/