Ulrich Keller
2007-Jul-17 12:58 UTC
[R] Multiple imputation with plausible values already in the data
Hello,

this is not really an R-related question, but since the posting guide does not forbid asking non-R questions (and even encourages it to some degree), I thought I'd give it a try.

I am currently doing some secondary analyses of the PISA (http://pisa.oecd.org) student data. I would like to treat missing values properly, that is, using multiple imputation (with the mix package). But I am not sure how to do the imputation, since the data set provided by the OECD already contains variables with plausible values.

Roughly, the situation is like this: for each of the cognitive (achievement) scales, there are five variables holding plausible values. So, for example, there is not one variable for math achievement, but five, pv1math through pv5math. There are, of course, no missing values on these variables. Most other variables show some degree of missing data. For example, some students did not report their parents' occupation, so there is no information about their socio-economic background (HISEI). This is the kind of data I want to impute.

My first thought was to split the data into five data sets, each holding only one of the plausible value variables per scale, but all of the "normal" variables. So e.g. the first data set would include pv1math, pv1read, HISEI, and gender, while the second would include pv2math, pv2read, HISEI, and gender. I would run mix on the five data sets independently and end up with five imputed data sets with no missing values.

But is this a valid approach? There would actually be two imputation runs per data set: one for the plausible values on the achievement scales (done by the OECD under an unknown model), and one for the other variables (done by me with mix). The second run would use data from the first. Would this not lead to an overestimation of the imputation variance? What alternative approaches are there?

Thank you in advance for your answers,

Uli
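
P.S. To make the splitting idea concrete, here is a minimal sketch. It is written in Python purely for illustration; the function names, the toy records, and all values are made up and are not part of the PISA data or the mix package. It builds the five data sets described above and shows how estimates from the five analyses would usually be combined with Rubin's rules, where the between-data-set term is the imputation variance my question is about. The imputation step itself is not shown.

```python
from statistics import mean, variance


def split_by_plausible_value(rows, scales=("math", "read"),
                             covariates=("HISEI", "gender"), n_pv=5):
    """Return n_pv data sets; data set k keeps only the k-th plausible
    value of each scale, plus all of the ordinary covariates."""
    datasets = []
    for k in range(1, n_pv + 1):
        keep = [f"pv{k}{scale}" for scale in scales] + list(covariates)
        datasets.append([{col: row[col] for col in keep} for row in rows])
    return datasets


def pool_rubin(estimates, squared_ses):
    """Combine m point estimates and their squared standard errors
    using Rubin's rules: total = within + (1 + 1/m) * between."""
    m = len(estimates)
    within = mean(squared_ses)      # average sampling variance
    between = variance(estimates)   # variance across data sets (divisor m - 1)
    total = within + (1 + 1 / m) * between
    return mean(estimates), total


def toy_student(base, hisei, gender):
    """Hypothetical record with five made-up plausible values per scale."""
    row = {"HISEI": hisei, "gender": gender}
    for k in range(1, 6):
        row[f"pv{k}math"] = base + k
        row[f"pv{k}read"] = base - k
    return row


# None marks a missing HISEI value that would still have to be imputed.
students = [toy_student(500, 45, "f"), toy_student(480, None, "m")]
datasets = split_by_plausible_value(students)

# After imputing and analysing each data set, the five estimates and
# their squared standard errors would be pooled like this (made-up numbers):
pooled_estimate, total_variance = pool_rubin([1.0, 2.0, 3.0], [0.5, 0.5, 0.5])
```

Each element of `datasets` is the input for one independent imputation run, so any extra variability introduced by that run shows up in the `between` term of `pool_rubin`.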