Hi,

I am trying to impute missing values for my data.frame. As I intend to use the complete data for prediction, I am currently measuring the success of an imputation method by its resulting classification error on my training data.

I have tried several approaches to replace missing values:
- mean/median substitution
- substitution by a value selected at random from the observed values of a variable
- MLE in the mix package
- all available methods for numerical data in the MICE package (i.e. pmm, sample, mean and norm)

I found that the least classification error results from using mice with the "mean" option for numerical data. However, I am not sure how the "mean" multiple imputation differs from simple mean substitution. I tried to read some of the documentation supporting the R package, but couldn't find much theory about the "mean" imputation method.

Are there any good papers to explain the background behind each imputation option in MICE?

I would really appreciate any comments on the above, as my understanding of statistics is very limited.

Many thanks,
Eleni Rapsomaniki
Birkbeck College, UK
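[For reference, the first two substitution approaches in the list above can be sketched in a few lines. This is a Python illustration of the idea only, not the R code actually used; the array `x` and the function names are invented for the example.]

```python
import numpy as np

def mean_substitute(x):
    """Replace every NaN with the mean of the observed values (mean substitution)."""
    x = x.copy()
    observed = x[~np.isnan(x)]
    x[np.isnan(x)] = observed.mean()
    return x

def sample_substitute(x, rng):
    """Replace every NaN with a value drawn at random from the observed values."""
    x = x.copy()
    observed = x[~np.isnan(x)]
    missing = np.isnan(x)
    x[missing] = rng.choice(observed, size=missing.sum(), replace=True)
    return x

x = np.array([1.0, 2.0, np.nan, 4.0, np.nan])
rng = np.random.default_rng(0)
print(mean_substitute(x))        # NaNs replaced by mean(1, 2, 4) = 7/3
print(sample_substitute(x, rng)) # NaNs replaced by random draws from {1, 2, 4}
```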

You might find something useful at this web site:
http://www.multiple-imputation.com/

On 25/09/06, Eleni Rapsomaniki <e.rapsomaniki at mail.cryst.bbk.ac.uk> wrote:
> [...]

______________________________________________
R-help at stat.math.ethz.ch mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.

--
================================
David Barron
Said Business School
University of Oxford
Park End Street
Oxford OX1 1HP

On 25-Sep-06 Eleni Rapsomaniki wrote:
> [...]
> Are there any good papers to explain the background behind each
> imputation option in MICE?

I agree that the MICE documentation tends to be silent about some important questions, both in the R/S "help" pages and in the MICE user's manual, which can be found at
http://web.inter.nl.net/users/S.van.Buuren/mi/docs/Manual.pdf

Possibly it could be worth looking at some of the "other relevant reports" listed at
http://web.inter.nl.net/users/S.van.Buuren/mi/hmtl/mice.htm
but they do not look very hopeful.

That being said, my understanding relating to your query is (glossing over the technicalities of the Gibbs sampling methods used in (b)):

a) mean/median substitution is the very basic method of substituting, for a missing value, the arithmetic mean (or median) of the non-missing values for that variable, possibly with selection of cases with non-missing values so as to approximately match the observed covariates of the case being imputed.
b) "mean" imputation in MICE (as far as I can infer it) means that the distribution of the missing value (conditional on its observed covariates) is inferred from the cases with non-missing values, and the mean of this conditional distribution is substituted for the missing value.

These two approaches will in general give different results.

Some further comments:

1. I would suggest that you consider the full multiple imputation approach. Filling in missing values just once, and then using the completed results (for prediction, in your case) in some procedure which treats them as though they were observed values, will not take into account the uncertainty as to what values they should have (as opposed to the values they were imputed to have). When multiple imputation is used, the variation from imputation to imputation in the imputed values will represent this uncertainty, and so a more realistic picture of the overall uncertainty of prediction can be obtained.

2. You stated that one method tried was "MLE in the mix package". MLE (maximum likelihood estimation) using the EM algorithm is implemented in the mix functions em.mix and ecm.mix, but neither of these produces values to substitute for missing data. The result is essentially just parameter estimation by MLE based on the incomplete data. Values to substitute for missing data are produced by other functions, such as imp.mix; but these are randomly sampled from the conditional distributions of the missing values and therefore, each time it is done, the results are different: any single imputed value is a random draw. Hence the values you impute will be more or less good, in terms of your "training set", depending on the "luck of the draw" when you use (say) imp.mix.
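[The contrast between (a) and (b) above can be made concrete with a small sketch — Python rather than R, purely to illustrate the idea; the toy data and function names are invented for the example. Unconditional mean substitution ignores the covariate, while conditional-mean imputation regresses the incomplete variable on its observed covariate and fills in the fitted value.]

```python
import numpy as np

def impute_unconditional_mean(y):
    """(a) Replace each missing y with the plain mean of the observed y values."""
    y = y.copy()
    obs = ~np.isnan(y)
    y[~obs] = y[obs].mean()
    return y

def impute_conditional_mean(y, x):
    """(b) Fit y = b0 + b1*x on complete cases by least squares, then replace
    each missing y with its fitted (conditional-mean) value given x."""
    y = y.copy()
    obs = ~np.isnan(y)
    b1, b0 = np.polyfit(x[obs], y[obs], 1)  # slope, intercept
    y[~obs] = b0 + b1 * x[~obs]
    return y

# Toy data: y = 2*x exactly, with y missing at x = 5
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.0, 4.0, 6.0, 8.0, np.nan])

print(impute_unconditional_mean(y)[-1])  # mean of observed y (5.0), ignoring x
print(impute_conditional_mean(y, x)[-1]) # fitted value at x = 5 (close to 10)
```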
I don't know if I have understood what you meant by "MLE in the mix package", but if the above is a correct understanding then the remarks under (1) apply: in particular, as just noted, comparing a single imputation with your "training set" is an uncertain comparison.

Hoping this helps,
Ted.

--------------------------------------------------------------------
E-Mail: (Ted Harding) <Ted.Harding at nessie.mcc.ac.uk>
Fax-to-email: +44 (0)870 094 0861
Date: 25-Sep-06  Time: 15:33:59
------------------------------ XFMail ------------------------------
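[An aside on point (2) above: the "luck of the draw" in stochastic imputation can be seen in a toy sketch. This is Python with invented numbers, standing in for any routine, like imp.mix, that draws imputed values from a conditional distribution — here taken, for illustration only, to be Normal(10, 2).]

```python
import numpy as np

rng = np.random.default_rng(42)

# Five separate single imputations of the same missing value: each "run"
# commits to a different random draw, so results differ run to run.
single_draws = rng.normal(10.0, 2.0, size=5)
print(single_draws)

# Multiple imputation keeps several draws instead of committing to one;
# the spread across draws reflects the uncertainty about the missing value.
m = 200
draws = rng.normal(10.0, 2.0, size=m)
print(draws.mean(), draws.std())  # near the conditional mean 10 and sd 2
```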