On 25-Sep-06 Eleni Rapsomaniki wrote:
> Hi
>
> I am trying to impute missing values for my data.frame. As I
> intend to use the complete data for prediction I am currently
> measuring the success of an imputation method by its resulting
> classification error in my training data.
>
> I have tried several approaches to replace missing values:
> - mean/median substitution
> - substitution by a value selected from the observed values of
> a variable
> - MLE in the mix package
> - all available methods for numerical data in the MICE package
> (ie. pmm, sample, mean and norm)
>
> I found that the least classification error results using mice
> with the "mean" option for numerical data. However, I am not
> sure how the "mean" multiple imputation differs from the simple
> mean substitution. I tried to read some of the documentation
> supporting the R package, but couldn't find much theory about
> the "mean" imputation method.
>
> Are there any good papers to explain the background behind each
> imputation option in MICE?
I agree that the MICE documentation tends to be silent about some
important questions, both in the R/S "help" pages, and also in the
MICE user's manual which can be found at
http://web.inter.nl.net/users/S.van.Buuren/mi/docs/Manual.pdf
Possibly it could be worth looking at some of the "other relevant
reports" listed at
http://web.inter.nl.net/users/S.van.Buuren/mi/hmtl/mice.htm
but they do not look very hopeful.
That being said, my understanding relating to your query is as
follows (glossing over the technicalities of the Gibbs sampling
methods used in (b)):
a) mean/median substitution relates to the very basic method
of substituting, for a missing value, the arithmetic mean
of the non-missing values for that variable, possibly
with selection of cases with non-missing values so as to
approximately match the observed covariates of the case
being imputed.
b) "mean" imputation in MICE (as far as I can infer it) means
that the distribution of the missing value (conditional
on its observed covariates) is inferred from the cases
with non-missing values, and the mean of this conditional
distribution is substituted for the missing value.
These two approaches will in general give different results.
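As a rough illustration of the difference between (a) and (b), here is a small base-R sketch on simulated data. It is not what MICE does internally: the conditional mean in (b) is approximated by an ordinary linear regression on one covariate, and all names (x, y, miss, rmse, ...) are made up for the example.

```r
# Toy data: x is fully observed, y is hidden for some cases (illustrative only)
set.seed(1)
n <- 200
x <- rnorm(n)
y <- 2 + 3 * x + rnorm(n)
miss <- sample(n, 50)            # indices of cases whose y is treated as missing
y_obs <- y
y_obs[miss] <- NA

# (a) Unconditional mean substitution: every missing y gets the same value
y_a <- y_obs
y_a[miss] <- mean(y_obs, na.rm = TRUE)

# (b) Conditional mean: regress y on x using the complete cases, then
#     substitute the fitted value for each missing case
fit <- lm(y_obs ~ x)             # lm() drops rows with NA by default
y_b <- y_obs
y_b[miss] <- predict(fit, newdata = data.frame(x = x[miss]))

# Compare the imputed values with the true values that were hidden
rmse <- function(est) sqrt(mean((est[miss] - y[miss])^2))
c(unconditional = rmse(y_a), conditional = rmse(y_b))
```

When y depends strongly on x, as it does in this simulation, the conditional version tracks the hidden values much more closely, which may be part of why the "mean" option did well on your training data.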
Some further comments.
1. I would suggest that you consider the full multiple imputation
approach. Filling in missing values just once, and then using
the completed results (for prediction, in your case) in some
procedure which treats them as though they were observed values,
will not take into account the uncertainty as to what values
they should have (as opposed to the values they were imputed to have).
When multiple imputation is used, the variation from imputation
to imputation in the imputed values will represent this
uncertainty, and so a more realistic picture of the overall
uncertainty of prediction can be obtained.
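To make point 1 concrete, the following base-R sketch creates several completed data sets and combines the results with Rubin's rules. It is only a stand-in under simple assumptions: imputations are drawn from a linear regression refitted on a bootstrap sample of the complete cases, which is not the algorithm used by MICE or mix, and the quantity analysed (the mean of y) and all variable names are chosen purely for illustration.

```r
set.seed(2)
n <- 200
x <- rnorm(n)
y <- 2 + 3 * x + rnorm(n)
miss <- sample(n, 50)
y_obs <- y
y_obs[miss] <- NA
dat <- data.frame(x = x, y = y_obs)
cc <- dat[!is.na(dat$y), ]        # complete cases

m <- 20                           # number of imputations
est <- numeric(m)                 # estimate from each completed data set
within_var <- numeric(m)          # its estimated variance

for (i in seq_len(m)) {
  # Bootstrap the complete cases so the fitted parameters vary between imputations
  boot <- cc[sample(nrow(cc), replace = TRUE), ]
  fit <- lm(y ~ x, data = boot)
  # Imputed value = conditional mean + random residual noise
  mu <- predict(fit, newdata = dat[miss, , drop = FALSE])
  y_imp <- dat$y
  y_imp[miss] <- mu + rnorm(length(miss), sd = summary(fit)$sigma)
  # Analysis of the completed data: here simply the mean of y
  est[i] <- mean(y_imp)
  within_var[i] <- var(y_imp) / n
}

# Rubin's rules: pooled estimate, within- and between-imputation variance
Q <- mean(est)
W <- mean(within_var)
B <- var(est)
T_total <- W + (1 + 1 / m) * B

c(estimate = Q, se = sqrt(T_total), se_ignoring_imputation = sqrt(W))
```

The between-imputation component B is exactly the part that a single filled-in data set cannot show, so the standard error based on W alone understates the uncertainty.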
2. You stated that one method tried was "MLE in the mix package".
MLE (maximum likelihood estimation) using the EM algorithm is
implemented in the mix functions em.mix and ecm.mix, but neither
of these produces values to substitute for missing data. The
result is essentially just parameter estimation by MLE based
on the incomplete data.
Values to substitute for missing data are produced by other
functions, such as imp.mix; but these are randomly sampled
from the conditional distributions of the missing values and
therefore, each time it is done, the results are different.
In particular, the first value you sample will be random.
Hence the values you impute will be more or less good, in
terms of your "training set", depending on the "luck of the
draw" when you use (say) imp.mix.
I don't know if I have understood what you meant by "MLE in
the mix package", but if the above is a correct understanding
then the remarks under (1) apply: in particular, as just
noted, that comparing a single imputation with your "training
set" is an uncertain comparison.
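The "luck of the draw" can be seen in a small simulation. This base-R sketch does not call imp.mix; as an assumed stand-in it draws each missing value from a fitted normal regression, repeats the single imputation many times, and scores each attempt against the hidden true values. All names here are hypothetical.

```r
set.seed(3)
n <- 100
x <- rnorm(n)
y <- 1 + 2 * x + rnorm(n)
miss <- sample(n, 30)
y_obs <- replace(y, miss, NA)

# Fit once on the complete cases; keep conditional means and residual SD
fit <- lm(y_obs ~ x)
mu <- predict(fit, newdata = data.frame(x = x[miss]))
sigma <- summary(fit)$sigma

# One random-draw imputation, scored against the hidden true values
one_draw_rmse <- function() {
  draw <- mu + rnorm(length(miss), sd = sigma)
  sqrt(mean((draw - y[miss])^2))
}

# Repeat the single imputation many times and look at the spread of scores
scores <- replicate(1000, one_draw_rmse())
range(scores)
```

Any one of these scores could have been the figure obtained from a single run, so ranking imputation methods on one draw each is unreliable.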
Hoping this helps,
Ted.
--------------------------------------------------------------------
E-Mail: (Ted Harding) <Ted.Harding at nessie.mcc.ac.uk>
Fax-to-email: +44 (0)870 094 0861
Date: 25-Sep-06 Time: 15:33:59
------------------------------ XFMail ------------------------------