What R functionalities are there for missing-value imputation (with a substantial proportion of missing data)? I would prefer to use maximum likelihood methods; is the EM algorithm implemented? In which package?

Thanks,
Anne

Anne Piotet
Email: anne.piotet@m-td.com
M-TD Modelling and Technology Development
PSE-C, CH-1015 Lausanne, Switzerland
Anne Piotet wrote:

> What R functionalities are there to do missing values imputation
> (substantial proportion of missing data)? I would prefer to use
> maximum likelihood methods ; is the EM algorithm implemented? in
> which package?

The so-called ``EM algorithm'' is ***NOT*** an algorithm. It is a methodology or a unifying concept. It would be impossible to ``implement'' it. (Except possibly by means of some extremely advanced and sophisticated Artificial Intelligence software.)

cheers,

Rolf Turner
rolf at math.unb.ca
On 12-May-04 Anne wrote:

> What R functionalities are there to do missing values imputation
> (substantial proportion of missing data)?
> I would prefer to use maximum likelihood methods ; is the EM algorithm
> implemented? in which package?

Hi Anne,

R already has packages/libraries called "cat", "norm" and "mix" which, while they are not part of the standard installation, can be readily downloaded and installed from any CRAN website -- see under "contributed sources".

These implement in R Schafer's S code for what he calls "CAT", "NORM" and "MIX". They are for imputing missing data where the data are, respectively, entirely categorical, entirely continuous ("norm" operates on the basis that the data are a sample from a multivariate normal distribution), and a mixture of both (some variables categorical, some continuous). All include routines for multiple imputation, and for extracting appropriate information about the parameters from the imputations.

Schafer also has an S function "PAN" which imputes missing values from "panel" data. I don't think this has been implemented for R yet.

There is one type of data which also, I think, has nothing implemented for R (and I have not heard of a specially written routine for S-plus either). This is so-called "semi-continuous" data -- where the value of a variable may either be "continuous" or else take a specific value (typically zero). E.g. "How much did you spend on alcohol last week?" -- the answer may be a positive amount, maybe log-normally distributed, or else zero. You can approach data of this kind with missing values by combining "cat" and "norm", but it's tricky and may not correspond to a valid model.

All of Schafer's methods use maximum-likelihood estimation of the parameters for the first phase of the imputation, using the EM algorithm (and I'll respond to Rolf Turner's comments shortly).

After that, you can make a simple imputation by sampling from the distribution thus estimated. A more general and indeed sounder way is to first sample from the posterior distribution of the parameters, then sample imputed values from the resulting distribution, and then repeat both sampling steps to build up an array of datasets with the missing data filled in by multiple imputation.

Hoping this helps,
Ted.

E-Mail: (Ted Harding) <Ted.Harding at nessie.mcc.ac.uk>
Date: 12-May-04 Time: 17:44:50
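For reference, the two phases described above (EM for the maximum-likelihood estimate, then repeated parameter draws and imputations) might look roughly like this with the "norm" package. This is only a sketch under assumptions: the package is taken to be installed from CRAN, the data matrix `x` is simulated purely for illustration, and the seed, the number of imputations `m` and the number of data-augmentation steps are arbitrary choices, not recommendations.

```r
# Sketch only: assumes the contributed package "norm" is installed.
library(norm)

# Simulated data (an assumption for illustration): three continuous
# variables, with values deleted at random from columns "b" and "c".
set.seed(1)
n <- 200
x <- cbind(a = rnorm(n), b = rnorm(n), c = rnorm(n))
x[, "b"] <- x[, "b"] + 0.6 * x[, "a"]
x[sample(n, 60), "b"] <- NA
x[sample(n, 40), "c"] <- NA

# Phase 1: maximum-likelihood estimate of mean and covariance via EM.
s <- prelim.norm(x)                       # preprocess by missingness pattern
theta_ml <- em.norm(s, showits = FALSE)   # EM under multivariate normality
ml <- getparam.norm(s, theta_ml)          # list with components mu and sigma

# Phase 2: multiple imputation. Each round runs data augmentation to
# draw new parameter values, then imputes the missing entries given them.
rngseed(20040512)                         # must be set before da.norm/imp.norm
m <- 5
imputations <- vector("list", m)
theta <- theta_ml
for (i in seq_len(m)) {
  theta <- da.norm(s, theta, steps = 50)  # parameter draw after 50 DA steps
  imputations[[i]] <- imp.norm(s, theta, x)
}
```

Each element of `imputations` is a completed copy of `x`. Analysing the completed datasets and combining the results is a separate step; the package's mi.inference() is meant for the combining part.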
> From: Rolf Turner
>
> Anne Piotet wrote:
>
> > What R functionnalities are there to do missing values imputation
> > (substantial proportion of missing data)? I would prefer to use
> > maximum likelihood methods ; is the EM algorithm implemented? in
> > which package?
>
> The so-called ``EM algorithm'' is ***NOT*** an
> algorithm. It is a methodology or a unifying concept.
> It would be impossible to ``implement'' it. (Except
> possibly by means of some extremely advanced and
> sophisticated Artificial Intelligence software.)

Yes, but EM for missing value imputation is a bit narrower, I guess. At least the `norm' package on CRAN has em.norm() for multivariate Gaussian...

Andy

> cheers,
>
> Rolf Turner
> rolf at math.unb.ca
>
> ______________________________________________
> R-help at stat.math.ethz.ch mailing list
> https://www.stat.math.ethz.ch/mailman/listinfo/r-help
> PLEASE do read the posting guide!
> http://www.R-project.org/posting-guide.html
Thanks Brian.

The EM algorithm requires an ``E'' step and an ``M'' step. Harding and Rossini appear to be seriously suggesting that an R function could be written which would

(a) Perform the E step in arbitrary contexts, and

(b) For that given expected value, work out a procedure to effect its maximization.

Or maybe they're not serious.

For the M step (b), general numerical optimization would theoretically do the trick. (But would be fraught with peril.) For the E step (a), forget it.

The point is, the EM ``algorithm'' is NOT an algorithm which could be effected by an R function. This is in complete contrast with integrate() --- it's there; the code is written. Hand integrate() an integration problem, and it'll do it. One of the differences is that the input to an integration problem is clearly defined and readily specifiable as an R function. The input to a general missing values problem is amorphous.

Arguing about what constitutes an algorithm according to some abstract definition is mindless. If you define ``algorithm'' to suit yourself, then the EM algorithm is an algorithm; otherwise not.

The original questioner wanted an R function to effect the EM algorithm. My point was that this is a silly request because such a function would be impossible to write.

Call the EM algorithm an algorithm if it makes you happy. But remember that by doing so you'll mislead the naive inquirer who will expect there to be a real live implementation of that algorithm. In computer (R) code. Like integrate().

If you can write an R function to effect the EM ``algorithm'' --- in general, not just in a special case --- you'll win the Chambers Prize in computing and a few other things as well.

cheers,

Rolf Turner
rolf at math.unb.ca
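To make the "special case" distinction concrete: for one specific model, a multivariate normal sample with some entries missing, both steps have closed forms and fit in a few lines of base R. The sketch below is an illustration only. The function name `em_mvnorm`, its arguments and the simulated data are invented here, and it is not the code behind em.norm(). It assumes every row has at least one observed value. Note that both steps are derived by hand for this model, so nothing in it carries over to an arbitrary missing-data problem.

```r
# EM for the mean and covariance of a multivariate normal sample with NAs.
# Hypothetical helper for illustration; the E and M steps are specific
# to the multivariate normal model.
em_mvnorm <- function(x, max_iter = 500, tol = 1e-8) {
  x <- unname(as.matrix(x))
  stopifnot(rowSums(!is.na(x)) > 0)      # assumption: no fully missing rows
  n <- nrow(x); p <- ncol(x)
  mu <- colMeans(x, na.rm = TRUE)        # starting values from observed data
  sigma <- diag(apply(x, 2, var, na.rm = TRUE), p)
  for (iter in seq_len(max_iter)) {
    sum_x <- numeric(p)                  # expected sum of x_i
    sum_xx <- matrix(0, p, p)            # expected sum of x_i x_i'
    for (i in seq_len(n)) {
      xi <- x[i, ]
      miss <- is.na(xi)
      cond_cov <- matrix(0, p, p)
      if (any(miss)) {
        obs <- !miss
        # E step: regress the missing part on the observed part
        b <- sigma[miss, obs, drop = FALSE] %*%
          solve(sigma[obs, obs, drop = FALSE])
        xi[miss] <- mu[miss] + drop(b %*% (xi[obs] - mu[obs]))
        cond_cov[miss, miss] <- sigma[miss, miss, drop = FALSE] -
          b %*% sigma[obs, miss, drop = FALSE]
      }
      sum_x <- sum_x + xi
      sum_xx <- sum_xx + tcrossprod(xi) + cond_cov
    }
    # M step: ML estimates (divisor n) from the expected sufficient statistics
    mu_new <- sum_x / n
    sigma_new <- sum_xx / n - tcrossprod(mu_new)
    change <- max(abs(mu_new - mu), abs(sigma_new - sigma))
    mu <- mu_new
    sigma <- sigma_new
    if (change < tol) break
  }
  list(mu = mu, sigma = sigma, iterations = iter)
}

# Simulated example: y2 depends on y1 and has 30% of its values deleted.
set.seed(42)
n <- 300
y1 <- rnorm(n)
y2 <- 1 + 0.8 * y1 + rnorm(n, sd = 0.5)
y2[sample(n, 90)] <- NA
fit <- em_mvnorm(cbind(y1, y2))
fit$mu
fit$sigma
```

Changing the model (categorical data, a mixture, a regression with censoring) means rederiving both steps, which is why packages such as "norm" ship EM routines for particular models rather than one general-purpose function.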