Emmanuel Charpentier
2009-Apr-22 22:31 UTC
[R] Multiple imputations: wicked dataset? Wicked computers? Am I cursed? (or stupid?)
Dear list,

I'd like to use multiple imputations to try and save a somewhat badly mangled dataset (lousy data collection, worse than lousy monitoring, you know the drill... especially when I am consulted for the first time about one year *after* data collection).

My dataset has 231 observations of 53 variables, of which only a very few have no missing data. Most variables have 5-10% of missing data, but the whole dataset has only 15% complete cases (40% when dropping the 3 worst variables, which might be regained by other means).

To try my hand at multiple imputation, I created a small extract of the original dataset:

> DataTest <- Data2[c("ctr","cadredp","persmob","trea","taller","tbox","nbpradio","nbpventil","sitrea")]
> dim(DataTest)
[1] 231   9
> sapply(DataTest, data.class)
      ctr   cadredp   persmob      trea    taller      tbox  nbpradio nbpventil
 "factor"  "factor" "numeric" "numeric" "numeric" "numeric" "numeric" "numeric"
   sitrea
"ordered"
> sapply(DataTest, function(x) sum(is.na(x)))
      ctr   cadredp   persmob      trea    taller      tbox  nbpradio nbpventil
        0         5        17        13         5        17        27        60
   sitrea
        0
> sum(complete.cases(DataTest))
[1] 147
> table(apply(is.na(DataTest), 1, sum))

  0   1   2   3   4   5
147  45  22  14   2   1
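Incidentally, a quick way to check how per-variable missingness compounds into complete-case loss is to compare the observed complete-case rate with what independent missingness would predict (a side computation, not from my actual session):

# Under missingness independent across columns, the expected fraction of
# complete cases is the product of the per-column completeness rates;
# the gap between this and the observed rate shows how structured the
# missingness pattern is.
expected <- prod(1 - colMeans(is.na(DataTest)))
observed <- mean(complete.cases(DataTest))
round(c(expected = expected, observed = observed), 3)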
Package "Amelia" (Gary King and, apparently, the Zelig gang) :> system.time(DataTest.amelia<-amelia(DataTest,m=5,+ ords=names(DataTest)[sapply(DataTest, + data.class)=="ordered"], + noms=names(DataTest)[sapply(DataTest, + data.class)=="factor"])) -- Imputation 1 -- 1 2 Erreur dans am.inv(a = g11) : le mineur dominant d'ordre 22 n'est pas d?fini positif De plus : Warning message: In amcheck(x = x, m = m, idvars = numopts$idvars, priors = priors, : The number of catagories in one of the variables marked nominal has greater than 10 categories. Check nominal specification. Timing stopped at: 0.048 0 0.048>Note : various repeats crashed at various points (one of tre trials started to simulate, and stopped at obs 177...), but the numerical accuracy seemed involved every time. The "textbook example" (example for amelia function) runs perfectly. Package "mice" (Stef van Buuren & Karin Groothuis-Oudshoorn) :> system.time(DataTest.mice<-mice(DataTest,m=5))iter imp variable 1 1 cadredpErreur dans mice.impute.logreg(c(1L, 2L, 2L, 1L, 1L, 2L, 2L, 1L, 2L, 2L, : dims [produit 28] ne correspond pas ? la longueur de l'objet [31] De plus : Warning message: In beta + rv %*% rnorm(ncol(rv)) : la taille d'un objet plus long n'est pas multiple de la taille d'un objet plus court Timing stopped at: 0.092 0 0.094>The "textbook example" (example for mice) runs almost perfectly (warnings in the second example) Package "mix" (Joseph Schafer, maintained by Brian Ripley) :> # mix needs dataset columns sorted and counted > p<-ncol(DTM<-DataTest[sapply(DataTest,is.factor)]) > DTM<-cbind(DTM,DataTest[!sapply(DataTest,is.factor)]) > # Preliminary grouping and sorting > system.time(s<-prelim.mix(DTM,p))user system elapsed 0.012 0.000 0.009 Warning messages: 1: In storage.mode(w) <- "integer" : NAs introduits lors de la conversion automatique 2: In max(w[!is.na(w[, j]), j]) : aucun argument pour max ; -Inf est renvoy? 3: NAs introduits lors de la conversion automatique 4: In max(w[!is.na(w[, j]), j]) : aucun argument pour max ; -Inf est renvoy? 5: NAs introduits lors de la conversion automatique 6: In max(w[!is.na(w[, j]), j]) : aucun argument pour max ; -Inf est renvoy? 7: NAs introduits lors de la conversion automatique> # Parameter estimation > system.time(thetahat<-em.mix(s,showits=TRUE))Erreur dans rep(prior, s$ncells) : argument 'times' incorrect Timing stopped at: 0 0 0 ### indeed : s$ncells is NA ! ### Therefore, the rest crashes :> # Data augmentation proper > system.time(newtheta <- da.mix(s, thetahat, steps=100, showits=TRUE))Erreur dans rep(prior, s$ncells) : argument 'times' incorrect Timing stopped at: 0 0 0> # Single imputation... > system.time(ximp1 <- imp.mix(s, newtheta))Erreur dans imp.mix(s, newtheta) : objet 'newtheta' introuvable Timing stopped at: 0.004 0 0.001>The "textbook example" (example for da.mix) runs perfectly (and fast !). ======================================================================== I might be particularly stupid and misunderstanding manual and the "textbook" examples of one or two packages, but five ! Visual/manual/graphical examination of my dataset does not suggest an explanation. I am not aware of anything "fishy" in my setup (one of the machines is a bit aging, but the one used for the reported tests is almost new. Could someone offer an explanation ? Or must I recourse to sacrifying a black goat at midnight next new moon ? Any hint will be appreciated (and by the way, next new moon is in about 3 days...). Emmanuel Charpentier
Emmanuel Charpentier
2009-Apr-27 19:49 UTC
[R] Multiple imputations: wicked dataset. Need advice for follow-up to a possible solution.
Answering myself (for future archive users' sake); more to come (soon).

On Thursday 23 April 2009 at 00:31 +0200, Emmanuel Charpentier wrote:

> Dear list,
>
> I'd like to use multiple imputations to try and save a somewhat badly
> mangled dataset (lousy data collection, worse than lousy monitoring, you
> know the drill... especially when I am consulted for the first time
> about one year *after* data collection).
>
> My dataset has 231 observations of 53 variables, of which only a very
> few have no missing data. Most variables have 5-10% of missing data, but
> the whole dataset has only 15% complete cases (40% when dropping the 3
> worst variables, which might be regained by other means).

[ Big snip... ]

It turns out that my problems were caused by... the dataset. Two very important variables (i.e. of strong influence on the outcomes and proxies) are ill-distributed:
- one is a modus operandi (two classes);
- the second is the center (23 classes, alas...).

My data are quite ill-distributed: some centers have contributed a large number of observations, others very few. Furthermore, while only a few variables are really badly known, the "missingness pattern" is such that:
- some centers have no directly usable information (= complete cases) under one of the modi operandi;
- some others have no complete case at all...

Therefore, any model-based prediction method using the whole dataset (as recommended for multiple imputation, since one should not use a richer set of data for inference than was used for imputation; I have seen this statement in a lot of references) fails miserably.

Remembering some fascinating readings (incl. V&R) and an early (20 years ago) excursion into AI (yes, did that, didn't even get the T-shirt...), I have attempted, with some success, to use recursive partitioning for prediction. This (non-)model has some very interesting advantages in my case:
- model-free;
- distribution-free (quite important here: you should see my density curves... and I won't even mention the outliers!);
- handles missing data gracefully (almost automagically);
- automatic selection and ranking of the pertinent variables;
- the current implementation in R has some very nice features, such as surrogate splits when a value is missing, auto-pruning by cross-validation, etc.

It also has some drawbacks:
- no (easy) method for inference;
- not easy to summarise (you can't just publish an ANOVA table and a couple of p-values...);
- not "well-established" (i.e. acknowledged by journal reviewers), hence difficult to publish.

These latter points do not bother me in my case, so I attempted to use this for imputation. Since mice is based on a "chained equations" approach and allows the end user to write his own imputation functions, I wrote a set of such imputers to be called within the framework of the Gibbs sampler. They proceed as follows:
- build a regression or classification tree for the relevant variable, using the rest of the dataset;
- predict the relevant variable for *all* of the dataset;
- compute "residuals" between the known values of the relevant variable and their predictions;
- impute missing data as prediction + a random residual.

It works. It is a tad slower than prediction using normal/logistic/multinomial modelling (about a factor of 3, but for my first trial I erred on the side of excessive precision ==> deeper trees), and it does not exhibit any "obvious" statistical misfeatures. A sketch of one of these imputers is given below.
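To give the flavour, here is a stripped-down sketch of such an imputer for a numeric variable (not the actual code, which does more checking; the residual-resampling details shown here are illustrative). It relies on mice's convention that a method string "rpart" dispatches to a user-supplied function named mice.impute.rpart(y, ry, x, ...):

library(rpart)

## y:  the variable under imputation
## ry: TRUE where y is observed, FALSE where it is missing
## x:  the other variables, used as predictors
mice.impute.rpart <- function(y, ry, x, ...) {
  dat <- data.frame(y = y, x)
  # grow the tree on the observed part of y only
  fit <- rpart(y ~ ., data = dat[ry, , drop = FALSE], method = "anova")
  # predict everywhere: surrogate splits cope with missing predictors
  pred <- predict(fit, newdata = dat)
  # "residuals" from the known values...
  res <- y[ry] - pred[ry]
  # ...and impute prediction + a randomly drawn residual
  pred[!ry] + sample(res, sum(!ry), replace = TRUE)
}

It is then requested the usual way, e.g. mice(DataTest, m = 5, method = rep("rpart", ncol(DataTest))). For a factor, the same skeleton applies with a classification tree, drawing the imputed level from the predicted class probabilities instead of adding a residual.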
But I have questions:

1) What is known about imputation by regression/classification trees (aka recursive partitioning)? A quick search didn't turn up very much: the idea has been evoked here and there, but I am not aware of any published solution. In particular, I know of no theoretical (i.e. probability) work on its properties.

2) Where could I find published datasets that have been used to validate other imputation methods?

3) Do you think that these functions should be published?

Sincerely yours,

Emmanuel Charpentier

PS:

> Could someone offer an explanation? Or must I resort to sacrificing a
> black goat at midnight at the next new moon?
>
> Any hint will be appreciated (and by the way, the next new moon is in
> about 3 days...).

The goat has endured the fear of her life, but is still alive... She will probably start worshipping the Moon, however.