Dear alltogether, I experience very strange behavior of imputation of NA's with the NORM library. I use R 2.2.0, win32. The code is below and the same dataset was also tried with MICE and aregImpute() from HMISC _without_ any problem. The problem is as follows: (1) using the whole dataset results in very strange imputations - values far beyond the maximum of the respective column, > 200%! and this is reproducible and true for the whole set of imputed NAs (2) using just part (i.e. columns) of the dataset results in the fact that some NAs are not imputed at all, i.e. NAs are still in the dataset - but there is neither a warning nor an error (3) data.augmentation with da.norm() fails, but not after the first step, mostly 3-5 steps are ok, then it stops (see below) The dataset is from educational research and should be almost normal distributed (slight deviations, but not really that heavy to explain the strange results). I don't understand this, because the dataset works well with MICE and aregImpute() and other statistics _and_ I checked the manpages and it does not seem that the calls are wrong. Thus, either it depends on the dataset (but why?) or it is maybe a bug. I appreciate every help, thanks, leo g??rtler <---snip---> library(norm) rngseed(1234) load(url("http://www.anicca-vijja.de/lg/dframe.Rdata")) # load object "dframe" dim(dframe) apply(dframe,2,function(x) sum(is.na(x))) # check how many NAs in the dataset #dframe <- subset(dframe,select=-c(alter,grpzugeh,is1,is4,is6,klassenstufe,mmit,vorai,vorap,voras,vorkf,vorsg,vorvb)) s1 <- prelim.norm(dframe) s1$nmis # re-check of NAs should be identical to above s2 <- prelim.norm(dframe[,1:32])# see below -> still NAs are available - _not_ imputed thetahat1 <- em.norm(s1) theta1 <- da.norm(s1,thetahat1,steps=20,showits=TRUE) # error: # Steps of Data Augmentation: # 1...2...3...4...5...6...7...8...Fehler: NA/NaN/Inf in externem Funktionsaufruf (arg 2) thetahat2 <- em.norm(s2) ( imputed1 <- imp.norm(s1,thetahat1,dframe) ) # very strange imputed values # almost >200% to big than expected ( imputed1.1 <- imp.norm(s1,theta1,dframe) ) # not possible - because da.norm gives no result! ( imputed2 <- imp.norm(s2,thetahat2,dframe) ) # still NAs in the matrix # visualize the strange values par(mfrow=c(2,1)) hist(dframe,prob=TRUE) # histogramm data set with NAs - original values lines(density(na.omit(dframe))) hist(imputed1,prob=TRUE) # histogramm of dataset with imputed values lines(density(imputed1)) </---snip--->
You really need to send such issues to the _package_ maintainer: please see the posting guide. He will need a completely reproducible example. On Wed, 9 Nov 2005, Leo G?rtler wrote:> Dear alltogether, > > I experience very strange behavior of imputation of NA's with the NORM > library. I use R 2.2.0, win32. > The code is below and the same dataset was also tried with MICE and > aregImpute() from HMISC _without_ any problem. > The problem is as follows: > > (1) using the whole dataset results in very strange imputations - values > far beyond the maximum of the respective column, > 200%! and this is > reproducible and true for the whole set of imputed NAs > (2) using just part (i.e. columns) of the dataset results in the fact > that some NAs are not imputed at all, i.e. NAs are still in the dataset > - but there is neither a warning nor an error > (3) data.augmentation with da.norm() fails, but not after the first > step, mostly 3-5 steps are ok, then it stops (see below) > > The dataset is from educational research and should be almost normal > distributed (slight deviations, but not really that heavy to explain the > strange results). > I don't understand this, because the dataset works well with MICE and > aregImpute() and other statistics _and_ I checked the manpages and it > does not seem that the calls are wrong. > Thus, either it depends on the dataset (but why?) or it is maybe a bug. > > I appreciate every help, > > thanks, > > leo g?rtler > > <---snip---> > > library(norm) > rngseed(1234) > load(url("http://www.anicca-vijja.de/lg/dframe.Rdata")) # load object > "dframe" > dim(dframe) > apply(dframe,2,function(x) sum(is.na(x))) # check how many NAs in the > dataset > #dframe <- > subset(dframe,select=-c(alter,grpzugeh,is1,is4,is6,klassenstufe,mmit,vorai,vorap,voras,vorkf,vorsg,vorvb)) > s1 <- prelim.norm(dframe) > s1$nmis # re-check of NAs should be identical to above > s2 <- prelim.norm(dframe[,1:32])# see below -> still NAs are available - > _not_ imputed > thetahat1 <- em.norm(s1) > theta1 <- da.norm(s1,thetahat1,steps=20,showits=TRUE) # error: > # Steps of Data > Augmentation: > # > 1...2...3...4...5...6...7...8...Fehler: NA/NaN/Inf in externem > Funktionsaufruf (arg 2) > thetahat2 <- em.norm(s2) > ( imputed1 <- imp.norm(s1,thetahat1,dframe) ) # very strange imputed > values > # almost >200% to big > than expected > ( imputed1.1 <- imp.norm(s1,theta1,dframe) ) # not possible - > because da.norm gives no result! > ( imputed2 <- imp.norm(s2,thetahat2,dframe) ) # still NAs in the matrix > > # visualize the strange values > par(mfrow=c(2,1)) > hist(dframe,prob=TRUE) # histogramm data set with NAs - original values > lines(density(na.omit(dframe))) > hist(imputed1,prob=TRUE) # histogramm of dataset with imputed values > lines(density(imputed1)) > > > </---snip---> > > ______________________________________________ > R-help at stat.math.ethz.ch mailing list > https://stat.ethz.ch/mailman/listinfo/r-help > PLEASE do read the posting guide! http://www.R-project.org/posting-guide.html >-- Brian D. Ripley, ripley at stats.ox.ac.uk Professor of Applied Statistics, http://www.stats.ox.ac.uk/~ripley/ University of Oxford, Tel: +44 1865 272861 (self) 1 South Parks Road, +44 1865 272866 (PA) Oxford OX1 3TG, UK Fax: +44 1865 272595
Folks, Leo G??rther and I have been privately discussing the problems with imputation using NORM which he originally described on 9 November. Essentially, he observed that many of the imputed missing values were totally absurd, being well out of any range comatible with the observed values of the variables. After following a few false trails, we have discovered the reason. People interested in using NORM (and CAT and MIX and maybe PAN) may well be interested in this reason! The dataset, which can be downloaded from his URL http://www.anicca-vijja.de/lg/dframe.Rdata consists of a matrix with 74 columns and 200 rows. There are 553 missing values out of the 14800 (less than 4%), and the distributions of the observed values of the variables are well-behaved. So this should not be a problematic dataset. 61 of the 74 columns have missing values (NAs) in them, and this is the reason why NORM fails. Specifically, the first few lines of the code of the function prelim.norm() are as follows: if (is.vector(x)) x <- matrix(x, length(x), 1) n <- nrow(x) p <- ncol(x) storage.mode(x) <- "double" r <- 1 * is.na(x) nmis <- as.integer(apply(r, 2, sum)) names(nmis) <- dimnames(x)[[2]] mdp <- as.integer((r %*% (2^((1:ncol(x)) - 1))) + 1) and, as can be seen from the last line, if there are missing values in a column with index > 31 then (r %*% (2^((1:ncol(x)) - 1))) + 1 >= 2^31 and then applying as.integer() to this value returns NA since as.integer only works for numbers no greater than .Machine$integer.max, normally 2^31 - 1. (Is the situation different for R on say 64-bit machines?) The value of mdp[i] is a "packed" binary encoding of the column positions of any NAs in row i: if bit j-1 (counting from 0) in the binary representation of mdp[i] is 1, then there is an NA in column j of row i. The vector mdp is used at various places in the NORM routines, and the effect on the imputations of having NAs in it, when the functioning of the routines depends on unpacking the encoding, is catastrophic. (Experiment had shown, indeed, that imputing with a subset of fewer than 32 columns always gave acceptable results). The upshot of this is that NORM cannot be used for multiple imputations if there are more than 31 columns in the data which have NAs in them. You could have more than 31 columns of data -- indeed Leo's 74 would have worked then -- if the columns are re-ordered so that all the columns with NAs are at the left, provided there are fewer than 32 with NAs. Unfortunately Leo has 61. There is in principle no necessity to represent NA positions in this way, but that is how Shafer did it and it was carried over into R. An alternative method would simply be to have a 0/1 matrix of NA indicators, but the code for the NORM functions would have to be picked through to replace the unpacking of mdp -- and this includes FORTRAN routines (Oh dear, echoes of the "open source and R" discussion)! So removing this limitation would not be trivial. I have not noticed mention of the limitation in the documentation of the NORM functions. Exactly the same construction of mdp, and therefore exactly the same problem, occurs in prelim.cat in CAT, for which I'm joint maintainer with Fernando Tusell, so we had better try to look into that! Any lessons we learn will be broadcast, so should be useful for NORM as well. And, for good measure, in MIX it occurs twice over in prelim.mix: once in constructing mdpz for the continuous variables, and once in mdpw for the categorical variables. This is perhaps less likely in practice to cause the problem in MIX, since it would arise only if either there were more than 31 columns of continuous variables with NAs, or more than 31 of categorical variables; so MIXers can spread their bets. Again, I have not noticed that the limitation is mentioned in the documentation of MIX; and I'm pretty sure it is not in the documentation of CAT! Any suggestions or guidance from people who are familiar with NORM and MIX will be most welcome. I should add that I have not looked into PAN, but would not be surprised if it were there as well. I've written this explanation in consultation with Leo G??rtler, and he has proposed that I should publish it to the R List; but please consider that it is a joint effort. Best wishes to all, Ted. -------------------------------------------------------------------- E-Mail: (Ted Harding) <Ted.Harding at nessie.mcc.ac.uk> Fax-to-email: +44 (0)870 094 0861 Date: 15-Nov-05 Time: 00:45:46 ------------------------------ XFMail ------------------------------