Dear alltogether,
I experience very strange behavior of imputation of NA's with the NORM
library. I use R 2.2.0, win32.
The code is below and the same dataset was also tried with MICE and
aregImpute() from HMISC _without_ any problem.
The problem is as follows:
(1) using the whole dataset results in very strange imputations - values
far beyond the maximum of the respective column, > 200%! and this is
reproducible and true for the whole set of imputed NAs
(2) using just part (i.e. columns) of the dataset results in the fact
that some NAs are not imputed at all, i.e. NAs are still in the dataset
- but there is neither a warning nor an error
(3) data.augmentation with da.norm() fails, but not after the first
step, mostly 3-5 steps are ok, then it stops (see below)
The dataset is from educational research and should be almost normal
distributed (slight deviations, but not really that heavy to explain the
strange results).
I don't understand this, because the dataset works well with MICE and
aregImpute() and other statistics _and_ I checked the manpages and it
does not seem that the calls are wrong.
Thus, either it depends on the dataset (but why?) or it is maybe a bug.
I appreciate every help,
thanks,
leo g??rtler
<---snip--->
library(norm)
rngseed(1234)
load(url("http://www.anicca-vijja.de/lg/dframe.Rdata")) # load
object
"dframe"
dim(dframe)
apply(dframe,2,function(x) sum(is.na(x))) # check how many NAs in the
dataset
#dframe <-
subset(dframe,select=-c(alter,grpzugeh,is1,is4,is6,klassenstufe,mmit,vorai,vorap,voras,vorkf,vorsg,vorvb))
s1 <- prelim.norm(dframe)
s1$nmis # re-check of NAs should be identical to above
s2 <- prelim.norm(dframe[,1:32])# see below -> still NAs are available -
_not_ imputed
thetahat1 <- em.norm(s1)
theta1 <- da.norm(s1,thetahat1,steps=20,showits=TRUE) # error:
# Steps of Data
Augmentation:
#
1...2...3...4...5...6...7...8...Fehler: NA/NaN/Inf in externem
Funktionsaufruf (arg 2)
thetahat2 <- em.norm(s2)
( imputed1 <- imp.norm(s1,thetahat1,dframe) ) # very strange imputed
values
# almost >200% to big
than expected
( imputed1.1 <- imp.norm(s1,theta1,dframe) ) # not possible -
because da.norm gives no result!
( imputed2 <- imp.norm(s2,thetahat2,dframe) ) # still NAs in the matrix
# visualize the strange values
par(mfrow=c(2,1))
hist(dframe,prob=TRUE) # histogramm data set with NAs - original values
lines(density(na.omit(dframe)))
hist(imputed1,prob=TRUE) # histogramm of dataset with imputed values
lines(density(imputed1))
</---snip--->
You really need to send such issues to the _package_ maintainer: please see the posting guide. He will need a completely reproducible example. On Wed, 9 Nov 2005, Leo G?rtler wrote:> Dear alltogether, > > I experience very strange behavior of imputation of NA's with the NORM > library. I use R 2.2.0, win32. > The code is below and the same dataset was also tried with MICE and > aregImpute() from HMISC _without_ any problem. > The problem is as follows: > > (1) using the whole dataset results in very strange imputations - values > far beyond the maximum of the respective column, > 200%! and this is > reproducible and true for the whole set of imputed NAs > (2) using just part (i.e. columns) of the dataset results in the fact > that some NAs are not imputed at all, i.e. NAs are still in the dataset > - but there is neither a warning nor an error > (3) data.augmentation with da.norm() fails, but not after the first > step, mostly 3-5 steps are ok, then it stops (see below) > > The dataset is from educational research and should be almost normal > distributed (slight deviations, but not really that heavy to explain the > strange results). > I don't understand this, because the dataset works well with MICE and > aregImpute() and other statistics _and_ I checked the manpages and it > does not seem that the calls are wrong. > Thus, either it depends on the dataset (but why?) or it is maybe a bug. > > I appreciate every help, > > thanks, > > leo g?rtler > > <---snip---> > > library(norm) > rngseed(1234) > load(url("http://www.anicca-vijja.de/lg/dframe.Rdata")) # load object > "dframe" > dim(dframe) > apply(dframe,2,function(x) sum(is.na(x))) # check how many NAs in the > dataset > #dframe <- > subset(dframe,select=-c(alter,grpzugeh,is1,is4,is6,klassenstufe,mmit,vorai,vorap,voras,vorkf,vorsg,vorvb)) > s1 <- prelim.norm(dframe) > s1$nmis # re-check of NAs should be identical to above > s2 <- prelim.norm(dframe[,1:32])# see below -> still NAs are available - > _not_ imputed > thetahat1 <- em.norm(s1) > theta1 <- da.norm(s1,thetahat1,steps=20,showits=TRUE) # error: > # Steps of Data > Augmentation: > # > 1...2...3...4...5...6...7...8...Fehler: NA/NaN/Inf in externem > Funktionsaufruf (arg 2) > thetahat2 <- em.norm(s2) > ( imputed1 <- imp.norm(s1,thetahat1,dframe) ) # very strange imputed > values > # almost >200% to big > than expected > ( imputed1.1 <- imp.norm(s1,theta1,dframe) ) # not possible - > because da.norm gives no result! > ( imputed2 <- imp.norm(s2,thetahat2,dframe) ) # still NAs in the matrix > > # visualize the strange values > par(mfrow=c(2,1)) > hist(dframe,prob=TRUE) # histogramm data set with NAs - original values > lines(density(na.omit(dframe))) > hist(imputed1,prob=TRUE) # histogramm of dataset with imputed values > lines(density(imputed1)) > > > </---snip---> > > ______________________________________________ > R-help at stat.math.ethz.ch mailing list > https://stat.ethz.ch/mailman/listinfo/r-help > PLEASE do read the posting guide! http://www.R-project.org/posting-guide.html >-- Brian D. Ripley, ripley at stats.ox.ac.uk Professor of Applied Statistics, http://www.stats.ox.ac.uk/~ripley/ University of Oxford, Tel: +44 1865 272861 (self) 1 South Parks Road, +44 1865 272866 (PA) Oxford OX1 3TG, UK Fax: +44 1865 272595
Folks,
Leo G??rther and I have been privately discussing the
problems with imputation using NORM which he originally
described on 9 November. Essentially, he observed that
many of the imputed missing values were totally absurd,
being well out of any range comatible with the observed
values of the variables.
After following a few false trails, we have discovered
the reason. People interested in using NORM (and CAT and
MIX and maybe PAN) may well be interested in this reason!
The dataset, which can be downloaded from his URL
http://www.anicca-vijja.de/lg/dframe.Rdata
consists of a matrix with 74 columns and 200 rows.
There are 553 missing values out of the 14800 (less
than 4%), and the distributions of the observed values
of the variables are well-behaved. So this should not
be a problematic dataset.
61 of the 74 columns have missing values (NAs) in them,
and this is the reason why NORM fails.
Specifically, the first few lines of the code of the
function prelim.norm() are as follows:
if (is.vector(x))
x <- matrix(x, length(x), 1)
n <- nrow(x)
p <- ncol(x)
storage.mode(x) <- "double"
r <- 1 * is.na(x)
nmis <- as.integer(apply(r, 2, sum))
names(nmis) <- dimnames(x)[[2]]
mdp <- as.integer((r %*% (2^((1:ncol(x)) - 1))) + 1)
and, as can be seen from the last line, if there are
missing values in a column with index > 31 then
(r %*% (2^((1:ncol(x)) - 1))) + 1 >= 2^31
and then applying as.integer() to this value returns NA
since as.integer only works for numbers no greater than
.Machine$integer.max, normally 2^31 - 1. (Is the situation
different for R on say 64-bit machines?)
The value of mdp[i] is a "packed" binary encoding of the
column positions of any NAs in row i: if bit j-1 (counting
from 0) in the binary representation of mdp[i] is 1, then
there is an NA in column j of row i.
The vector mdp is used at various places in the NORM routines,
and the effect on the imputations of having NAs in it, when
the functioning of the routines depends on unpacking the
encoding, is catastrophic. (Experiment had shown, indeed,
that imputing with a subset of fewer than 32 columns always
gave acceptable results).
The upshot of this is that NORM cannot be used for multiple
imputations if there are more than 31 columns in the data
which have NAs in them.
You could have more than 31 columns of data -- indeed Leo's
74 would have worked then -- if the columns are re-ordered
so that all the columns with NAs are at the left, provided
there are fewer than 32 with NAs. Unfortunately Leo has 61.
There is in principle no necessity to represent NA positions
in this way, but that is how Shafer did it and it was carried
over into R. An alternative method would simply be to have
a 0/1 matrix of NA indicators, but the code for the NORM
functions would have to be picked through to replace the
unpacking of mdp -- and this includes FORTRAN routines
(Oh dear, echoes of the "open source and R" discussion)!
So removing this limitation would not be trivial.
I have not noticed mention of the limitation in the documentation
of the NORM functions.
Exactly the same construction of mdp, and therefore exactly the
same problem, occurs in prelim.cat in CAT, for which I'm joint
maintainer with Fernando Tusell, so we had better try to look
into that! Any lessons we learn will be broadcast, so should be
useful for NORM as well.
And, for good measure, in MIX it occurs twice over in prelim.mix:
once in constructing mdpz for the continuous variables, and
once in mdpw for the categorical variables. This is perhaps
less likely in practice to cause the problem in MIX, since it
would arise only if either there were more than 31 columns of
continuous variables with NAs, or more than 31 of categorical
variables; so MIXers can spread their bets.
Again, I have not noticed that the limitation is mentioned
in the documentation of MIX; and I'm pretty sure it is not
in the documentation of CAT!
Any suggestions or guidance from people who are familiar with
NORM and MIX will be most welcome.
I should add that I have not looked into PAN, but would not
be surprised if it were there as well.
I've written this explanation in consultation with Leo G??rtler,
and he has proposed that I should publish it to the R List;
but please consider that it is a joint effort.
Best wishes to all,
Ted.
--------------------------------------------------------------------
E-Mail: (Ted Harding) <Ted.Harding at nessie.mcc.ac.uk>
Fax-to-email: +44 (0)870 094 0861
Date: 15-Nov-05 Time: 00:45:46
------------------------------ XFMail ------------------------------