Sicotte, Hugues Ph.D.
2006-Nov-01 17:37 UTC
[R] Fitting mean and covariance of Multivariate normal with censored data
Hello, I have been googling for 2 days and I cannot find the answer in previous posts. I have a set of d-dimensional data elements (d=11 .. 14), each data point can be censored at different values both Lower-limit and upper limit. N = 2000 sets of vectors of D=11 data points per vector. Each of the N*D points can have different upper and lower limits. I "simply" want to fit a Multivariate distribution to the data and get the mean and covariance matrix of the sample. The closest post I saw mentionned the function survreg in the survival5 package, but I can't figure out how To use it for a non-regression model. Thanks, Hugues [[alternative HTML version deleted]]
Sicotte, Hugues Ph.D.
2006-Nov-13 22:42 UTC
[R] Fitting mean and covariance of Multivariate normal with censored data
Nobody answered, but this is what I did. I used an iterative poor man's imputation and filled in the data with a single value at each iteration. My covariance matrices and variance will be a little underestimated, but it'll do for me at this stage of this project. I iteratively filled in the data using the current data to estimate the parameters of the normal distribution. Using the current estimates, I figured out how much of the current distribution is outside each cut-off. Then fill-in at the mid-point of the probability density that is outside the area. I limited my corrections to 3*sigma. Here is the code section inside the loop (I am not showing another part of the code that does outlier elimination, nor the rest of the loop that recomputes the covariance and the parameters of the normal distribution) Trim the data to only use "lines" with at least 60% real data. Trim the columns to those that have enough data to estimate the parameters of the normal distribution (e.g. ~ 20) Loop over iter iterations ( from 1 to iterlast) Compute mean, variance, and co-variance matrix. Code to detect outliers.. Only activated after "iterlast" iterations Sort and trim worst outliers (left and right, up to 10% of data) For Each "line" of data, find out which elements were missing in original data set. for(im in missings) { # Compute the quantile on the cutoff away from N(mode,variance) for both cutoff quantOver<-pnorm(c(highR),mean=mu[im],sd=sigmaMu[im]) quantUnder<-pnorm(c(lowR),mean=mu[im],sd=sigmaMu[im]) # Find out which end of the distribution is the most outside of the limit quantOverMid<- (1.0-quantOver)/2 quantUnderMid<- quantUnder/2 # correct the mean, but underestimate the variance a little bit in cases # where width of pattern> width of cutoff-region fixUp<-mu[im] fixLow<-mu[im] # Don't fix if distribution is cutoff at edges. if(quantOverMid<0.05) { quantOverMid<-0.0 } else { fixUp<-qnorm(c(1.0-quantOverMid),mean=mu[im],sd=sigmaMu[im]) } if(quantUnderMid<0.05) { quantUnderMid<-0.0 } else { fixLow<-qnorm(c(quantUnderMid),mean=mu[im],sd=sigmaMu[im]) } # Censored variables "pull" from both sides. quantNormSum <- (quantOverMid+quantUnderMid) if(quantNormSum>=0.05) { quantNormSum<- 1.0/quantNormSum fix <- (fixUp*quantOverMid+fixLow*quantUnderMid)*quantNormSum } else { # If missing data is in the tails, correct by cutoff. if(quantNormSum>1e-10) { fix<- (highR*(1.0-quantOver)+lowR*quantUnder)/((1.0-quantOver)+quantUnder) } else { fix<-mu[im] } } if(!is.finite(fix)) { if(1.0-quantOver > quantUnder) { fix<-mu[im]+3*sigmaMu[im] } else { fix<- mu[im]-3*sigmaMu[im] } } else { if(fix>mu[im]+3*sigmaMu[im]) { fix<-mu[im]+3*sigmaMu[im] } else if (fix< mu[im]-3*sigmaMu[im]) { fix<- mu[im]-3*sigmaMu[im] } }> _____________________________________________ > From: Sicotte, Hugues Ph.D. > Sent: Wednesday, November 01, 2006 11:38 AM > To: 'r-help@stat.math.ethz.ch' > Subject: Fitting mean and covariance of Multivariate normal with > censored data > > Hello, > I have been googling for 2 days and I cannot find the answer > in previous posts. > > I have a set of d-dimensional data elements (d=11 .. 14), each data > point can be censored at different values both > Lower-limit and upper limit. > > N = 2000 sets of vectors of > D=11 data points per vector. > Each of the N*D points can have different upper and lower limits. > > I "simply" want to fit a Multivariate distribution to the data and get > the mean and covariance matrix of the sample. > > The closest post I saw mentionned the function survreg in the > survival5 package, but I can't figure out how > To use it for a non-regression model. > > Thanks, > Hugues >[[alternative HTML version deleted]]