Fabian Scheipl
2006-Jul-20 21:09 UTC
[R] Loss of numerical precision from conversion to list ?
I'm working on an R implementation of the simulation-based finite-sample null distribution of the (R)LR test in mixed models (i.e. testing for Var(RandomEffect) = 0) derived by C. M. Crainiceanu and D. Ruppert.

I'm in the beginning stages of this project, and while comparing quick-and-dirty grid-search methods with more exact optim()/optimize()-based methods for finding the maximum of a part of the RLR test statistic, I stumbled upon the following problem:

It seems to me that R produces different results depending on whether originally identical numbers involved in the exact same computations are read from a matrix or a list.
(I need both: matrices for quick vectorized computation in the grid search, and "list-based" computation so that I can put the function to be maximized into something like mapply(..., optim(foo), ...) - I can elaborate if desired.)

However, the problem goes away once a number involved in the computation is changed from almost zero (on the order of 1e-15) to 4.
I'm completely mystified by this, especially since the number I change is NOT one of the numbers that are switched from matrix to list.

Here's the code:

library(nlme)
data(Orthodont)   # 108 dental measurements on 27 subjects
# m1 <- lme(distance ~ age, random = ~1 | Subject, data = Orthodont)
# summary(m1)
# ...
# Random effects:
#  Formula: ~1 | Subject
#         (Intercept) Residual
# StdDev:    2.114724 1.431592   -> lambda.REML = 2.114^2/1.431^2 = 2.182382

# Design matrix for fixed effects
X <- cbind(rep(1, 108), Orthodont$age)
# Design matrix of random effects
Z <- matrix(data = c(rep(1, 4), rep(0, 108)), nrow = 108, ncol = 27)

# Corr(RanEf)^0.5 = 27 x 27 identity, since the random intercepts are independent
sqrt.Sigma <- diag(27)

K <- 27          # number of subjects / random intercepts
n <- nrow(X)
p <- ncol(X)
lambda0 <- 2.182382   # actually not a sensible choice as null hypothesis, but that doesn't pertain to the problem

# Projection matrix for the fixed-effects model: Y -> errors
P0 <- diag(n) - X %*% solve(t(X) %*% X) %*% t(X)

mu <- eigen(sqrt.Sigma %*% t(Z) %*% P0 %*% Z %*% sqrt.Sigma)$values
# mu
#  [1] 4.00000e+00 4.00000e+00 4.00000e+00 ... (values 1 to 26 are all 4) ...
# [27] 5.77316e-15
# !
Notice the last (27th) value, which is very close to 0.

nsim <- 10
set.seed(10)
# nsim x K matrix of ChiSq(1) variates
w.k.sq.mat <- matrix(rchisq(nsim * K, 1), nrow = nsim)
# nsim x 1 vector of ChiSq(n - p - K) variates
w.sum2 <- rchisq(nsim, n - p - K)

### vectorized computation of nsim = 10 realizations
### of a part of the RLR statistic under the null:
w.k.sq <- cbind(w.k.sq.mat, w.sum2)   # nsim x (K+1)
# NB: lambda (the candidate value of the variance ratio) is used below but is
# not assigned anywhere in this excerpt.
# vector-based results:
num.v <- rowSums(((lambda - lambda0) * mu * w.k.sq[, -(K + 1)]) / (1 + lambda * mu))
den.v <- rowSums(((1 + lambda0 * mu) * w.k.sq[, -(K + 1)]) / (1 + lambda * mu)) + w.k.sq[, K + 1]

### list-based computation of nsim = 10 realizations
### of a part of the RLR statistic under the null:
w.k.sq <- list()
length(w.k.sq) <- nsim
# put the nsim rows into list slots:
for (i in 1:nsim) w.k.sq[[i]] <- c(w.k.sq.mat[i, ], w.sum2[i])
num.l <- numeric(0)
den.l <- numeric(0)
for (i in 1:nsim)
{
  # exactly analogous to num.v and den.v, except list elements instead of matrix rows
  num.l[i] <- sum(((lambda - lambda0) * mu * w.k.sq[[i]][-(K + 1)]) / (1 + lambda * mu))
  den.l[i] <- sum(((1 + lambda0 * mu) * w.k.sq[[i]][-(K + 1)]) / (1 + lambda * mu)) + w.k.sq[[i]][K + 1]
}

# Now the actual problem:
# notice the discrepancies between the results from the vectorized computation
# and the results from the list-based computation.
# Since the discrepancies disappear if mu[27] is changed
# from 5.77316e-15 to 4, I'm guessing that somewhere in the conversion to
# "list" there must be a loss of precision - or is there an entirely
# different problem?

num.l
# [1] -25.93322 -17.65486 -18.80239 -19.49974 ...
num.v
# [1] -23.84733 -17.62233 -27.22975 -19.50294 ...

den.l
# [1] 117.30246  92.59041  92.91491 112.90113 ...
den.v
# [1] 115.21657  92.55789 101.34228 112.90433 ...

# now I set
mu[27] <- 4
# and reran the computation of num.l/.v and den.l/.v from above:

num.l
# [1] -26.25565 -17.67423 -27.47259 -20.97961 ...
num.v
# [1] -26.25565 -17.67423 -27.47259 -20.97961 ...
den.l
# [1] 117.62489  92.60979 101.58511 114.38100 ...
den.v
# [1] 117.62489  92.60979 101.58511 114.38100 ...

What I would like to know now is:

1) Which of the two calculations yields the more precise result?
Or rather:
2) How can I avoid these discrepancies in the future, since I need to be able to compare these two methods?
And, most importantly,
3) What in R. A. Fisher's name is happening here?

Version information:

Version 2.3.1 (2006-06-01)
i386-pc-mingw32
.Machine$double.eps is 2.220446e-16 (does it matter?)

Thanks for your time,

--
Fabian Scheipl
f.abian at gmx.net
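As a minimal sketch (not part of the original post) of the list-based optimize() usage mentioned above: the objective part.stat and the search interval c(0, 10) are placeholders built only from the numerator/denominator pieces shown in the code, not the full RLR test statistic.

# Placeholder objective built from the num/den pieces above; the real
# RLR statistic has additional terms, so this is only illustrative.
part.stat <- function(lambda, w, mu, lambda0, K) {
  num <- sum(((lambda - lambda0) * mu * w[-(K + 1)]) / (1 + lambda * mu))
  den <- sum(((1 + lambda0 * mu) * w[-(K + 1)]) / (1 + lambda * mu)) + w[K + 1]
  num / den
}

# One maximization per list element over a placeholder interval for lambda;
# sapply() is used here, mapply() would work analogously.
max.lambda <- sapply(w.k.sq, function(w)
  optimize(part.stat, interval = c(0, 10), maximum = TRUE,
           w = w, mu = mu, lambda0 = lambda0, K = K)$maximum)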
Duncan Murdoch
2006-Jul-20 23:07 UTC
[R] Loss of numerical precision from conversion to list ?
R tries to use the maximum precision (64 bit mantissa) in the floating point registers when it can. When it stores results to memory, they are stored in double precision (53 bit mantissa). There's unlikely to be anything specific about conversion to a list that lost the precision, but I'm guessing one version of your code stored things to memory, whereas the other kept intermediate results in registers.

Using the maximum precision is a somewhat controversial choice: there's an argument that it's best to get consistent results, even if they're wrong. R has chosen to try to do the best it can, even if it means sometimes it is inconsistent. Another argument is that algorithms that depend strongly on values beyond the first 53 bits are very unstable, and should be replaced by more stable ones that don't inflate small errors.

Or perhaps your problem has nothing to do with this; I didn't really look at it in detail.

Duncan Murdoch
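A small illustration of the point about intermediate results (not from the original thread; the variables below are made up): two mathematically equivalent ways of accumulating the same numbers can disagree in the last bits, which is one reason results from different code paths are better compared with a tolerance (all.equal) than with == or identical().

set.seed(1)
x <- rnorm(1e5)

s.vec  <- sum(x)                        # accumulation handled internally by sum()
s.loop <- 0
for (xi in x) s.loop <- s.loop + xi     # every intermediate stored back into a double

s.vec == s.loop             # may well be FALSE
s.vec - s.loop              # typically a tiny nonzero difference in the last bits
all.equal(s.vec, s.loop)    # TRUE: equal up to a numerical tolerance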