Ritter, Christian C MCIL-CTGAS
2002-Dec-19 11:17 UTC
[R] Ongoing unhappiness with NA and factor behavior of distributed lm/predict.lm
Hi all,

I'm still not quite happy with the NA and factor handling of lm and predict.lm in R 1.6.1 (forcing me to use my not very skillfully crafted patches).

Here is problem 1:

> print(data<-data.frame(y=c(0.9,2.05,3.02,NA,5.2),x1=c(1:4,NA),x2=factor(c("blue","blue","green","green","green"),levels=c("blue","green"))))
     y x1    x2
1 0.90  1  blue
2 2.05  2  blue
3 3.02  3 green
4   NA  4 green
5 5.20 NA green
> fit<-lm(y~x1+x2,data=data,na.action=na.exclude)
> predict(fit,data)
   1    2    3    4
0.90 2.05 3.02 4.17

Interpretation: There are two NAs, one in the response and one in an explanatory variable. If I understand the action of na.exclude correctly, I would have expected

   1    2    3    4    5
0.90 2.05 3.02 4.17   NA

and this is what I think it should be (and in my personalized version of predict.lm it does this).

Here is problem 2:

> print(data<-data.frame(y=c(0.9,2.05,3.02,NA,5.2),x1=c(1:4,NA),x2=factor(c("blue","blue","green","green","green"),levels=c("blue","green","yellow"))))
     y x1    x2
1 0.90  1  blue
2 2.05  2  blue
3 3.02  3 green
4   NA  4 green
5 5.20 NA green
> fit<-lm(y~x1+x2,data=data,na.action=na.exclude)
> predict(fit,data)
Error in model.frame.default(object, data, xlev = xlev) :
        factor x2 has new level(s) yellow

Interpretation: Since level "yellow" does not occur in the data (it is in some sense missing), predict.lm blocks. This should not happen. Maybe a warning should be given, but predict.lm should not quit with an error.

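A minimal sketch of a user-level workaround for problems 1 and 2, not the patched predict.lm mentioned above. It assumes every explanatory variable of the model appears as a plain column of the data frame. The helper name predict_padded is hypothetical and not part of R. The idea is to drop unused factor levels before fitting, predict only the rows whose explanatory variables are complete, and pad the remaining rows with NA.

```r
# Data from problem 2: NAs in y and x1, plus an unused level "yellow" in x2.
data <- data.frame(
  y  = c(0.9, 2.05, 3.02, NA, 5.2),
  x1 = c(1:4, NA),
  x2 = factor(c("blue", "blue", "green", "green", "green"),
              levels = c("blue", "green", "yellow"))
)

# Problem 2: re-creating the factor keeps only the levels that actually occur.
data$x2 <- factor(data$x2)

fit <- lm(y ~ x1 + x2, data = data, na.action = na.exclude)

# Hypothetical helper (an assumption for illustration, not part of R):
# predict the rows with complete explanatory variables and return NA
# for all other rows, so the result has one entry per row of newdata.
predict_padded <- function(fit, newdata) {
  # Only the explanatory variables matter; an NA in the response is ignored.
  vars <- all.vars(delete.response(terms(fit)))
  ok <- complete.cases(newdata[, vars, drop = FALSE])
  out <- rep(NA_real_, nrow(newdata))
  names(out) <- rownames(newdata)
  out[ok] <- predict(fit, newdata[ok, , drop = FALSE])
  out
}

pred <- predict_padded(fit, data)
print(pred)
```

With the data above this should return five values: the four predictions shown in problem 1, followed by NA for row 5.
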
Here is problem 3:

> print(data<-data.frame(y=c(0.9,2.05,3.02,NA,5.2),x1=c(1:4,NA),x2=factor(c("blue","blue","green","green","green"),levels=c("blue","green"))))
     y x1    x2
1 0.90  1  blue
2 2.05  2  blue
3 3.02  3 green
4   NA  4 green
5 5.20 NA green
> fit<-lm(y~x1+x2,data=data,na.action=na.exclude)
> print(newdata<-data.frame(y=c(0.9,2.05,3.02,NA,5.2),x1=c(1:4,NA),x2=factor(c("blue","blue","yellow","green","green"),levels=c("blue","green","yellow"))))
     y x1     x2
1 0.90  1   blue
2 2.05  2   blue
3 3.02  3 yellow
4   NA  4  green
5 5.20 NA  green
> predict(fit,newdata)
Error in model.frame.default(object, data, xlev = xlev) :
        factor x2 has new level(s) yellow

Interpretation: Quite naturally, predict doesn't know what to do with a level which wasn't used in the model. So the result should be NA. Maybe a warning should be given that there was a new level in the factor, but predict.lm should not quit with an error.

Fixing these problems would make exploration of residual patterns with respect to variables not included in the model much easier.

Any opinion? Any help in sight?

Thanks in advance,

Chris.

platform i386-pc-mingw32
arch     i386
os       mingw32
system   i386, mingw32
status
major    1
minor    6.1
year     2002
month    11
day      01
language R

Christian Ritter
Functional Specialist Statistics
Shell Coordination Centre S.A.
Monnet Centre International Laboratory, Avenue Jean Monnet 1, B-1348 Louvain-La-Neuve, Belgium
Tel: +32 10 477 349
Fax: +32 10 477 219
Email: christian.ritter@shell.com
Internet: http://www.shell.com/chemicals
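
For problem 3, one possible workaround is to recode levels that were not seen during fitting as NA before calling predict, and then pad the result. This is only a sketch under assumptions: the helper name na_for_new_levels is hypothetical and not part of R, and it assumes each factor in the model is a plain column of newdata (not wrapped in a call such as factor(x) inside the formula). It relies on the xlevels component that lm stores in the fitted object.

```r
# Model and new data from problem 3.
data <- data.frame(
  y  = c(0.9, 2.05, 3.02, NA, 5.2),
  x1 = c(1:4, NA),
  x2 = factor(c("blue", "blue", "green", "green", "green"),
              levels = c("blue", "green"))
)
fit <- lm(y ~ x1 + x2, data = data, na.action = na.exclude)

newdata <- data.frame(
  y  = c(0.9, 2.05, 3.02, NA, 5.2),
  x1 = c(1:4, NA),
  x2 = factor(c("blue", "blue", "yellow", "green", "green"),
              levels = c("blue", "green", "yellow"))
)

# Hypothetical helper (an assumption for illustration, not part of R):
# values of a factor that were not levels at fitting time become NA,
# and a warning replaces the error.
na_for_new_levels <- function(fit, newdata) {
  for (nm in names(fit$xlevels)) {
    known <- fit$xlevels[[nm]]
    seen <- unique(as.character(newdata[[nm]]))
    new_lev <- setdiff(seen, c(known, NA))
    if (length(new_lev) > 0) {
      warning("factor ", nm, " has new level(s) ",
              paste(new_lev, collapse = ", "),
              "; predictions for these rows are set to NA", call. = FALSE)
    }
    # Rebuilding the factor with the known levels maps everything else to NA.
    newdata[[nm]] <- factor(as.character(newdata[[nm]]), levels = known)
  }
  newdata
}

clean <- na_for_new_levels(fit, newdata)

# Predict only rows with complete explanatory variables, NA elsewhere.
ok <- complete.cases(clean[, c("x1", "x2")])
pred <- rep(NA_real_, nrow(clean))
names(pred) <- rownames(clean)
pred[ok] <- predict(fit, clean[ok, , drop = FALSE])
print(pred)
```

Here rows 3 (new level "yellow") and 5 (NA in x1) should come back as NA, with a warning about the new level instead of an error.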