Dear R-list,
I'm not sure whether what I've found in the DAAG package is a bug.
While using cv.lm (DAAG), I noticed something that seems wrong with it:
it cannot handle a linear model with more than one predictor variable,
yet the usage documentation does not mention this limitation.
The following code illustrates my discovery:
> library(DAAG)
> xx=matrix(rnorm(20*3),ncol=3)
> bb=c(1,2,0)
> yy=xx%*%bb+rnorm(20,0,10)
>
> data=data.frame(y=yy,x=xx)
> myformula=formula("y ~ x.1 + x.2 + x.3")
> cv.lm(data,myformula,plotit=F, printit=TRUE)
Analysis of Variance Table
Response: yv
Df Sum Sq Mean Sq F value Pr(>F)
xv 1 37 37 0.29 0.6
Residuals 18 2288 127
fold 1
Observations in test set: 4 6 7 9 10 19
X1 X2 X3 X4 X5 X6
x.1 -0.0316 -0.342 -1.44 1.42 -0.446 0.042
Predicted -1.6335 -0.990 1.29 -4.64 -0.773 -1.786
y -16.7876 -25.954 -14.67 -2.29 -28.118 7.731
Residual -15.1541 -24.964 -15.96 2.35 -27.344 9.517
Sum of squares = 1951 Mean square = 325 n = 6
fold 2
Observations in test set: 5 11 12 14 15 16 20
X1 X2 X3 X4 X5 X6 X7
x.1 0.472 0.282 2.20 1.75 0.253 -0.0938 0.1543
Predicted -5.089 -5.385 -2.40 -3.10 -5.431 -5.9707 -5.5842
y -5.894 -8.855 -7.32 2.88 -16.414 -3.0530 0.0434
Residual -0.805 -3.470 -4.92 5.97 -10.983 2.9177 5.6276
Sum of squares = 233 Mean square = 33.3 n = 7
fold 3
Observations in test set: 1 2 3 8 13 17 18
X1 X2 X3 X4 X5 X6 X7
x.1 0.429 1.925 0.31 -0.0194 -1.45 -0.836 0.00308
Predicted -8.592 -0.873 -9.20 -10.9030 -18.28 -15.117 -10.78682
y 11.045 -8.562 6.64 -14.6833 6.95 0.873 1.41586
Residual 19.637 -7.689 15.84 -3.7803 25.23 15.990 12.20268
Sum of squares = 1751 Mean square = 250 n = 7
Overall ms
197
######################################################## Note: the
model ("y ~ x.1 + x.2 + x.3") produces an Overall ms of 197.
> myformula=formula("y ~ x.1 + x.2")
> cv.lm(data,myformula,plotit=F, printit=TRUE)
Analysis of Variance Table
Response: yv
Df Sum Sq Mean Sq F value Pr(>F)
xv 1 37 37 0.29 0.6
Residuals 18 2288 127
fold 1
Observations in test set: 4 6 7 9 10 19
X1 X2 X3 X4 X5 X6
x.1 -0.0316 -0.342 -1.44 1.42 -0.446 0.042
Predicted -1.6335 -0.990 1.29 -4.64 -0.773 -1.786
y -16.7876 -25.954 -14.67 -2.29 -28.118 7.731
Residual -15.1541 -24.964 -15.96 2.35 -27.344 9.517
Sum of squares = 1951 Mean square = 325 n = 6
fold 2
Observations in test set: 5 11 12 14 15 16 20
X1 X2 X3 X4 X5 X6 X7
x.1 0.472 0.282 2.20 1.75 0.253 -0.0938 0.1543
Predicted -5.089 -5.385 -2.40 -3.10 -5.431 -5.9707 -5.5842
y -5.894 -8.855 -7.32 2.88 -16.414 -3.0530 0.0434
Residual -0.805 -3.470 -4.92 5.97 -10.983 2.9177 5.6276
Sum of squares = 233 Mean square = 33.3 n = 7
fold 3
Observations in test set: 1 2 3 8 13 17 18
X1 X2 X3 X4 X5 X6 X7
x.1 0.429 1.925 0.31 -0.0194 -1.45 -0.836 0.00308
Predicted -8.592 -0.873 -9.20 -10.9030 -18.28 -15.117 -10.78682
y 11.045 -8.562 6.64 -14.6833 6.95 0.873 1.41586
Residual 19.637 -7.689 15.84 -3.7803 25.23 15.990 12.20268
Sum of squares = 1751 Mean square = 250 n = 7
Overall ms
197
######################################################## Note: the
model ("y ~ x.1 + x.2") also produces an Overall ms of 197.
> myformula=formula("y ~ x.1 ")
> cv.lm(data,myformula,plotit=F, printit=TRUE)
Analysis of Variance Table
Response: yv
Df Sum Sq Mean Sq F value Pr(>F)
xv 1 37 37 0.29 0.6
Residuals 18 2288 127
fold 1
Observations in test set: 4 6 7 9 10 19
X1 X2 X3 X4 X5 X6
x.1 -0.0316 -0.342 -1.44 1.42 -0.446 0.042
Predicted -1.6335 -0.990 1.29 -4.64 -0.773 -1.786
y -16.7876 -25.954 -14.67 -2.29 -28.118 7.731
Residual -15.1541 -24.964 -15.96 2.35 -27.344 9.517
Sum of squares = 1951 Mean square = 325 n = 6
fold 2
Observations in test set: 5 11 12 14 15 16 20
X1 X2 X3 X4 X5 X6 X7
x.1 0.472 0.282 2.20 1.75 0.253 -0.0938 0.1543
Predicted -5.089 -5.385 -2.40 -3.10 -5.431 -5.9707 -5.5842
y -5.894 -8.855 -7.32 2.88 -16.414 -3.0530 0.0434
Residual -0.805 -3.470 -4.92 5.97 -10.983 2.9177 5.6276
Sum of squares = 233 Mean square = 33.3 n = 7
fold 3
Observations in test set: 1 2 3 8 13 17 18
X1 X2 X3 X4 X5 X6 X7
x.1 0.429 1.925 0.31 -0.0194 -1.45 -0.836 0.00308
Predicted -8.592 -0.873 -9.20 -10.9030 -18.28 -15.117 -10.78682
y 11.045 -8.562 6.64 -14.6833 6.95 0.873 1.41586
Residual 19.637 -7.689 15.84 -3.7803 25.23 15.990 12.20268
Sum of squares = 1751 Mean square = 250 n = 7
Overall ms
197
######################################################## Note: the
model ("y ~ x.1") ALSO produces an Overall ms of 197.
Three different linear models give three identical MSs (mean squared errors)!?
I was then eager to know why three different linear models gave
identical MSs (mean squared errors). I checked the source code of
cv.lm (DAAG) and found that the residuals were all derived from a
model with only one predictor. Is this a bug, or have I misunderstood
something about cv.lm (DAAG)?
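As a cross-check, here is a minimal manual 3-fold cross-validation sketch
using plain lm() and predict() — this is NOT DAAG's implementation, and the
fold assignment is random, so the numbers will not match the output above;
it only shows that the three formulas should in general give different
cross-validated mean squares:

```r
# Minimal manual 3-fold CV sketch (assumption: random fold assignment,
# unlike cv.lm's own scheme), to cross-check the cv.lm results above.
set.seed(1)
xx  <- matrix(rnorm(20 * 3), ncol = 3)
bb  <- c(1, 2, 0)
yy  <- xx %*% bb + rnorm(20, 0, 10)
dat <- data.frame(y = yy, x = xx)   # columns: y, x.1, x.2, x.3

cv.ms <- function(form, dat, k = 3) {
  fold <- sample(rep(1:k, length.out = nrow(dat)))
  res  <- numeric(nrow(dat))
  for (i in 1:k) {
    fit <- lm(form, data = dat[fold != i, ])          # fit on k-1 folds
    res[fold == i] <- dat$y[fold == i] -
                      predict(fit, dat[fold == i, ])  # predict held-out fold
  }
  mean(res^2)   # overall cross-validated mean square
}

sapply(c("y ~ x.1 + x.2 + x.3", "y ~ x.1 + x.2", "y ~ x.1"),
       function(f) cv.ms(formula(f), dat))
# If cv.lm used all the predictors, these three values would
# generally differ, unlike the identical 197 reported above.
```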
Li Junjie
--
Junjie Li, klijunjie@gmail.com
Undergraduate in DEP of Tsinghua University,