Dear R-help
This is a follow-up to my previous post here:
http://groups.google.com/group/r-help-archive/browse_thread/thread/d9b6f87ce06a9fb7/e9be30a4688f239c?lnk=gst&q=dobomode#e9be30a4688f239c
I am working on developing an open-source automated system for running
batch-regressions on very large datasets. In my previous post, I posed
the question of obtaining VIF's from the output of BIGLM. With a lot
of help from Assoc. Professor, Biostatistics Thomas Lumley at
University of Washington, I was able to make significant progress, but
ultimately got stuck. The following post describes the steps and
reasoning I undertook in trying to accomplish this task. Please note
that I am not a statistician so ignore any commentary that seems naive
to you.
A quick intro. The goal is to obtain VIF's (variance inflation
factors) from the regression output of BIGLM. Traditionally, this has
been possible with the regular lm() function. Follows a quick
illustration (the model below is pretty silly, only for illustration
purposes).
Example dataset:
> mtcars
mpg cyl disp hp drat wt qsec vs am gear
carb
Mazda RX4 21.0 6 160.0 110 3.90 2.620 16.46 0 1 4
4
Mazda RX4 Wag 21.0 6 160.0 110 3.90 2.875 17.02 0 1 4
4
Datsun 710 22.8 4 108.0 93 3.85 2.320 18.61 1 1 4
1
Hornet 4 Drive 21.4 6 258.0 110 3.08 3.215 19.44 1 0 3
1
Hornet Sportabout 18.7 8 360.0 175 3.15 3.440 17.02 0 0 3
2
Valiant 18.1 6 225.0 105 2.76 3.460 20.22 1 0 3
1
Duster 360 14.3 8 360.0 245 3.21 3.570 15.84 0 0 3
4
Merc 240D 24.4 4 146.7 62 3.69 3.190 20.00 1 0 4
2
Merc 230 22.8 4 140.8 95 3.92 3.150 22.90 1 0 4
2
Merc 280 19.2 6 167.6 123 3.92 3.440 18.30 1 0 4
4
Merc 280C 17.8 6 167.6 123 3.92 3.440 18.90 1 0 4
4
Merc 450SE 16.4 8 275.8 180 3.07 4.070 17.40 0 0 3
3
Merc 450SL 17.3 8 275.8 180 3.07 3.730 17.60 0 0 3
3
Merc 450SLC 15.2 8 275.8 180 3.07 3.780 18.00 0 0 3
3
Cadillac Fleetwood 10.4 8 472.0 205 2.93 5.250 17.98 0 0 3
4
Lincoln Continental 10.4 8 460.0 215 3.00 5.424 17.82 0 0 3
4
Chrysler Imperial 14.7 8 440.0 230 3.23 5.345 17.42 0 0 3
4
Fiat 128 32.4 4 78.7 66 4.08 2.200 19.47 1 1 4
1
Honda Civic 30.4 4 75.7 52 4.93 1.615 18.52 1 1 4
2
Toyota Corolla 33.9 4 71.1 65 4.22 1.835 19.90 1 1 4
1
Toyota Corona 21.5 4 120.1 97 3.70 2.465 20.01 1 0 3
1
Dodge Challenger 15.5 8 318.0 150 2.76 3.520 16.87 0 0 3
2
AMC Javelin 15.2 8 304.0 150 3.15 3.435 17.30 0 0 3
2
Camaro Z28 13.3 8 350.0 245 3.73 3.840 15.41 0 0 3
4
Pontiac Firebird 19.2 8 400.0 175 3.08 3.845 17.05 0 0 3
2
Fiat X1-9 27.3 4 79.0 66 4.08 1.935 18.90 1 1 4
1
Porsche 914-2 26.0 4 120.3 91 4.43 2.140 16.70 0 1 5
2
Lotus Europa 30.4 4 95.1 113 3.77 1.513 16.90 1 1 5
2
Ford Pantera L 15.8 8 351.0 264 4.22 3.170 14.50 0 1 5
4
Ferrari Dino 19.7 6 145.0 175 3.62 2.770 15.50 0 1 5
6
Maserati Bora 15.0 8 301.0 335 3.54 3.570 14.60 0 1 5
8
Volvo 142E 21.4 4 121.0 109 4.11 2.780 18.60 1 1 4
2
Example model:
model <- mpg ~ cyl + disp + hp + drat + wt + qsec + vs + am + gear +
carb
Regression:
reg_lm <- lm(model, mtcars)
Results:
> summary(reg_lm)
Call:
lm(formula = model, data = mtcars)
Residuals:
Min 1Q Median 3Q Max
-3.4506 -1.6044 -0.1196 1.2193 4.6271
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 12.30337 18.71788 0.657 0.5181
cyl -0.11144 1.04502 -0.107 0.9161
disp 0.01334 0.01786 0.747 0.4635
hp -0.02148 0.02177 -0.987 0.3350
drat 0.78711 1.63537 0.481 0.6353
wt -3.71530 1.89441 -1.961 0.0633 .
qsec 0.82104 0.73084 1.123 0.2739
vs 0.31776 2.10451 0.151 0.8814
am 2.52023 2.05665 1.225 0.2340
gear 0.65541 1.49326 0.439 0.6652
carb -0.19942 0.82875 -0.241 0.8122
---
Signif. codes: 0 ?***? 0.001 ?**? 0.01 ?*? 0.05 ?.? 0.1 ? ? 1
Residual standard error: 2.65 on 21 degrees of freedom
Multiple R-squared: 0.869, Adjusted R-squared: 0.8066
F-statistic: 13.93 on 10 and 21 DF, p-value: 3.793e-07
VIF's:
> vif(reg_lm)
cyl disp hp drat wt qsec
vs am gear carb
15.373833 21.620241 9.832037 3.374620 15.164887 7.527958 4.965873
4.648487 5.357452 7.908747
Here is the definition of vif() (courtesy of
http://www.stat.sc.edu/~hitchcock/bodyfatRexample.txt):
> vif.lm
function(object, ...) {
V <- summary(object)$cov.unscaled
Vi <- crossprod(model.matrix(object))
nam <- names(coef(object))
if(k <- match("(Intercept)", nam, nomatch = F)) {
v1 <- diag(V)[-k]
v2 <- (diag(Vi)[-k] - Vi[k, -k]^2/Vi[k,k])
nam <- nam[-k]
} else {
v1 <- diag(V)
v2 <- diag(Vi)
warning("No intercept term detected. Results may
surprise.")
}
structure(v1*v2, names = nam)
}
My understanding is that the function uses the unscaled variance-
covariance matrix V and the model matrix Vi to compute the VIF's based
on the equations above.
Now, attempting to do the same using biglm().
> reg_biglm <- biglm(model, mtcars)
> summary(reg_biglm)
Large data regression model: biglm(model, mtcars)
Sample size = 32
Coef (95% CI) SE p
(Intercept) 12.303 -25.132 49.739 18.718 0.511
cyl -0.111 -2.201 1.979 1.045 0.915
disp 0.013 -0.022 0.049 0.018 0.455
hp -0.021 -0.065 0.022 0.022 0.324
drat 0.787 -2.484 4.058 1.635 0.630
wt -3.715 -7.504 0.074 1.894 0.050
qsec 0.821 -0.641 2.283 0.731 0.261
vs 0.318 -3.891 4.527 2.105 0.880
am 2.520 -1.593 6.634 2.057 0.220
gear 0.655 -2.331 3.642 1.493 0.661
carb -0.199 -1.857 1.458 0.829 0.810
The challenge is to adapt the vif() function to work with biglm. I was
able to obtain the model matrix Vi using the following method.
biglm supports the calculation of the Huber/White sandwich covariance
matrix:
> reg_biglm <- biglm(model, mtcars,, TRUE)
> sandwich <- vcov(reg_biglm)
Out of this, I can extract the model-based scaled variance-covariance
matrix:
> v.scaled <- attr(vcov(reg_biglm), "model-based")
Based on my research (big thanks to Assoc. Professor, Biostatistics
Thomas Lumley at University of Washington):
sandwich = v.scaled ? model_matrix ? v.scaled
It is the model_matrix I need to obtain to compute the VIF's. I do so
by inverting v.scaled and moving them to the other side of the
equation.
> v.scaled.inverted <- solve(v.scaled)
> model_matrix <- v.scaled.inverted %*% sandwich %*% v.scaled.inverted
At this point, I have both the variance-covariance matrix and the
model matrix, so I should be able to compute the VIF's using the
function above. However, on examining the contents of the variance-
covariance matrix, I see that the one from the lm() function is
scaled, while the one I obtained from the sandwich calculations is
not:
> head(v.scaled)
(Intercept) cyl disp
hp drat wt qsec vs
am gear
(Intercept) 350.359197487 -13.163824437 -0.0059101820 -0.0266100189
-12.941827906 3.156000750 -10.546511742 3.646214479 -8.8845922019
-11.357908146
cyl -13.163824437 1.092073828 -0.0049669381 -0.0041817837
0.471219880 0.223389955 0.209717934 0.705445371 0.5605478034
0.550163881
disp -0.005910182 -0.004966938 0.0003188903 -0.0002040364
-0.003384402 -0.026026366 0.003725863 0.003764231 0.0009902057
-0.002094405
hp -0.026610019 -0.004181784 -0.0002040364 0.0004738710
0.003095272 0.009914533 0.001724661 -0.012474163 -0.0021256566
-0.002907768
drat -12.941827906 0.471219880 -0.0033844018 0.0030952720
2.674445074 0.521759369 0.043438912 -0.102380147 -0.5260445782
-0.182954006
wt 3.156000750 0.223389955 -0.0260263658 0.0099145334
0.521759369 3.588805538 -0.702017025 0.338026038 0.3675570022
0.513001888
> head(summary(reg_lm)$cov.unscaled)
(Intercept) cyl disp
hp drat wt qsec vs
am gear
(Intercept) 49.8835321876 -1.8742423910 -8.414814e-04 -3.788688e-03
-1.8426349117 0.449345889 -1.5015939690 0.5191416655 -1.2649727601
-1.6171191755
cyl -1.8742423910 0.1554875692 -7.071840e-04 -5.953951e-04
0.0670914657 0.031805873 0.0298592741 0.1004400830 0.0798098197
0.0783313749
disp -0.0008414814 -0.0007071840 4.540305e-05 -2.905034e-05
-0.0004818652 -0.003705589 0.0005304819 0.0005359447 0.0001409838
-0.0002981977
hp -0.0037886881 -0.0005953951 -2.905034e-05 6.746893e-05
0.0004406994 0.001411614 0.0002455543 -0.0017760496 -0.0003026473
-0.0004140029
drat -1.8426349117 0.0670914657 -4.818652e-04 4.406994e-04
0.3807828305 0.074287190 0.0061847567 -0.0145767070 -0.0748973106
-0.0260486727
wt 0.4493458888 0.0318058726 -3.705589e-03 1.411614e-03
0.0742871900 0.510967881 -0.0999519610 0.0481275585 0.0523321257
0.0730403152
They differ by a constant factor:
> head(v.scaled) / head(summary(reg_lm)$cov.unscaled)
(Intercept) cyl disp hp drat
wt qsec vs am gear carb
(Intercept) 7.023544 7.023544 7.023544 7.023544 7.023544 7.023544
7.023544 7.023544 7.023544 7.023544 7.023544
cyl 7.023544 7.023544 7.023544 7.023544 7.023544 7.023544
7.023544 7.023544 7.023544 7.023544 7.023544
disp 7.023544 7.023544 7.023544 7.023544 7.023544 7.023544
7.023544 7.023544 7.023544 7.023544 7.023544
hp 7.023544 7.023544 7.023544 7.023544 7.023544 7.023544
7.023544 7.023544 7.023544 7.023544 7.023544
drat 7.023544 7.023544 7.023544 7.023544 7.023544 7.023544
7.023544 7.023544 7.023544 7.023544 7.023544
wt 7.023544 7.023544 7.023544 7.023544 7.023544 7.023544
7.023544 7.023544 7.023544 7.023544 7.023544
This factor turns out to be the squared value of the regression's
sigma:
> summary(reg_lm)$sigma ^ 2
[1] 7.023544
Sigma is defined as:
the square root of the estimated variance of the random error
sigma^2 = 1/(n-p) Sum(w[i] R[i]^2),
where R[i] is the i-th residual, residuals[i].
The ultimate challenge is that sigma is returned by lm(), but not by
biglm(). To calcualte sigma, it looks like I would need to compute
the residuals for the entire dataset which defeats the purpose of biglm
(). Does anybody know if there is an easier way of obtaining sigma?
Any guidance on this subject would be greatly appreciated!