Disclaimer: Lacking local statistical expertise, I'm posting to this list
because I use R for variable selection in multiple linear regression, although
my questions relate more to basic statistics than to R per se. Please redirect
me to a more appropriate list if one exists.
I have the very common problem of identifying which variables (or subset of
variables) are important in a multiple linear regression problem. Googling and
browsing around, I ended up with four methods I could easily access through
R packages. My aim was to apply those methods to continuous variables to find
out which ones had an effect on my dependent variable and, beyond that, to
identify variables amenable to categorization.
Example: population is a continuous variable, so I figured that if some
variable selection algorithm flagged population as "significant", I could then
apply a priori knowledge of the population ranges I suspect behave
differently, and use ANOVA to test whether the means of those classes differ
statistically. Coming back to my original variable selection need, here's
what I did.
First: regular lm
Second: step
Third: all-subsets (regsubsets, package leaps)
Fourth: lasso (l1ce, package lasso2)
Fifth: (I meant to use the lars package, but it does not accept formulas;
I know I could cast my dataset as matrices, but I didn't find an easy way of
doing this and I figured I had enough options already)
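On the matrix conversion mentioned for lars, here is a minimal sketch of one
way it might be done with base R's model.matrix(). The built-in mtcars data
and its variable names are stand-ins (the data set discussed in this post is
not shown), and the lars() call is only attempted if that package happens to
be installed.

```r
# Stand-in data: built-in mtcars, NOT the data set from this post.
form <- mpg ~ wt + hp + disp + qsec

# model.matrix() expands a formula into a numeric design matrix.
# Its first column is the intercept; lars() fits its own intercept by
# default, so that column is dropped here.
X <- model.matrix(form, data = mtcars)[, -1]
y <- mtcars$mpg

print(dim(X))        # one row per observation, one column per predictor
print(colnames(X))

# Optional: only runs when the lars package is available.
if (requireNamespace("lars", quietly = TRUE)) {
  lasso_path <- lars::lars(X, y, type = "lasso")
  print(lasso_path)
}
```

With factors in the formula, model.matrix() would also create the dummy
columns, which is the part that is tedious to do by hand.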
I'm trying to make sense of the information returned by the summary calls.
I'm looking for which variables are identified by each method, hoping to find
comparable or complementary results. Being only an occasional user of
statistics, I find that many of the "Details" sections in the help files lack
specific instructions on exactly how to interpret the results. Here we go.
*lm*
> anova(tonlm)
Analysis of Variance Table

Response: Cout.ton
                     Df  Sum Sq Mean Sq F value    Pr(>F)
Tonnage               1    9720    9720  1.3497 0.2470437
Popul                 1  112361  112361 15.6014 0.0001164 ***
DensiteOcc            1  173350  173350 24.0699 2.245e-06 ***
NbmRues.hab           1     280     280  0.0389 0.8438903
RFU.hab               1     183     183  0.0254 0.8734816
UO.MAMR.Precis        1   67161   67161  9.3254 0.0026428 **
t.CS.t.déchets.MAMR   1   24188   24188  3.3586 0.0686925 .
Pct.Pot.CS.hab        1   78764   78764 10.9365 0.0011614 **
NbCentresTriDs100km   1  218725  218725 30.3702 1.380e-07 ***
DistMarche            1   54114   54114  7.5137 0.0068110 **
Residuals           162 1166717    7202
*ANOVA applied on step(lm)*
Response: Cout.ton
                     Df  Sum Sq Mean Sq F value    Pr(>F)
Popul                 1    2917    2917  0.4038  0.525990
DensiteOcc            1  163741  163741 22.6730 4.179e-06 ***
RFU.hab               1      83      83  0.0115  0.914899
UO.MAMR.Precis        1  168407  168407 23.3191 3.112e-06 ***
Pct.Pot.CS.hab        1  122685  122685 16.9880 5.943e-05 ***
NbCentresTriDs100km   1  190659  190659 26.4002 7.761e-07 ***
DistMarche            1   65467   65467  9.0652  0.003015 **
Residuals           165 1191606    7222
Questions, compared to lm above: What tells me which variable was selected
first in the stepwise process? Do I sort by Pr(>F), with the lowest value
corresponding to the first variable? Popul has a three-star rating in lm and
none in step. How do I interpret that?
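One thing that may bear on these questions: anova() applied to a single lm
fit reports sequential sums of squares, so each row is tested after the terms
listed above it, and the stars can change when the set or order of terms
changes. The sketch below illustrates this on the built-in mtcars data (a
stand-in, with variable names unrelated to this post). It also prints the
$anova component that step() returns, which records the steps in the order
they were taken.

```r
# Stand-in data: built-in mtcars, NOT the data set from this post.
fit_a <- lm(mpg ~ disp + wt + hp, data = mtcars)   # disp entered first
fit_b <- lm(mpg ~ hp + wt + disp, data = mtcars)   # disp entered last

# Sequential (order-dependent) tests: compare the disp row in each table.
print(anova(fit_a))
print(anova(fit_b))

ss_a <- setNames(anova(fit_a)[["Sum Sq"]], rownames(anova(fit_a)))
ss_b <- setNames(anova(fit_b)[["Sum Sq"]], rownames(anova(fit_b)))

# drop1() tests each term given all the others, so term order is irrelevant.
print(drop1(fit_a, test = "F"))

# step() stores the sequence of terms it dropped or added in $anova.
sel <- step(fit_a, trace = 0)
print(sel$anova)
```

Sorting Pr(>F) from a sequential table would therefore not recover the order
of selection; the $anova component (or the trace printed by step() when
trace is left at its default) is where that order appears.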
*all-subsets regression*
> summary(tonall)$cp
[1] 37.338138 31.452375 25.300272 17.965950 13.751043 10.021910  8.455827
     8.810498  9.273599 11.000000
> summary(tonall)$adjr2
[1] 0.2155976 0.2411378 0.2680048 0.2997661 0.3197652 0.3381030 0.3481410
    0.3506880 0.3528339 0.3499369
> summary(tonall)$which[1,][summary(tonall)$which[1,]]
(Intercept), DistMarche
> summary(tonall)$which[2,][summary(tonall)$which[2,]]
(Intercept), NbCentresTriDs100km, DistMarche
> summary(tonall)$which[3,][summary(tonall)$which[3,]]
(Intercept), Tonnage, UO.MAMR.Precis, NbCentresTriDs100km
> summary(tonall)$which[4,][summary(tonall)$which[4,]]
(Intercept), Tonnage, DensiteOcc, UO.MAMR.Precis, NbCentresTriDs100km
> summary(tonall)$which[5,][summary(tonall)$which[5,]]
(Intercept), Tonnage, DensiteOcc, UO.MAMR.Precis, NbCentresTriDs100km,
DistMarche
(omitting the remaining values up to which[10,])
Questions with respect to lm and step: all-subsets says DistMarche is the
single most important variable. That makes some sense, as that variable had a
two-star rating in both lm and step. But shouldn't the three-star variables in
step be close to those picked by all-subsets? For example, the best
three-variable model shows Tonnage popping up. Tonnage has no rating in lm and
doesn't even appear in step. Is it a matter of step being initialized with a
variable such that Tonnage will never be considered, whereas it is considered
in an exhaustive all-subsets regression? I'm puzzled.
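For reading the all-subsets output, the variables in each model size can be
listed in one call instead of indexing $which row by row, and Cp and adjusted
R-squared can be tabulated side by side. This is only a sketch on the built-in
mtcars data (a stand-in for the data in this post), guarded so that it runs
only when the leaps package is installed.

```r
# Stand-in data: built-in mtcars, NOT the data set from this post.
if (requireNamespace("leaps", quietly = TRUE)) {
  all_fit <- leaps::regsubsets(mpg ~ wt + hp + disp + qsec + drat,
                               data = mtcars, nvmax = 5)
  s <- summary(all_fit)

  # One string per model size: names of the columns flagged TRUE in $which.
  vars_by_size <- apply(s$which, 1,
                        function(r) paste(names(r)[r], collapse = ", "))
  print(vars_by_size)

  # Selection criteria, one row per model size.
  print(data.frame(size = seq_along(s$cp), cp = s$cp, adjr2 = s$adjr2))

  # Model size favoured by each criterion (the two need not agree).
  print(c(cp = which.min(s$cp), adjr2 = which.max(s$adjr2)))
}
```

Because each model size is searched independently, the best model of size 3
need not contain the best model of size 2, which is one way a variable can
appear at one size and not at another.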
*lasso*
> summary(tonlasso)
...
Coefficients:
                       Value Std. Error Z score Pr(>|Z|)
(Intercept)         234.5910    36.0120  6.5142   0.0000
Tonnage              -0.0155     0.0065 -2.3736   0.0176
Popul                -0.0001     0.0008 -0.1639   0.8698
DensiteOcc           -0.0719     0.0201 -3.5732   0.0004
NbmRues.hab          -0.0130     0.0110 -1.1820   0.2372
RFU.hab               0.0005     0.0002  2.6586   0.0078
UO.MAMR.Precis        0.0025     0.0013  1.9025   0.0571
t.CS.t.déchets.MAMR -25.9682    51.2912 -0.5063   0.6127
Pct.Pot.CS.hab      -69.0914    35.9212 -1.9234   0.0544
NbCentresTriDs100km  -5.9783     2.3811 -2.5108   0.0120
DistMarche            0.2210     0.0805  2.7461   0.0060
I read that the lasso effectively performs variable selection by setting some
coefficients to 0. With the figures above, can I set Values below 0.1 or 0.01
to 0, which would eliminate a number of variates? Does sorting the (absolute)
coefficient values in decreasing order give the order in which the lasso
selects the variables?
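One caveat when comparing raw coefficient values: their size depends on the
units of each predictor, so a fixed cutoff such as 0.1 treats variables
measured on different scales differently. The sketch below shows this with
plain lm() on the built-in mtcars data. Both are stand-ins: this is not
l1ce() and not the data from this post.

```r
# Stand-in data and model: mtcars with lm(), NOT l1ce() on the post's data.
fit_raw <- lm(mpg ~ wt + hp, data = mtcars)       # wt in units of 1000 lb

# Same model with weight expressed in lb instead of 1000 lb.
cars_lb <- transform(mtcars, wt_lb = wt * 1000)
fit_lb  <- lm(mpg ~ wt_lb + hp, data = cars_lb)

print(coef(fit_raw)["wt"])       # slope per 1000 lb
print(coef(fit_lb)["wt_lb"])     # slope per lb: 1000 times smaller

# The t statistic is unchanged: only the unit of the coefficient moved.
t_raw <- summary(fit_raw)$coefficients["wt", "t value"]
t_lb  <- summary(fit_lb)$coefficients["wt_lb", "t value"]
print(c(t_raw = t_raw, t_lb = t_lb))

# Standardising predictors puts the coefficients on a comparable scale.
fit_std <- lm(mpg ~ scale(wt) + scale(hp), data = mtcars)
print(coef(fit_std))
```

In the table above, for instance, a small Value for a predictor measured in
large units (such as a population count) does not by itself indicate a weak
effect.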
Thank you for your patience and for pointers.
Yves Moisan