I'm using the package 'lars' in R with the following code:
> library(lars)
> set.seed(3)
> n <- 1000
> x1 <- rnorm(n)
> x2 <- x1+rnorm(n)*0.5
> x3 <- rnorm(n)
> x4 <- rnorm(n)
> x5 <- rexp(n)
> y <- 5*x1 + 4*x2 + 2*x3 + 7*x4 + rnorm(n)
> x <- cbind(x1,x2,x3,x4,x5)
> cor(cbind(y,x))
            y          x1           x2           x3          x4          x5
y  1.00000000  0.74678534  0.743536093  0.210757777  0.59218321  0.03943133
x1 0.74678534  1.00000000  0.892113559  0.015302566 -0.03040464  0.04952222
x2 0.74353609  0.89211356  1.000000000 -0.003146131 -0.02172854  0.05703270
x3 0.21075778  0.01530257 -0.003146131  1.000000000  0.05437726  0.01449142
x4 0.59218321 -0.03040464 -0.021728535  0.054377256  1.00000000 -0.02166716
x5 0.03943133  0.04952222  0.057032700  0.014491422 -0.02166716  1.00000000
> m <- lars(x,y,"step",trace=T)
Forward Stepwise sequence
Computing X'X .....
LARS Step 1 :    Variable 1     added
LARS Step 2 :    Variable 4     added
LARS Step 3 :    Variable 3     added
LARS Step 4 :    Variable 2     added
LARS Step 5 :    Variable 5     added
Computing residuals, RSS etc .....
I've got a dataset with 5 continuous predictors and I'm trying to fit a model
to a single (dependent) variable y. Two of my predictors (x1, x2) are highly
correlated with each other.
As you can see in the above example, the lars function with the 'stepwise'
option first chooses the variable that is most correlated with y. The next
variable to enter the model is the one that is most correlated with the
residuals. Indeed, it is x4:
> round((cor(cbind(resid(lm(y~x1)),x))[1,3:6]),4)
    x2     x3     x4     x5 
0.1163 0.2997 0.9246 0.0037  
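To make the rule I am describing concrete, here is a minimal base-R sketch of it. It is only my reading of the 'stepwise' behaviour, not the internals of lars(): at each step it picks the predictor with the largest absolute correlation with the current residuals, then refits ordinary least squares on all predictors chosen so far. The helper name greedy_order is made up for this illustration.

```r
# Sketch only: re-implements the selection rule described above in base R.
# It does not call lars(); `greedy_order` is a hypothetical helper name.
set.seed(3)
n  <- 1000
x1 <- rnorm(n)
x2 <- x1 + rnorm(n) * 0.5
x3 <- rnorm(n)
x4 <- rnorm(n)
x5 <- rexp(n)
y  <- 5*x1 + 4*x2 + 2*x3 + 7*x4 + rnorm(n)
x  <- cbind(x1, x2, x3, x4, x5)

greedy_order <- function(x, y) {
  active <- integer(0)          # indices of predictors already in the model
  res    <- y - mean(y)         # residuals of the intercept-only model
  for (step in seq_len(ncol(x))) {
    r <- abs(cor(x, res))[, 1]  # |correlation| of each predictor with residuals
    r[active] <- -Inf           # never re-select an active predictor
    active <- c(active, which.max(r))
    # refit OLS on the active set and update the residuals
    res <- resid(lm(y ~ x[, active, drop = FALSE]))
  }
  unname(active)
}

ord <- greedy_order(x, y)
print(colnames(x)[ord])         # compare with the order in the lars trace above
```

If this reading is right, the printed order should agree with the "Variable ... added" lines of the stepwise trace.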
Now, if I use the 'lasso' option:
> m <- lars(x,y,"lasso",trace=T)
LASSO sequence
Computing X'X ....
LARS Step 1 :    Variable 1     added
LARS Step 2 :    Variable 2     added
LARS Step 3 :    Variable 4     added
LARS Step 4 :    Variable 3     added
LARS Step 5 :    Variable 5     added
It adds both of the correlated variables to the model in the first two
steps. This is the opposite of what I have read in several papers. Most of them
say that if there is a group of variables among which the correlations are
very high, then the 'lasso' tends to select only one variable from the group
at random.
Can someone provide an example of this behavior, or explain why my
variables x1 and x2 are added to the model one after another (together)?
--
View this message in context:
http://r.789695.n4.nabble.com/Selecting-correlated-predictors-with-LASSO-tp4633586.html
Sent from the R help mailing list archive at Nabble.com.