Hi,
I am new to model selection by coefficient shrinkage
methods such as the lasso, and I am particularly
interested in variable selection in Cox regression with
the lasso. I learned that coxpath() in the R package
glmpath fits the lasso for the Cox model. I have tried
the sample script on the help page of coxpath(), but I
have a difficult time understanding the output.
I would therefore greatly appreciate it if anyone could
help me understand how to use the function.
> data(lung.data)
> attach(lung.data)
> fit.a <- coxpath(lung.data)
> print(fit.a)
Call:
coxpath(data = lung.data)
Step 1 : karno
Step 2 : celltype
Step 5 : trt
Step 6 : prior
Step 7 : age
Step 8 : diagtime
> summary(fit.a)
Call:
coxpath(data = lung.data)
Df Log.p.lik AIC BIC
Step 1 0 -505.8840 1011.7679 1011.7679
Step 2 1 -486.0691 974.1382 977.0581
Step 5 2 -484.8520 973.7040 979.5440
Step 6 3 -483.4018 972.8036 981.5636
Step 7 4 -483.3801 974.7602 986.4401
Step 8 5 -483.2287 976.4573 991.0572
Step 9 6 -483.1112 978.2224 995.7423
First of all, why does the number of steps differ
between the two outputs above? print(fit.a) lists six
steps, while summary(fit.a) lists seven rows (it has an
extra "Step 9"). I confirmed with coxph() that the
numbers (Log.p.lik, AIC, BIC) on the first row of
summary(fit.a) come from a null Cox model, i.e. a model
with no covariates. How, then, does "Step 1" in the
output of summary(fit.a) correspond to "Step 1" in the
output of print(fit.a), where it seems to mean a model
containing the variable "karno"?
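As a sanity check on the table, the AIC and BIC columns can be recomputed from the Df and Log.p.lik columns. This is only a sketch using the textbook definitions AIC = -2 logL + 2 Df and BIC = -2 logL + log(n) Df. The sample size n = 137 is an assumption (it is not printed in the output above, but it is the value that makes the BIC column match), and I have not checked that glmpath computes the criteria exactly this way.

```r
# Values copied by hand from the summary(fit.a) table above
step   <- c(1, 2, 5, 6, 7, 8, 9)
df     <- c(0, 1, 2, 3, 4, 5, 6)
loglik <- c(-505.8840, -486.0691, -484.8520, -483.4018,
            -483.3801, -483.2287, -483.1112)

# Assumption: number of observations in lung.data (not shown in the output)
n <- 137

# Textbook definitions of the two criteria
aic <- -2 * loglik + 2 * df
bic <- -2 * loglik + log(n) * df

print(data.frame(step, df, loglik, aic, bic))

# Steps with the smallest AIC and BIC among the rows listed
best_aic_step <- step[which.min(aic)]
best_bic_step <- step[which.min(bic)]
```

With Df = 0 on the first row, both criteria reduce to -2 logL, which is consistent with that row describing a model without covariates.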
> predict(fit.a)
     trt celltype   karno  diagtime        age      prior
1 0.0000   0.0000  0.0000 0.000e+00  0.000e+00  0.000e+00
2 0.0000   0.0076 -0.0256 0.000e+00  0.000e+00  0.000e+00
5 0.0000   0.0450 -0.0286 0.000e+00  0.000e+00  0.000e+00
6 0.1428   0.1033 -0.0330 0.000e+00  0.000e+00 -4.326e-05
7 0.1468   0.1048 -0.0332 0.000e+00 -1.043e-07 -3.506e-04
8 0.1755   0.1139 -0.0340 5.642e-06 -1.404e-03 -2.367e-03
attr(,"s")
[1] 1 2 5 6 7 8
attr(,"fraction")
[1] 0.000 0.125 0.500 0.625 0.750 0.875
attr(,"mode")
[1] "step"
Second, if I compare the output of print(fit.a) with
that of predict(fit.a), I see some discrepancies. For
example, "Step 1" of print(fit.a) lists the variable
"karno", but predict(fit.a) shows that the coefficient
of "karno" is still 0 at step 1. The same is true of
the variable "trt" at "Step 5". What do these
discrepancies mean? I have probably misunderstood what
coefficient shrinkage means in the first place, so I
would appreciate it if anyone could shed some light on
this.
I would also like to hear opinions on how I should do
variable selection from this output. Should I rely on
the table (Log.p.lik, AIC, BIC) from summary(fit.a), or
should I rely on the coefficient table from
predict(fit.a) and eliminate the variables whose
coefficients are 0 at a certain step?
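To make the second option concrete, here is a small sketch that lists the variables with nonzero coefficients at each step. The matrix is typed in by hand from the predict(fit.a) output above rather than taken from the fitted object, and the names coefs and active are only illustrative.

```r
# Coefficients copied by hand from the predict(fit.a) output above;
# row names are the step numbers reported in attr(,"s")
coefs <- rbind(
  "1" = c(0,      0,      0,       0,         0,          0),
  "2" = c(0,      0.0076, -0.0256, 0,         0,          0),
  "5" = c(0,      0.0450, -0.0286, 0,         0,          0),
  "6" = c(0.1428, 0.1033, -0.0330, 0,         0,          -4.326e-05),
  "7" = c(0.1468, 0.1048, -0.0332, 0,         -1.043e-07, -3.506e-04),
  "8" = c(0.1755, 0.1139, -0.0340, 5.642e-06, -1.404e-03, -2.367e-03)
)
colnames(coefs) <- c("trt", "celltype", "karno", "diagtime", "age", "prior")

# For each step, keep the names of the variables whose coefficient is not 0
active <- lapply(rownames(coefs), function(s) {
  colnames(coefs)[coefs[s, ] != 0]
})
names(active) <- paste("step", rownames(coefs))

print(active)
```

Read this way, each variable named at a step in print(fit.a) first shows a nonzero coefficient at the next listed step, which may be part of what I am misreading.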
Thank you very much for your time.