E Joffe
2013-Sep-28 09:39 UTC
[R] Interperting results of glmnet and coxph plot, Brier score and Harrel's C-Index - am I doing something wrong ???
Hi all,
I am using COX LASSO (glmnet / coxnet) regression to analyze a dataset of
394 obs. / 268 vars.
I use the following procedure:
1. Construct a coxnet on the entire dataset (by cv.glmnet)
2. Pick the significant features by selecting the non-zero coefficient
under the best lambda selected by the model
3. Build a coxph model with bi-directional stepwise feature selection
limited to the coxnet selected features.
To validate the model I use both Brier score (library=peperr) and Harrel's
C-Index (library=Hmisc) with a bootstrap of 50 iterations.
MY QUESTION : I am getting fairly good C-Index (0.76) and Brier (0.07)
values for the models however per the coxnet the %Dev explained by the model
is at best 0.27 and when I plot the survfit of the coxph the plotted
confidence interval is very large.
What am I missing here ?
%DEV=27%
Brier score - 0.07 ($ipec.coxglmnet -> [1] 7.24)
C-Index - 0.76 ($cIndex -> [1] 0.763)
DATA: [Private Health Information - can't publish] 394 obs./268 vars.
CODE (need to define a dataset with 'time' and 'status'
variables):
library("survival")
library ("glmnet")
library ("c060")
library ("peperr")
library ("Hmisc")
#creat Y (survival matrix) for glmnet
surv_obj <- Surv(dataset$time,dataset$status)
## tranform categorical variables into binary variables with dummy for
dataset
predict_matrix <- model.matrix(~ ., data=dataset,
contrasts.arg = lapply
(dataset[,sapply(dataset, is.factor)], contrasts))
## remove the statu/time variables from the predictor matrix (x) for
glmnet
predict_matrix <- subset (predict_matrix, select=c(-time,-status))
## create a glmnet cox object using lasso regularization and cross
validation
glmnet.cv <- cv.glmnet (predict_matrix, surv_obj, family="cox")
## get the glmnet model on the full dataset
glmnet.obj <- glmnet.cv$glmnet.fit
# find lambda index for the models with least partial likelihood
deviance (by cv.glmnet)
optimal.lambda <- glmnet.cv$lambda.min # For a more parsimoneous
model use lambda.1se
lambda.index <- which(glmnet.obj$lambda==optimal.lambda)
# take beta for optimal lambda
optimal.beta <- glmnet.obj$beta[,lambda.index]
# find non zero beta coef
nonzero.coef <- abs(optimal.beta)>0
selectedBeta <- optimal.beta[nonzero.coef]
# take only covariates for which beta is not zero
selectedVar <- predict_matrix[,nonzero.coef]
# create a dataframe for trainSet with time, status and selected
variables in binary representation for evaluation in pec
reformat_dataSet <- as.data.frame(cbind(surv_obj,selectedVar))
# glmnet.cox only with meaningful features selected by stepwise
bidirectional AIC feature selection
glmnet.cox.meaningful <- step(coxph(Surv(time,status) ~
.,data=reformat_dataSet),direction="both")
##--------------------------------------------------------------------------
-----------------------------
## MODEL PERFORMANCE
##--------------------------------------------------------------------------
-----------------------------
##
## Calculate the Brier score - pec does its own bootstrap so this
function runs on i=51 (i.e., whole trainset)
## Brier score calculation to cox-glmnet
peperr.coxglmnet <- peperr(response=surv_obj, x=selectedVarCox,
fit.fun=fit.coxph, load.all=TRUE,
indices=resample.indices(n=nrow(surv_obj),
method="boot", sample.n=50))
# Get error predictions from peperr
prederr.coxglmnet <- perr(peperr.coxglmnet)
# Integrated prediction error Brier score calculation
ipec.coxglmnet<-ipec(prederr.coxglmnet,
eval.times=peperr.coxglmnet$attribute, response=surv_obj)
## C-Index calculation 50 iter bootstrapping
for (i in 1:50){
print (paste("Iteration:",i))
train <- sample(1:nrow(dataset), nrow(dataset), replace = TRUE) ##
random sampling with replacement
# create a dataframe for trainSet with time, status and selected
variables in binary representation for evaluation in pec
reformat_trainSet <- reformat_dataSet [train,]
# glmnet.cox only with meaningful features selected by stepwise
bidirectional AIC feature selection
glmnet.cox.meaningful.test <- step(coxph(Surv(time,status) ~
.,data=reformat_dataSet),direction="both")
selectedVarCox <-
predict_matrix[,attr(glmnet.cox.meaningful.test$terms,"term.labels")]
reformat_testSet <- as.data.frame(cbind(surv_obj,selectedVarCox))
reformat_testSet <- reformat_dataSet [-train,]
# compute c-index (Harrell's) for cox-glmnet models
if (is.null(glmnet.cox.meaningful)){
cIndexCoxglmnet <- c(cIndexCoxglmnet,NULL)
}else{
cIndexCoxglmnet <- c(cIndexCoxglmnet,
1-rcorr.cens(predict(glmnet.cox.meaningful,
reformat_testSet),Surv(reformat_testSet$time,reformat_testSet$status))[1])
}
}
#Get average C-Index
cIndex<- mean (unlist(cIndexCoxglmnet),rm.na=TRUE)
#create a list of all the objects generated
assign(name,c(eval(parse(text=name)),glmnet.cv=list(glmnet.cv),glmnet.obj=li
st(glmnet.obj),
selectedVar=list(colnames(selectedVar)),glmnet.cox=list(glmnet.cox),
glmnet.cox.meaningful=list(glmnet.cox.meaningful),ipec.coxglmnet=list(ipec.c
oxglmnet),
cIndex=cIndex))
# save image of the workspace after each iteration
save.image("final_subgroup_analysissubgroup_analysis.RData")
Bert Gunter
2013-Sep-28 14:28 UTC
[R] Interperting results of glmnet and coxph plot, Brier score and Harrel's C-Index - am I doing something wrong ???
This appears to be a statistics, not an R-help question, so should probably be asked on a statistics list, not here (e.g. stats.stackexchange.com). But if I understand your issue correctly, perhaps the heart f the matter is: why do you think a stable fit must explain a lot of the variation? Feel free to ignore if I I'm wrong or discuss further on the statistics list. Cheers, Bert On Sat, Sep 28, 2013 at 2:39 AM, E Joffe <ejoffe at hotmail.com> wrote:> Hi all, > > I am using COX LASSO (glmnet / coxnet) regression to analyze a dataset of > 394 obs. / 268 vars. > I use the following procedure: > 1. Construct a coxnet on the entire dataset (by cv.glmnet) > 2. Pick the significant features by selecting the non-zero coefficient > under the best lambda selected by the model > 3. Build a coxph model with bi-directional stepwise feature selection > limited to the coxnet selected features. > > To validate the model I use both Brier score (library=peperr) and Harrel's > C-Index (library=Hmisc) with a bootstrap of 50 iterations. > > > MY QUESTION : I am getting fairly good C-Index (0.76) and Brier (0.07) > values for the models however per the coxnet the %Dev explained by the model > is at best 0.27 and when I plot the survfit of the coxph the plotted > confidence interval is very large. > What am I missing here ? > > %DEV=27% > > > > Brier score - 0.07 ($ipec.coxglmnet -> [1] 7.24) > C-Index - 0.76 ($cIndex -> [1] 0.763) > > > > DATA: [Private Health Information - can't publish] 394 obs./268 vars. > > CODE (need to define a dataset with 'time' and 'status' variables): > > library("survival") > library ("glmnet") > library ("c060") > library ("peperr") > library ("Hmisc") > > #creat Y (survival matrix) for glmnet > surv_obj <- Surv(dataset$time,dataset$status) > > > ## tranform categorical variables into binary variables with dummy for > dataset > predict_matrix <- model.matrix(~ ., data=dataset, > contrasts.arg = lapply > (dataset[,sapply(dataset, is.factor)], contrasts)) > > ## remove the statu/time variables from the predictor matrix (x) for > glmnet > predict_matrix <- subset (predict_matrix, select=c(-time,-status)) > > ## create a glmnet cox object using lasso regularization and cross > validation > glmnet.cv <- cv.glmnet (predict_matrix, surv_obj, family="cox") > > ## get the glmnet model on the full dataset > glmnet.obj <- glmnet.cv$glmnet.fit > > # find lambda index for the models with least partial likelihood > deviance (by cv.glmnet) > optimal.lambda <- glmnet.cv$lambda.min # For a more parsimoneous > model use lambda.1se > lambda.index <- which(glmnet.obj$lambda==optimal.lambda) > > > # take beta for optimal lambda > optimal.beta <- glmnet.obj$beta[,lambda.index] > > # find non zero beta coef > nonzero.coef <- abs(optimal.beta)>0 > selectedBeta <- optimal.beta[nonzero.coef] > > # take only covariates for which beta is not zero > selectedVar <- predict_matrix[,nonzero.coef] > > # create a dataframe for trainSet with time, status and selected > variables in binary representation for evaluation in pec > reformat_dataSet <- as.data.frame(cbind(surv_obj,selectedVar)) > > # glmnet.cox only with meaningful features selected by stepwise > bidirectional AIC feature selection > glmnet.cox.meaningful <- step(coxph(Surv(time,status) ~ > .,data=reformat_dataSet),direction="both") > > > > > ##-------------------------------------------------------------------------- > ----------------------------- > ## MODEL PERFORMANCE > > ##-------------------------------------------------------------------------- > ----------------------------- > ## > > > ## Calculate the Brier score - pec does its own bootstrap so this > function runs on i=51 (i.e., whole trainset) > > ## Brier score calculation to cox-glmnet > peperr.coxglmnet <- peperr(response=surv_obj, x=selectedVarCox, > fit.fun=fit.coxph, load.all=TRUE, > indices=resample.indices(n=nrow(surv_obj), > method="boot", sample.n=50)) > > # Get error predictions from peperr > prederr.coxglmnet <- perr(peperr.coxglmnet) > > # Integrated prediction error Brier score calculation > ipec.coxglmnet<-ipec(prederr.coxglmnet, > eval.times=peperr.coxglmnet$attribute, response=surv_obj) > > > ## C-Index calculation 50 iter bootstrapping > > for (i in 1:50){ > print (paste("Iteration:",i)) > train <- sample(1:nrow(dataset), nrow(dataset), replace = TRUE) ## > random sampling with replacement > # create a dataframe for trainSet with time, status and selected > variables in binary representation for evaluation in pec > reformat_trainSet <- reformat_dataSet [train,] > > > # glmnet.cox only with meaningful features selected by stepwise > bidirectional AIC feature selection > glmnet.cox.meaningful.test <- step(coxph(Surv(time,status) ~ > .,data=reformat_dataSet),direction="both") > > selectedVarCox <- > predict_matrix[,attr(glmnet.cox.meaningful.test$terms,"term.labels")] > reformat_testSet <- as.data.frame(cbind(surv_obj,selectedVarCox)) > reformat_testSet <- reformat_dataSet [-train,] > > > # compute c-index (Harrell's) for cox-glmnet models > if (is.null(glmnet.cox.meaningful)){ > cIndexCoxglmnet <- c(cIndexCoxglmnet,NULL) > }else{ > cIndexCoxglmnet <- c(cIndexCoxglmnet, > 1-rcorr.cens(predict(glmnet.cox.meaningful, > reformat_testSet),Surv(reformat_testSet$time,reformat_testSet$status))[1]) > } > } > > #Get average C-Index > cIndex<- mean (unlist(cIndexCoxglmnet),rm.na=TRUE) > > #create a list of all the objects generated > > assign(name,c(eval(parse(text=name)),glmnet.cv=list(glmnet.cv),glmnet.obj=li > st(glmnet.obj), > > selectedVar=list(colnames(selectedVar)),glmnet.cox=list(glmnet.cox), > > glmnet.cox.meaningful=list(glmnet.cox.meaningful),ipec.coxglmnet=list(ipec.c > oxglmnet), > cIndex=cIndex)) > > # save image of the workspace after each iteration > save.image("final_subgroup_analysissubgroup_analysis.RData") > > > > ______________________________________________ > R-help at r-project.org mailing list > https://stat.ethz.ch/mailman/listinfo/r-help > PLEASE do read the posting guide http://www.R-project.org/posting-guide.html > and provide commented, minimal, self-contained, reproducible code. >-- Bert Gunter Genentech Nonclinical Biostatistics Internal Contact Info: Phone: 467-7374 Website: http://pharmadevelopment.roche.com/index/pdb/pdb-functional-groups/pdb-biostatistics/pdb-ncb-home.htm
David Winsemius
2013-Sep-28 14:40 UTC
[R] Interperting results of glmnet and coxph plot, Brier score and Harrel's C-Index - am I doing something wrong ???
On Sep 28, 2013, at 2:39 AM, E Joffe wrote:> Hi all, > > I am using COX LASSO (glmnet / coxnet) regression to analyze a > dataset of > 394 obs. / 268 vars. > I use the following procedure: > 1. Construct a coxnet on the entire dataset (by cv.glmnet) > 2. Pick the significant features by selecting the non-zero coefficient > under the best lambda selected by the model > 3. Build a coxph model with bi-directional stepwise feature selection > limited to the coxnet selected features.I was a bit puzzled by the third step. Once you had a reduced model from glmnet, what was the statistical basis for further elimination of variables? (Quite apart from the statistical issues, I was rather surprised that this procedure even produced results since the 'step' function is not described in the 'stats' package as applying to 'coxph' model objects.)> > To validate the model I use both Brier score (library=peperr) and > Harrel's [Harrell] > C-Index (library=Hmisc) with a bootstrap of 50 iterations. > > > MY QUESTION : I am getting fairly good C-Index (0.76) and Brier > (0.07) > values for the models however per the coxnet the %Dev explained by > the model > is at best 0.27 and when I plot the survfit of the coxph the plotted > confidence interval is very large.How many events did you have? (The width of CI's is most importantly dependent on event counts and not particularly improved by a high case count. The power considerations are very similar to those of a binomial test.)> What am I missing here ?Perhaps sufficient events? (You also seem to be missing a description of the study goals.) -- David.> > %DEV=27% > > > > Brier score - 0.07 ($ipec.coxglmnet -> [1] 7.24) > C-Index - 0.76 ($cIndex -> [1] 0.763) > > > > DATA: [Private Health Information - can't publish] 394 obs./268 vars. > > CODE (need to define a dataset with 'time' and 'status' variables): > > library("survival") > library ("glmnet") > library ("c060") > library ("peperr") > library ("Hmisc") > > #creat Y (survival matrix) for glmnet > surv_obj <- Surv(dataset$time,dataset$status) > > > ## tranform categorical variables into binary variables with > dummy for > dataset > predict_matrix <- model.matrix(~ ., data=dataset, > contrasts.arg = lapply > (dataset[,sapply(dataset, is.factor)], contrasts)) > > ## remove the statu/time variables from the predictor matrix (x) > for > glmnet > predict_matrix <- subset (predict_matrix, select=c(-time,-status)) > > ## create a glmnet cox object using lasso regularization and cross > validation > glmnet.cv <- cv.glmnet (predict_matrix, surv_obj, family="cox") > > ## get the glmnet model on the full dataset > glmnet.obj <- glmnet.cv$glmnet.fit > > # find lambda index for the models with least partial likelihood > deviance (by cv.glmnet) > optimal.lambda <- glmnet.cv$lambda.min # For a more parsimoneous > model use lambda.1se > lambda.index <- which(glmnet.obj$lambda==optimal.lambda) > > > # take beta for optimal lambda > optimal.beta <- glmnet.obj$beta[,lambda.index] > > # find non zero beta coef > nonzero.coef <- abs(optimal.beta)>0 > selectedBeta <- optimal.beta[nonzero.coef] > > # take only covariates for which beta is not zero > selectedVar <- predict_matrix[,nonzero.coef] > > # create a dataframe for trainSet with time, status and selected > variables in binary representation for evaluation in pec > reformat_dataSet <- as.data.frame(cbind(surv_obj,selectedVar)) > > # glmnet.cox only with meaningful features selected by stepwise > bidirectional AIC feature selection > glmnet.cox.meaningful <- step(coxph(Surv(time,status) ~ > .,data=reformat_dataSet),direction="both") > > > > > ##-------------------------------------------------------------------------- > ----------------------------- > ## MODEL PERFORMANCE > > ##-------------------------------------------------------------------------- > ----------------------------- > ## > > > ## Calculate the Brier score - pec does its own bootstrap so this > function runs on i=51 (i.e., whole trainset) > > ## Brier score calculation to cox-glmnet > peperr.coxglmnet <- peperr(response=surv_obj, x=selectedVarCox, > fit.fun=fit.coxph, load.all=TRUE, > > indices=resample.indices(n=nrow(surv_obj), > method="boot", sample.n=50)) > > # Get error predictions from peperr > prederr.coxglmnet <- perr(peperr.coxglmnet) > > # Integrated prediction error Brier score calculation > ipec.coxglmnet<-ipec(prederr.coxglmnet, > eval.times=peperr.coxglmnet$attribute, response=surv_obj) > > > ## C-Index calculation 50 iter bootstrapping > > for (i in 1:50){ > print (paste("Iteration:",i)) > train <- sample(1:nrow(dataset), nrow(dataset), replace = > TRUE) ## > random sampling with replacement > # create a dataframe for trainSet with time, status and > selected > variables in binary representation for evaluation in pec > reformat_trainSet <- reformat_dataSet [train,] > > > # glmnet.cox only with meaningful features selected by stepwise > bidirectional AIC feature selection > glmnet.cox.meaningful.test <- step(coxph(Surv(time,status) ~ > .,data=reformat_dataSet),direction="both") > > selectedVarCox <- > predict_matrix[,attr(glmnet.cox.meaningful.test$terms,"term.labels")] > reformat_testSet <- > as.data.frame(cbind(surv_obj,selectedVarCox)) > reformat_testSet <- reformat_dataSet [-train,] > > > # compute c-index (Harrell's) for cox-glmnet models > if (is.null(glmnet.cox.meaningful)){ > cIndexCoxglmnet <- c(cIndexCoxglmnet,NULL) > }else{ > cIndexCoxglmnet <- c(cIndexCoxglmnet, > 1-rcorr.cens(predict(glmnet.cox.meaningful, > reformat_testSet),Surv(reformat_testSet$time,reformat_testSet > $status))[1]) > } > } > > #Get average C-Index > cIndex<- mean (unlist(cIndexCoxglmnet),rm.na=TRUE) > > #create a list of all the objects generated > > assign > (name,c(eval(parse(text=name)),glmnet.cv=list(glmnet.cv),glmnet.obj=li > st(glmnet.obj), > > selectedVar=list(colnames(selectedVar)),glmnet.cox=list(glmnet.cox), > > glmnet > .cox.meaningful=list(glmnet.cox.meaningful),ipec.coxglmnet=list(ipec.c > oxglmnet), > cIndex=cIndex)) > > # save image of the workspace after each iteration > save.image("final_subgroup_analysissubgroup_analysis.RData") > > > ______________________________________________ > R-help at r-project.org mailing list > https://stat.ethz.ch/mailman/listinfo/r-help > PLEASE do read the posting guide http://www.R-project.org/posting-guide.html > and provide commented, minimal, self-contained, reproducible code.David Winsemius, MD Alameda, CA, USA
Terry Therneau
2013-Sep-30 14:30 UTC
[R] Interperting results of glmnet and coxph plot, Brier score and Harrel's C-Index - am I doing something wrong ???
To elaborate on Frank's response, the analysis plan of 1. Look at the data and select "important" variables 2. Put that truncated list into your favorite statistic procedure 3. Ask - are the p-values (c-statistic, coefficients, .....) reliable? is a very old plan. The answer to the last question is always "NO". The traditional step 1 involved graphs and t-tests, but replacing it by something modern does not change the overall outcome. Terry T