thr3ads.net - R help - [R] Interperting results of glmnet and coxph plot, Brier score and Harrel's C-Index - am I doing something wrong ??? [Sep 2013]

If this information is useful, please help other people find it:
Share via:

E Joffe

2013-Sep-28 09:39 UTC

[R] Interperting results of glmnet and coxph plot, Brier score and Harrel's C-Index - am I doing something wrong ???

Hi all,

I am using COX LASSO (glmnet / coxnet) regression to analyze a dataset of
394 obs. / 268 vars.
I use the following procedure:
1.	Construct a coxnet on the entire dataset (by cv.glmnet) 
2.	Pick the significant features by selecting the non-zero coefficient
under the best lambda selected by the model 
3.	Build a coxph model with bi-directional stepwise feature selection
limited to the coxnet selected features.
	
To validate the model I use both Brier score (library=peperr) and Harrel's
C-Index (library=Hmisc) with a bootstrap of 50 iterations.


MY QUESTION :  I am getting fairly good C-Index (0.76) and Brier (0.07)
values for the models however per the coxnet the %Dev explained by the model
is at best 0.27 and when I plot the survfit of the coxph the plotted
confidence interval is very large. 
What am I missing here ?

%DEV=27%
 


Brier score - 0.07  ($ipec.coxglmnet -> [1] 7.24)
C-Index - 0.76 ($cIndex -> [1] 0.763)



DATA: [Private Health Information - can't publish] 394 obs./268 vars.

CODE (need to define a dataset with 'time' and 'status'
variables):

library("survival")
library ("glmnet")
library ("c060")
library ("peperr")
library ("Hmisc")

    #creat Y (survival matrix) for glmnet
    surv_obj <- Surv(dataset$time,dataset$status) 
    
      
    ## tranform categorical variables into binary variables with dummy for
dataset
    predict_matrix <- model.matrix(~ ., data=dataset, 
                                   contrasts.arg = lapply
(dataset[,sapply(dataset, is.factor)], contrasts))
    
    ## remove the statu/time variables from the predictor matrix (x) for
glmnet
    predict_matrix <- subset (predict_matrix, select=c(-time,-status))
    
    ## create a glmnet cox object using lasso regularization and cross
validation
    glmnet.cv <- cv.glmnet (predict_matrix, surv_obj, family="cox")
    
    ## get the glmnet model on the full dataset
    glmnet.obj <- glmnet.cv$glmnet.fit
    
    # find lambda index for the models with least partial likelihood
deviance (by cv.glmnet) 
    optimal.lambda <- glmnet.cv$lambda.min    # For a more parsimoneous
model use lambda.1se     
    lambda.index <- which(glmnet.obj$lambda==optimal.lambda) 
    
    
    # take beta for optimal lambda 
    optimal.beta  <- glmnet.obj$beta[,lambda.index] 
    
    # find non zero beta coef 
    nonzero.coef <- abs(optimal.beta)>0 
    selectedBeta <- optimal.beta[nonzero.coef] 
    
    # take only covariates for which beta is not zero 
    selectedVar   <- predict_matrix[,nonzero.coef] 
    
    # create a dataframe for trainSet with time, status and selected
variables in binary representation for evaluation in pec
    reformat_dataSet <- as.data.frame(cbind(surv_obj,selectedVar))
    
    # glmnet.cox only with meaningful features selected by stepwise
bidirectional AIC feature selection 
    glmnet.cox.meaningful <- step(coxph(Surv(time,status) ~
.,data=reformat_dataSet),direction="both")
    
         
  
 
##--------------------------------------------------------------------------
-----------------------------
    ##                                    MODEL PERFORMANCE
 
##--------------------------------------------------------------------------
-----------------------------
    ## 
    
  
    ## Calculate the Brier score - pec does its own bootstrap so this
function runs on i=51 (i.e., whole trainset)                         

        ## Brier score calculation to cox-glmnet
        peperr.coxglmnet <- peperr(response=surv_obj, x=selectedVarCox,
                                fit.fun=fit.coxph, load.all=TRUE,
                                indices=resample.indices(n=nrow(surv_obj),
method="boot", sample.n=50))
        
        # Get error predictions from peperr
        prederr.coxglmnet <- perr(peperr.coxglmnet)
        
        # Integrated prediction error Brier score calculation
        ipec.coxglmnet<-ipec(prederr.coxglmnet,
eval.times=peperr.coxglmnet$attribute, response=surv_obj)

    
  ## C-Index calculation 50 iter bootstrapping
  
  for (i in 1:50){
        print (paste("Iteration:",i))
        train <- sample(1:nrow(dataset), nrow(dataset), replace = TRUE) ##
random sampling with replacement
        # create a dataframe for trainSet with time, status and selected
variables in binary representation for evaluation in pec
        reformat_trainSet <- reformat_dataSet [train,]
        
        
        # glmnet.cox only with meaningful features selected by stepwise
bidirectional AIC feature selection 
        glmnet.cox.meaningful.test <- step(coxph(Surv(time,status) ~
.,data=reformat_dataSet),direction="both")
        
        selectedVarCox   <-
predict_matrix[,attr(glmnet.cox.meaningful.test$terms,"term.labels")] 
        reformat_testSet <- as.data.frame(cbind(surv_obj,selectedVarCox))
        reformat_testSet <- reformat_dataSet [-train,]
      
        
#     compute c-index (Harrell's) for cox-glmnet models
        if (is.null(glmnet.cox.meaningful)){
          cIndexCoxglmnet <- c(cIndexCoxglmnet,NULL)
        }else{
          cIndexCoxglmnet <- c(cIndexCoxglmnet,
1-rcorr.cens(predict(glmnet.cox.meaningful,
reformat_testSet),Surv(reformat_testSet$time,reformat_testSet$status))[1])
        }
  }
  
  #Get average C-Index
  cIndex<- mean (unlist(cIndexCoxglmnet),rm.na=TRUE)
  
  #create a list of all the objects generated
 
assign(name,c(eval(parse(text=name)),glmnet.cv=list(glmnet.cv),glmnet.obj=li
st(glmnet.obj),
 
selectedVar=list(colnames(selectedVar)),glmnet.cox=list(glmnet.cox),
 
glmnet.cox.meaningful=list(glmnet.cox.meaningful),ipec.coxglmnet=list(ipec.c
oxglmnet),
                cIndex=cIndex))
    
  # save image of the workspace after each iteration
    save.image("final_subgroup_analysissubgroup_analysis.RData")

Bert Gunter

2013-Sep-28 14:28 UTC

head link

[R] Interperting results of glmnet and coxph plot, Brier score and Harrel's C-Index - am I doing something wrong ???

This appears to be a statistics, not an R-help question, so should
probably be asked on a statistics list, not here (e.g.
stats.stackexchange.com).

But if I understand your issue correctly, perhaps the heart f the
matter is: why do you think a stable fit must explain a lot of the
variation?  Feel free to ignore if I I'm wrong or discuss further on
the statistics list.

Cheers,
Bert

On Sat, Sep 28, 2013 at 2:39 AM, E Joffe <ejoffe at hotmail.com>
wrote:> Hi all,
>
> I am using COX LASSO (glmnet / coxnet) regression to analyze a dataset of
> 394 obs. / 268 vars.
> I use the following procedure:
> 1.      Construct a coxnet on the entire dataset (by cv.glmnet)
> 2.      Pick the significant features by selecting the non-zero coefficient
> under the best lambda selected by the model
> 3.      Build a coxph model with bi-directional stepwise feature selection
> limited to the coxnet selected features.
>
> To validate the model I use both Brier score (library=peperr) and
Harrel's
> C-Index (library=Hmisc) with a bootstrap of 50 iterations.
>
>
> MY QUESTION :  I am getting fairly good C-Index (0.76) and Brier (0.07)
> values for the models however per the coxnet the %Dev explained by the
model
> is at best 0.27 and when I plot the survfit of the coxph the plotted
> confidence interval is very large.
> What am I missing here ?
>
> %DEV=27%
>
>
>
> Brier score - 0.07  ($ipec.coxglmnet -> [1] 7.24)
> C-Index - 0.76 ($cIndex -> [1] 0.763)
>
>
>
> DATA: [Private Health Information - can't publish] 394 obs./268 vars.
>
> CODE (need to define a dataset with 'time' and 'status'
variables):
>
> library("survival")
> library ("glmnet")
> library ("c060")
> library ("peperr")
> library ("Hmisc")
>
>     #creat Y (survival matrix) for glmnet
>     surv_obj <- Surv(dataset$time,dataset$status)
>
>
>     ## tranform categorical variables into binary variables with dummy for
> dataset
>     predict_matrix <- model.matrix(~ ., data=dataset,
>                                    contrasts.arg = lapply
> (dataset[,sapply(dataset, is.factor)], contrasts))
>
>     ## remove the statu/time variables from the predictor matrix (x) for
> glmnet
>     predict_matrix <- subset (predict_matrix, select=c(-time,-status))
>
>     ## create a glmnet cox object using lasso regularization and cross
> validation
>     glmnet.cv <- cv.glmnet (predict_matrix, surv_obj,
family="cox")
>
>     ## get the glmnet model on the full dataset
>     glmnet.obj <- glmnet.cv$glmnet.fit
>
>     # find lambda index for the models with least partial likelihood
> deviance (by cv.glmnet)
>     optimal.lambda <- glmnet.cv$lambda.min    # For a more parsimoneous
> model use lambda.1se
>     lambda.index <- which(glmnet.obj$lambda==optimal.lambda)
>
>
>     # take beta for optimal lambda
>     optimal.beta  <- glmnet.obj$beta[,lambda.index]
>
>     # find non zero beta coef
>     nonzero.coef <- abs(optimal.beta)>0
>     selectedBeta <- optimal.beta[nonzero.coef]
>
>     # take only covariates for which beta is not zero
>     selectedVar   <- predict_matrix[,nonzero.coef]
>
>     # create a dataframe for trainSet with time, status and selected
> variables in binary representation for evaluation in pec
>     reformat_dataSet <- as.data.frame(cbind(surv_obj,selectedVar))
>
>     # glmnet.cox only with meaningful features selected by stepwise
> bidirectional AIC feature selection
>     glmnet.cox.meaningful <- step(coxph(Surv(time,status) ~
> .,data=reformat_dataSet),direction="both")
>
>
>
>
>
##--------------------------------------------------------------------------
> -----------------------------
>     ##                                    MODEL PERFORMANCE
>
>
##--------------------------------------------------------------------------
> -----------------------------
>     ##
>
>
>     ## Calculate the Brier score - pec does its own bootstrap so this
> function runs on i=51 (i.e., whole trainset)
>
>         ## Brier score calculation to cox-glmnet
>         peperr.coxglmnet <- peperr(response=surv_obj, x=selectedVarCox,
>                                 fit.fun=fit.coxph, load.all=TRUE,
>                                 indices=resample.indices(n=nrow(surv_obj),
> method="boot", sample.n=50))
>
>         # Get error predictions from peperr
>         prederr.coxglmnet <- perr(peperr.coxglmnet)
>
>         # Integrated prediction error Brier score calculation
>         ipec.coxglmnet<-ipec(prederr.coxglmnet,
> eval.times=peperr.coxglmnet$attribute, response=surv_obj)
>
>
>   ## C-Index calculation 50 iter bootstrapping
>
>   for (i in 1:50){
>         print (paste("Iteration:",i))
>         train <- sample(1:nrow(dataset), nrow(dataset), replace = TRUE)
##
> random sampling with replacement
>         # create a dataframe for trainSet with time, status and selected
> variables in binary representation for evaluation in pec
>         reformat_trainSet <- reformat_dataSet [train,]
>
>
>         # glmnet.cox only with meaningful features selected by stepwise
> bidirectional AIC feature selection
>         glmnet.cox.meaningful.test <- step(coxph(Surv(time,status) ~
> .,data=reformat_dataSet),direction="both")
>
>         selectedVarCox   <-
>
predict_matrix[,attr(glmnet.cox.meaningful.test$terms,"term.labels")]
>         reformat_testSet <-
as.data.frame(cbind(surv_obj,selectedVarCox))
>         reformat_testSet <- reformat_dataSet [-train,]
>
>
> #     compute c-index (Harrell's) for cox-glmnet models
>         if (is.null(glmnet.cox.meaningful)){
>           cIndexCoxglmnet <- c(cIndexCoxglmnet,NULL)
>         }else{
>           cIndexCoxglmnet <- c(cIndexCoxglmnet,
> 1-rcorr.cens(predict(glmnet.cox.meaningful,
> reformat_testSet),Surv(reformat_testSet$time,reformat_testSet$status))[1])
>         }
>   }
>
>   #Get average C-Index
>   cIndex<- mean (unlist(cIndexCoxglmnet),rm.na=TRUE)
>
>   #create a list of all the objects generated
>
>
assign(name,c(eval(parse(text=name)),glmnet.cv=list(glmnet.cv),glmnet.obj=li
> st(glmnet.obj),
>
> selectedVar=list(colnames(selectedVar)),glmnet.cox=list(glmnet.cox),
>
>
glmnet.cox.meaningful=list(glmnet.cox.meaningful),ipec.coxglmnet=list(ipec.c
> oxglmnet),
>                 cIndex=cIndex))
>
>   # save image of the workspace after each iteration
>     save.image("final_subgroup_analysissubgroup_analysis.RData")
>
>
>
> ______________________________________________
> R-help at r-project.org mailing list
> https://stat.ethz.ch/mailman/listinfo/r-help
> PLEASE do read the posting guide
http://www.R-project.org/posting-guide.html
> and provide commented, minimal, self-contained, reproducible code.
>


-- 

Bert Gunter
Genentech Nonclinical Biostatistics

Internal Contact Info:
Phone: 467-7374
Website:
http://pharmadevelopment.roche.com/index/pdb/pdb-functional-groups/pdb-biostatistics/pdb-ncb-home.htm

David Winsemius

2013-Sep-28 14:40 UTC

head link

[R] Interperting results of glmnet and coxph plot, Brier score and Harrel's C-Index - am I doing something wrong ???

On Sep 28, 2013, at 2:39 AM, E Joffe wrote:
> Hi all,
>
> I am using COX LASSO (glmnet / coxnet) regression to analyze a  
> dataset of
> 394 obs. / 268 vars.
> I use the following procedure:
> 1.	Construct a coxnet on the entire dataset (by cv.glmnet)
> 2.	Pick the significant features by selecting the non-zero coefficient
> under the best lambda selected by the model
> 3.	Build a coxph model with bi-directional stepwise feature selection
> limited to the coxnet selected features.
I was a bit puzzled by the third step. Once you had a reduced model  
from glmnet, what was the statistical basis for further elimination of  
variables?

(Quite apart from the statistical issues, I was rather surprised that  
this procedure even produced results since the 'step' function is not  
described in the 'stats' package as applying to 'coxph' model
objects.)
> 	
> To validate the model I use both Brier score (library=peperr) and  
> Harrel's [Harrell]
> C-Index (library=Hmisc) with a bootstrap of 50 iterations.
>
>
> MY QUESTION :  I am getting fairly good C-Index (0.76) and Brier  
> (0.07)
> values for the models however per the coxnet the %Dev explained by  
> the model
> is at best 0.27 and when I plot the survfit of the coxph the plotted
> confidence interval is very large.
How many events did you have?  (The width of CI's is most importantly  
dependent on event counts and not particularly improved by a high case  
count. The power considerations are very similar to those of a  
binomial test.)

> What am I missing here ?
Perhaps sufficient events? (You also seem to be missing a description  
of the study goals.)


-- 
David.
>
> %DEV=27%
>
>
>
> Brier score - 0.07  ($ipec.coxglmnet -> [1] 7.24)
> C-Index - 0.76 ($cIndex -> [1] 0.763)
>
>
>
> DATA: [Private Health Information - can't publish] 394 obs./268 vars.
>
> CODE (need to define a dataset with 'time' and 'status'
variables):
>
> library("survival")
> library ("glmnet")
> library ("c060")
> library ("peperr")
> library ("Hmisc")
>
>    #creat Y (survival matrix) for glmnet
>    surv_obj <- Surv(dataset$time,dataset$status)
>
>
>    ## tranform categorical variables into binary variables with  
> dummy for
> dataset
>    predict_matrix <- model.matrix(~ ., data=dataset,
>                                   contrasts.arg = lapply
> (dataset[,sapply(dataset, is.factor)], contrasts))
>
>    ## remove the statu/time variables from the predictor matrix (x)  
> for
> glmnet
>    predict_matrix <- subset (predict_matrix, select=c(-time,-status))
>
>    ## create a glmnet cox object using lasso regularization and cross
> validation
>    glmnet.cv <- cv.glmnet (predict_matrix, surv_obj,
family="cox")
>
>    ## get the glmnet model on the full dataset
>    glmnet.obj <- glmnet.cv$glmnet.fit
>
>    # find lambda index for the models with least partial likelihood
> deviance (by cv.glmnet)
>    optimal.lambda <- glmnet.cv$lambda.min    # For a more parsimoneous
> model use lambda.1se
>    lambda.index <- which(glmnet.obj$lambda==optimal.lambda)
>
>
>    # take beta for optimal lambda
>    optimal.beta  <- glmnet.obj$beta[,lambda.index]
>
>    # find non zero beta coef
>    nonzero.coef <- abs(optimal.beta)>0
>    selectedBeta <- optimal.beta[nonzero.coef]
>
>    # take only covariates for which beta is not zero
>    selectedVar   <- predict_matrix[,nonzero.coef]
>
>    # create a dataframe for trainSet with time, status and selected
> variables in binary representation for evaluation in pec
>    reformat_dataSet <- as.data.frame(cbind(surv_obj,selectedVar))
>
>    # glmnet.cox only with meaningful features selected by stepwise
> bidirectional AIC feature selection
>    glmnet.cox.meaningful <- step(coxph(Surv(time,status) ~
> .,data=reformat_dataSet),direction="both")
>
>
>
>
>
##--------------------------------------------------------------------------
> -----------------------------
>    ##                                    MODEL PERFORMANCE
>
>
##--------------------------------------------------------------------------
> -----------------------------
>    ##
>
>
>    ## Calculate the Brier score - pec does its own bootstrap so this
> function runs on i=51 (i.e., whole trainset)
>
>        ## Brier score calculation to cox-glmnet
>        peperr.coxglmnet <- peperr(response=surv_obj, x=selectedVarCox,
>                                fit.fun=fit.coxph, load.all=TRUE,
>                                 
> indices=resample.indices(n=nrow(surv_obj),
> method="boot", sample.n=50))
>
>        # Get error predictions from peperr
>        prederr.coxglmnet <- perr(peperr.coxglmnet)
>
>        # Integrated prediction error Brier score calculation
>        ipec.coxglmnet<-ipec(prederr.coxglmnet,
> eval.times=peperr.coxglmnet$attribute, response=surv_obj)
>
>
>  ## C-Index calculation 50 iter bootstrapping
>
>  for (i in 1:50){
>        print (paste("Iteration:",i))
>        train <- sample(1:nrow(dataset), nrow(dataset), replace =  
> TRUE) ##
> random sampling with replacement
>        # create a dataframe for trainSet with time, status and  
> selected
> variables in binary representation for evaluation in pec
>        reformat_trainSet <- reformat_dataSet [train,]
>
>
>        # glmnet.cox only with meaningful features selected by stepwise
> bidirectional AIC feature selection
>        glmnet.cox.meaningful.test <- step(coxph(Surv(time,status) ~
> .,data=reformat_dataSet),direction="both")
>
>        selectedVarCox   <-
>
predict_matrix[,attr(glmnet.cox.meaningful.test$terms,"term.labels")]
>        reformat_testSet <-  
> as.data.frame(cbind(surv_obj,selectedVarCox))
>        reformat_testSet <- reformat_dataSet [-train,]
>
>
> #     compute c-index (Harrell's) for cox-glmnet models
>        if (is.null(glmnet.cox.meaningful)){
>          cIndexCoxglmnet <- c(cIndexCoxglmnet,NULL)
>        }else{
>          cIndexCoxglmnet <- c(cIndexCoxglmnet,
> 1-rcorr.cens(predict(glmnet.cox.meaningful,
> reformat_testSet),Surv(reformat_testSet$time,reformat_testSet 
> $status))[1])
>        }
>  }
>
>  #Get average C-Index
>  cIndex<- mean (unlist(cIndexCoxglmnet),rm.na=TRUE)
>
>  #create a list of all the objects generated
>
> assign 
> (name,c(eval(parse(text=name)),glmnet.cv=list(glmnet.cv),glmnet.obj=li
> st(glmnet.obj),
>
> selectedVar=list(colnames(selectedVar)),glmnet.cox=list(glmnet.cox),
>
> glmnet 
> .cox.meaningful=list(glmnet.cox.meaningful),ipec.coxglmnet=list(ipec.c
> oxglmnet),
>                cIndex=cIndex))
>
>  # save image of the workspace after each iteration
>    save.image("final_subgroup_analysissubgroup_analysis.RData")
>
>
> ______________________________________________
> R-help at r-project.org mailing list
> https://stat.ethz.ch/mailman/listinfo/r-help
> PLEASE do read the posting guide
http://www.R-project.org/posting-guide.html
> and provide commented, minimal, self-contained, reproducible code.
David Winsemius, MD
Alameda, CA, USA

Terry Therneau

2013-Sep-30 14:30 UTC

head link

[R] Interperting results of glmnet and coxph plot, Brier score and Harrel's C-Index - am I doing something wrong ???

To elaborate on Frank's response, the analysis plan of
   1. Look at the data and select "important" variables
   2. Put that truncated list into your favorite statistic procedure
   3. Ask - are the p-values (c-statistic, coefficients, .....) reliable?

is a very old plan.  The answer to the last question is always "NO".
The traditional step 1 involved graphs and t-tests, but replacing it by
something modern
does not change the overall outcome.

Terry T

R help - Sep 2013 - Interperting results of glmnet and coxph plot, Brier score and Harrel's C-Index - am I doing something wrong ???

[R] Interperting results of glmnet and coxph plot, Brier score and Harrel's C-Index - am I doing something wrong ???

[R] Interperting results of glmnet and coxph plot, Brier score and Harrel's C-Index - am I doing something wrong ???

[R] Interperting results of glmnet and coxph plot, Brier score and Harrel's C-Index - am I doing something wrong ???

[R] Interperting results of glmnet and coxph plot, Brier score and Harrel's C-Index - am I doing something wrong ???