E Joffe
2013-Sep-28 09:39 UTC
[R] Interperting results of glmnet and coxph plot, Brier score and Harrel's C-Index - am I doing something wrong ???
Hi all, I am using COX LASSO (glmnet / coxnet) regression to analyze a dataset of 394 obs. / 268 vars. I use the following procedure: 1. Construct a coxnet on the entire dataset (by cv.glmnet) 2. Pick the significant features by selecting the non-zero coefficient under the best lambda selected by the model 3. Build a coxph model with bi-directional stepwise feature selection limited to the coxnet selected features. To validate the model I use both Brier score (library=peperr) and Harrel's C-Index (library=Hmisc) with a bootstrap of 50 iterations. MY QUESTION : I am getting fairly good C-Index (0.76) and Brier (0.07) values for the models however per the coxnet the %Dev explained by the model is at best 0.27 and when I plot the survfit of the coxph the plotted confidence interval is very large. What am I missing here ? %DEV=27% Brier score - 0.07 ($ipec.coxglmnet -> [1] 7.24) C-Index - 0.76 ($cIndex -> [1] 0.763) DATA: [Private Health Information - can't publish] 394 obs./268 vars. CODE (need to define a dataset with 'time' and 'status' variables): library("survival") library ("glmnet") library ("c060") library ("peperr") library ("Hmisc") #creat Y (survival matrix) for glmnet surv_obj <- Surv(dataset$time,dataset$status) ## tranform categorical variables into binary variables with dummy for dataset predict_matrix <- model.matrix(~ ., data=dataset, contrasts.arg = lapply (dataset[,sapply(dataset, is.factor)], contrasts)) ## remove the statu/time variables from the predictor matrix (x) for glmnet predict_matrix <- subset (predict_matrix, select=c(-time,-status)) ## create a glmnet cox object using lasso regularization and cross validation glmnet.cv <- cv.glmnet (predict_matrix, surv_obj, family="cox") ## get the glmnet model on the full dataset glmnet.obj <- glmnet.cv$glmnet.fit # find lambda index for the models with least partial likelihood deviance (by cv.glmnet) optimal.lambda <- glmnet.cv$lambda.min # For a more parsimoneous model use lambda.1se lambda.index <- which(glmnet.obj$lambda==optimal.lambda) # take beta for optimal lambda optimal.beta <- glmnet.obj$beta[,lambda.index] # find non zero beta coef nonzero.coef <- abs(optimal.beta)>0 selectedBeta <- optimal.beta[nonzero.coef] # take only covariates for which beta is not zero selectedVar <- predict_matrix[,nonzero.coef] # create a dataframe for trainSet with time, status and selected variables in binary representation for evaluation in pec reformat_dataSet <- as.data.frame(cbind(surv_obj,selectedVar)) # glmnet.cox only with meaningful features selected by stepwise bidirectional AIC feature selection glmnet.cox.meaningful <- step(coxph(Surv(time,status) ~ .,data=reformat_dataSet),direction="both") ##-------------------------------------------------------------------------- ----------------------------- ## MODEL PERFORMANCE ##-------------------------------------------------------------------------- ----------------------------- ## ## Calculate the Brier score - pec does its own bootstrap so this function runs on i=51 (i.e., whole trainset) ## Brier score calculation to cox-glmnet peperr.coxglmnet <- peperr(response=surv_obj, x=selectedVarCox, fit.fun=fit.coxph, load.all=TRUE, indices=resample.indices(n=nrow(surv_obj), method="boot", sample.n=50)) # Get error predictions from peperr prederr.coxglmnet <- perr(peperr.coxglmnet) # Integrated prediction error Brier score calculation ipec.coxglmnet<-ipec(prederr.coxglmnet, eval.times=peperr.coxglmnet$attribute, response=surv_obj) ## C-Index calculation 50 iter bootstrapping for (i in 1:50){ print (paste("Iteration:",i)) train <- sample(1:nrow(dataset), nrow(dataset), replace = TRUE) ## random sampling with replacement # create a dataframe for trainSet with time, status and selected variables in binary representation for evaluation in pec reformat_trainSet <- reformat_dataSet [train,] # glmnet.cox only with meaningful features selected by stepwise bidirectional AIC feature selection glmnet.cox.meaningful.test <- step(coxph(Surv(time,status) ~ .,data=reformat_dataSet),direction="both") selectedVarCox <- predict_matrix[,attr(glmnet.cox.meaningful.test$terms,"term.labels")] reformat_testSet <- as.data.frame(cbind(surv_obj,selectedVarCox)) reformat_testSet <- reformat_dataSet [-train,] # compute c-index (Harrell's) for cox-glmnet models if (is.null(glmnet.cox.meaningful)){ cIndexCoxglmnet <- c(cIndexCoxglmnet,NULL) }else{ cIndexCoxglmnet <- c(cIndexCoxglmnet, 1-rcorr.cens(predict(glmnet.cox.meaningful, reformat_testSet),Surv(reformat_testSet$time,reformat_testSet$status))[1]) } } #Get average C-Index cIndex<- mean (unlist(cIndexCoxglmnet),rm.na=TRUE) #create a list of all the objects generated assign(name,c(eval(parse(text=name)),glmnet.cv=list(glmnet.cv),glmnet.obj=li st(glmnet.obj), selectedVar=list(colnames(selectedVar)),glmnet.cox=list(glmnet.cox), glmnet.cox.meaningful=list(glmnet.cox.meaningful),ipec.coxglmnet=list(ipec.c oxglmnet), cIndex=cIndex)) # save image of the workspace after each iteration save.image("final_subgroup_analysissubgroup_analysis.RData")
Bert Gunter
2013-Sep-28 14:28 UTC
[R] Interperting results of glmnet and coxph plot, Brier score and Harrel's C-Index - am I doing something wrong ???
This appears to be a statistics, not an R-help question, so should probably be asked on a statistics list, not here (e.g. stats.stackexchange.com). But if I understand your issue correctly, perhaps the heart f the matter is: why do you think a stable fit must explain a lot of the variation? Feel free to ignore if I I'm wrong or discuss further on the statistics list. Cheers, Bert On Sat, Sep 28, 2013 at 2:39 AM, E Joffe <ejoffe at hotmail.com> wrote:> Hi all, > > I am using COX LASSO (glmnet / coxnet) regression to analyze a dataset of > 394 obs. / 268 vars. > I use the following procedure: > 1. Construct a coxnet on the entire dataset (by cv.glmnet) > 2. Pick the significant features by selecting the non-zero coefficient > under the best lambda selected by the model > 3. Build a coxph model with bi-directional stepwise feature selection > limited to the coxnet selected features. > > To validate the model I use both Brier score (library=peperr) and Harrel's > C-Index (library=Hmisc) with a bootstrap of 50 iterations. > > > MY QUESTION : I am getting fairly good C-Index (0.76) and Brier (0.07) > values for the models however per the coxnet the %Dev explained by the model > is at best 0.27 and when I plot the survfit of the coxph the plotted > confidence interval is very large. > What am I missing here ? > > %DEV=27% > > > > Brier score - 0.07 ($ipec.coxglmnet -> [1] 7.24) > C-Index - 0.76 ($cIndex -> [1] 0.763) > > > > DATA: [Private Health Information - can't publish] 394 obs./268 vars. > > CODE (need to define a dataset with 'time' and 'status' variables): > > library("survival") > library ("glmnet") > library ("c060") > library ("peperr") > library ("Hmisc") > > #creat Y (survival matrix) for glmnet > surv_obj <- Surv(dataset$time,dataset$status) > > > ## tranform categorical variables into binary variables with dummy for > dataset > predict_matrix <- model.matrix(~ ., data=dataset, > contrasts.arg = lapply > (dataset[,sapply(dataset, is.factor)], contrasts)) > > ## remove the statu/time variables from the predictor matrix (x) for > glmnet > predict_matrix <- subset (predict_matrix, select=c(-time,-status)) > > ## create a glmnet cox object using lasso regularization and cross > validation > glmnet.cv <- cv.glmnet (predict_matrix, surv_obj, family="cox") > > ## get the glmnet model on the full dataset > glmnet.obj <- glmnet.cv$glmnet.fit > > # find lambda index for the models with least partial likelihood > deviance (by cv.glmnet) > optimal.lambda <- glmnet.cv$lambda.min # For a more parsimoneous > model use lambda.1se > lambda.index <- which(glmnet.obj$lambda==optimal.lambda) > > > # take beta for optimal lambda > optimal.beta <- glmnet.obj$beta[,lambda.index] > > # find non zero beta coef > nonzero.coef <- abs(optimal.beta)>0 > selectedBeta <- optimal.beta[nonzero.coef] > > # take only covariates for which beta is not zero > selectedVar <- predict_matrix[,nonzero.coef] > > # create a dataframe for trainSet with time, status and selected > variables in binary representation for evaluation in pec > reformat_dataSet <- as.data.frame(cbind(surv_obj,selectedVar)) > > # glmnet.cox only with meaningful features selected by stepwise > bidirectional AIC feature selection > glmnet.cox.meaningful <- step(coxph(Surv(time,status) ~ > .,data=reformat_dataSet),direction="both") > > > > > ##-------------------------------------------------------------------------- > ----------------------------- > ## MODEL PERFORMANCE > > ##-------------------------------------------------------------------------- > ----------------------------- > ## > > > ## Calculate the Brier score - pec does its own bootstrap so this > function runs on i=51 (i.e., whole trainset) > > ## Brier score calculation to cox-glmnet > peperr.coxglmnet <- peperr(response=surv_obj, x=selectedVarCox, > fit.fun=fit.coxph, load.all=TRUE, > indices=resample.indices(n=nrow(surv_obj), > method="boot", sample.n=50)) > > # Get error predictions from peperr > prederr.coxglmnet <- perr(peperr.coxglmnet) > > # Integrated prediction error Brier score calculation > ipec.coxglmnet<-ipec(prederr.coxglmnet, > eval.times=peperr.coxglmnet$attribute, response=surv_obj) > > > ## C-Index calculation 50 iter bootstrapping > > for (i in 1:50){ > print (paste("Iteration:",i)) > train <- sample(1:nrow(dataset), nrow(dataset), replace = TRUE) ## > random sampling with replacement > # create a dataframe for trainSet with time, status and selected > variables in binary representation for evaluation in pec > reformat_trainSet <- reformat_dataSet [train,] > > > # glmnet.cox only with meaningful features selected by stepwise > bidirectional AIC feature selection > glmnet.cox.meaningful.test <- step(coxph(Surv(time,status) ~ > .,data=reformat_dataSet),direction="both") > > selectedVarCox <- > predict_matrix[,attr(glmnet.cox.meaningful.test$terms,"term.labels")] > reformat_testSet <- as.data.frame(cbind(surv_obj,selectedVarCox)) > reformat_testSet <- reformat_dataSet [-train,] > > > # compute c-index (Harrell's) for cox-glmnet models > if (is.null(glmnet.cox.meaningful)){ > cIndexCoxglmnet <- c(cIndexCoxglmnet,NULL) > }else{ > cIndexCoxglmnet <- c(cIndexCoxglmnet, > 1-rcorr.cens(predict(glmnet.cox.meaningful, > reformat_testSet),Surv(reformat_testSet$time,reformat_testSet$status))[1]) > } > } > > #Get average C-Index > cIndex<- mean (unlist(cIndexCoxglmnet),rm.na=TRUE) > > #create a list of all the objects generated > > assign(name,c(eval(parse(text=name)),glmnet.cv=list(glmnet.cv),glmnet.obj=li > st(glmnet.obj), > > selectedVar=list(colnames(selectedVar)),glmnet.cox=list(glmnet.cox), > > glmnet.cox.meaningful=list(glmnet.cox.meaningful),ipec.coxglmnet=list(ipec.c > oxglmnet), > cIndex=cIndex)) > > # save image of the workspace after each iteration > save.image("final_subgroup_analysissubgroup_analysis.RData") > > > > ______________________________________________ > R-help at r-project.org mailing list > https://stat.ethz.ch/mailman/listinfo/r-help > PLEASE do read the posting guide http://www.R-project.org/posting-guide.html > and provide commented, minimal, self-contained, reproducible code. >-- Bert Gunter Genentech Nonclinical Biostatistics Internal Contact Info: Phone: 467-7374 Website: http://pharmadevelopment.roche.com/index/pdb/pdb-functional-groups/pdb-biostatistics/pdb-ncb-home.htm
David Winsemius
2013-Sep-28 14:40 UTC
[R] Interperting results of glmnet and coxph plot, Brier score and Harrel's C-Index - am I doing something wrong ???
On Sep 28, 2013, at 2:39 AM, E Joffe wrote:> Hi all, > > I am using COX LASSO (glmnet / coxnet) regression to analyze a > dataset of > 394 obs. / 268 vars. > I use the following procedure: > 1. Construct a coxnet on the entire dataset (by cv.glmnet) > 2. Pick the significant features by selecting the non-zero coefficient > under the best lambda selected by the model > 3. Build a coxph model with bi-directional stepwise feature selection > limited to the coxnet selected features.I was a bit puzzled by the third step. Once you had a reduced model from glmnet, what was the statistical basis for further elimination of variables? (Quite apart from the statistical issues, I was rather surprised that this procedure even produced results since the 'step' function is not described in the 'stats' package as applying to 'coxph' model objects.)> > To validate the model I use both Brier score (library=peperr) and > Harrel's [Harrell] > C-Index (library=Hmisc) with a bootstrap of 50 iterations. > > > MY QUESTION : I am getting fairly good C-Index (0.76) and Brier > (0.07) > values for the models however per the coxnet the %Dev explained by > the model > is at best 0.27 and when I plot the survfit of the coxph the plotted > confidence interval is very large.How many events did you have? (The width of CI's is most importantly dependent on event counts and not particularly improved by a high case count. The power considerations are very similar to those of a binomial test.)> What am I missing here ?Perhaps sufficient events? (You also seem to be missing a description of the study goals.) -- David.> > %DEV=27% > > > > Brier score - 0.07 ($ipec.coxglmnet -> [1] 7.24) > C-Index - 0.76 ($cIndex -> [1] 0.763) > > > > DATA: [Private Health Information - can't publish] 394 obs./268 vars. > > CODE (need to define a dataset with 'time' and 'status' variables): > > library("survival") > library ("glmnet") > library ("c060") > library ("peperr") > library ("Hmisc") > > #creat Y (survival matrix) for glmnet > surv_obj <- Surv(dataset$time,dataset$status) > > > ## tranform categorical variables into binary variables with > dummy for > dataset > predict_matrix <- model.matrix(~ ., data=dataset, > contrasts.arg = lapply > (dataset[,sapply(dataset, is.factor)], contrasts)) > > ## remove the statu/time variables from the predictor matrix (x) > for > glmnet > predict_matrix <- subset (predict_matrix, select=c(-time,-status)) > > ## create a glmnet cox object using lasso regularization and cross > validation > glmnet.cv <- cv.glmnet (predict_matrix, surv_obj, family="cox") > > ## get the glmnet model on the full dataset > glmnet.obj <- glmnet.cv$glmnet.fit > > # find lambda index for the models with least partial likelihood > deviance (by cv.glmnet) > optimal.lambda <- glmnet.cv$lambda.min # For a more parsimoneous > model use lambda.1se > lambda.index <- which(glmnet.obj$lambda==optimal.lambda) > > > # take beta for optimal lambda > optimal.beta <- glmnet.obj$beta[,lambda.index] > > # find non zero beta coef > nonzero.coef <- abs(optimal.beta)>0 > selectedBeta <- optimal.beta[nonzero.coef] > > # take only covariates for which beta is not zero > selectedVar <- predict_matrix[,nonzero.coef] > > # create a dataframe for trainSet with time, status and selected > variables in binary representation for evaluation in pec > reformat_dataSet <- as.data.frame(cbind(surv_obj,selectedVar)) > > # glmnet.cox only with meaningful features selected by stepwise > bidirectional AIC feature selection > glmnet.cox.meaningful <- step(coxph(Surv(time,status) ~ > .,data=reformat_dataSet),direction="both") > > > > > ##-------------------------------------------------------------------------- > ----------------------------- > ## MODEL PERFORMANCE > > ##-------------------------------------------------------------------------- > ----------------------------- > ## > > > ## Calculate the Brier score - pec does its own bootstrap so this > function runs on i=51 (i.e., whole trainset) > > ## Brier score calculation to cox-glmnet > peperr.coxglmnet <- peperr(response=surv_obj, x=selectedVarCox, > fit.fun=fit.coxph, load.all=TRUE, > > indices=resample.indices(n=nrow(surv_obj), > method="boot", sample.n=50)) > > # Get error predictions from peperr > prederr.coxglmnet <- perr(peperr.coxglmnet) > > # Integrated prediction error Brier score calculation > ipec.coxglmnet<-ipec(prederr.coxglmnet, > eval.times=peperr.coxglmnet$attribute, response=surv_obj) > > > ## C-Index calculation 50 iter bootstrapping > > for (i in 1:50){ > print (paste("Iteration:",i)) > train <- sample(1:nrow(dataset), nrow(dataset), replace = > TRUE) ## > random sampling with replacement > # create a dataframe for trainSet with time, status and > selected > variables in binary representation for evaluation in pec > reformat_trainSet <- reformat_dataSet [train,] > > > # glmnet.cox only with meaningful features selected by stepwise > bidirectional AIC feature selection > glmnet.cox.meaningful.test <- step(coxph(Surv(time,status) ~ > .,data=reformat_dataSet),direction="both") > > selectedVarCox <- > predict_matrix[,attr(glmnet.cox.meaningful.test$terms,"term.labels")] > reformat_testSet <- > as.data.frame(cbind(surv_obj,selectedVarCox)) > reformat_testSet <- reformat_dataSet [-train,] > > > # compute c-index (Harrell's) for cox-glmnet models > if (is.null(glmnet.cox.meaningful)){ > cIndexCoxglmnet <- c(cIndexCoxglmnet,NULL) > }else{ > cIndexCoxglmnet <- c(cIndexCoxglmnet, > 1-rcorr.cens(predict(glmnet.cox.meaningful, > reformat_testSet),Surv(reformat_testSet$time,reformat_testSet > $status))[1]) > } > } > > #Get average C-Index > cIndex<- mean (unlist(cIndexCoxglmnet),rm.na=TRUE) > > #create a list of all the objects generated > > assign > (name,c(eval(parse(text=name)),glmnet.cv=list(glmnet.cv),glmnet.obj=li > st(glmnet.obj), > > selectedVar=list(colnames(selectedVar)),glmnet.cox=list(glmnet.cox), > > glmnet > .cox.meaningful=list(glmnet.cox.meaningful),ipec.coxglmnet=list(ipec.c > oxglmnet), > cIndex=cIndex)) > > # save image of the workspace after each iteration > save.image("final_subgroup_analysissubgroup_analysis.RData") > > > ______________________________________________ > R-help at r-project.org mailing list > https://stat.ethz.ch/mailman/listinfo/r-help > PLEASE do read the posting guide http://www.R-project.org/posting-guide.html > and provide commented, minimal, self-contained, reproducible code.David Winsemius, MD Alameda, CA, USA
Terry Therneau
2013-Sep-30 14:30 UTC
[R] Interperting results of glmnet and coxph plot, Brier score and Harrel's C-Index - am I doing something wrong ???
To elaborate on Frank's response, the analysis plan of 1. Look at the data and select "important" variables 2. Put that truncated list into your favorite statistic procedure 3. Ask - are the p-values (c-statistic, coefficients, .....) reliable? is a very old plan. The answer to the last question is always "NO". The traditional step 1 involved graphs and t-tests, but replacing it by something modern does not change the overall outcome. Terry T