E Joffe
2013-Jul-06 00:16 UTC
[R] problem with BootCV for coxph in pec after feature selection with glmnet (lasso)
Hi, I am attempting to evaluate the prediction error of a coxph model that was built after feature selection with glmnet. In the preprocessing stage I used na.omit (dataset) to remove NAs. I reconstructed all my factor variables into binary variables with dummies (using model.matrix) I then used glmnet lasso to fit a cox model and select the best performing features. Then I fit a coxph model using only these feature. When I try to evaluate the model using pec and a bootstrap I get an error that the prediction matrix has wrong dimensions. Suddenly the cox object has 318 variables instead of 356 variables in the dataset. I don't know why this is happening. The cox object I assign to pec and the dataframe are both of the same size. However, once pec refits the model its size changes (356 -> 318 variables). Apparently something is happening during the bootstrap sampling that removes some variables. As mentioned, I used na.omit in the preprocessing so there should not be any NAs. Here are some details from my workspace: Reformat_dataSet: 368 obs. Of 356 variables print (glmnet.cox) ----> 354 df, p=1 n= 368, number of events= 288 (354 df = 354 variables + time and status => 356 variables) Here is the pec function and the error: pec.f <- as.formula(Hist(time,status) ~ 1) brierGlmnet <- pec(list(glmnet.cox), data = reformat_Dataset, splitMethod="BootCV", B=50, formula = pec.f)> Error in predictSurvProb.coxph(object = structure(list(coefficients structure(c(-4.27787223119601, :Prediction matrix has wrong dimensions: 368 rows and 318 columns. But requested are predicted probabilities for 118 subjects (rows) in newdata and 356 time points (columns) This may happen when some covariate values are missing in newdata!? Here are the relevant sections of the code: trainSet <- na.omit (dataset) #creat Y (survival matrix) for glmnet surv_obj <- Surv(trainSet$time,trainSet$status) ## tranform categorical variables into binary variables with dummy for trainSet predict_matrix <- model.matrix(~ ., data=trainSet, contrasts.arg = lapply (trainSet[,sapply(trainSet, is.factor)], contrasts, contrasts=FALSE)) ## remove the statu/time variables from the predictor matrix (x) for glmnet predict_matrix <- subset (predict_matrix, select=c(-time,-status)) ## create a glmnet cox object using lasso regularization glmnet.obj <- glmnet (predict_matrix, surv_obj, family="cox") # find lambda for which dev.ratio is max max.dev.index <- which.max(glmnet.obj$dev.ratio) optimal.lambda <- glmnet.obj$lambda[max.dev.index] # take beta for optimal lambda optimal.beta <- glmnet.obj$beta[,max.dev.index] # find non zero beta coef nonzero.coef <- abs(optimal.beta)>0 selectedBeta <- optimal.beta[nonzero.coef] # take only covariates for which beta is not zero selectedVar <- predict_matrix[,nonzero.coef] # create a dataframe for trainSet with time, status and selected variables in binary representation for evaluation in pec reformat_dataSet <- as.data.frame(cbind(surv_obj,selectedVar)) # create coxph object with pre-defined coefficients glmnet.cox <- coxph(surv_obj ~ selectedVar, init=selectedBeta,iter=0) ## Brier score for the cox-glmnet model brierGlmnet <- pec(list(glmnet.cox), data = reformat_Dataset, splitMethod="BootCV", B=50, formula = pec.f) Thank you !!!
E Joffe
2013-Jul-11 13:13 UTC
[R] problem with BootCV for coxph in pec after feature selection with glmnet (lasso)
pec accepts only R-objects for which a predictSurvProb method exists and glmnet is not such an object. Currently, predictSurvProb methods are available for the following R-objects: matrix aalen, cox.aalen from library(timereg) mfp from library(mfp) phnnet, survnnet from library(survnnet) rpart (from library(rpart)) coxph, survfit from library(survival) cph, psm from library(rms) prodlim from library(prodlim) glm from library(stats) For calculating the brier score for glmnet one needs to use the peperr package with the c060 library that wraps glmnet as an object suitable for peperr. peperr_glmnet_noerror <- peperr(response=Surv(time, status), x=x, fit.fun=fit.glmnet, args.fit=list(family="cox"), complexity=complexity.glmnet,args.complexity=list(family="cox"), indices=resample.indices(n=length(time), method="boot", sample.n=10)) To get the integrated brier score for the entire model it seems one needs to use the ipec function but I still need to research that. Many thanks to Thomas Hielscher who authored the c060 package and was extremely kind to help me with this. -----Original Message----- From: E Joffe [mailto:ejoffe at hotmail.com] Sent: Friday, July 05, 2013 7:16 PM To: 'r-help at R-project.org' Subject: problem with BootCV for coxph in pec after feature selection with glmnet (lasso) Hi, I am attempting to evaluate the prediction error of a coxph model that was built after feature selection with glmnet. In the preprocessing stage I used na.omit (dataset) to remove NAs. I reconstructed all my factor variables into binary variables with dummies (using model.matrix) I then used glmnet lasso to fit a cox model and select the best performing features. Then I fit a coxph model using only these feature. When I try to evaluate the model using pec and a bootstrap I get an error that the prediction matrix has wrong dimensions. Suddenly the cox object has 318 variables instead of 356 variables in the dataset. I don't know why this is happening. The cox object I assign to pec and the dataframe are both of the same size. However, once pec refits the model its size changes (356 -> 318 variables). Apparently something is happening during the bootstrap sampling that removes some variables. As mentioned, I used na.omit in the preprocessing so there should not be any NAs. Here are some details from my workspace: Reformat_dataSet: 368 obs. Of 356 variables print (glmnet.cox) ----> 354 df, p=1 n= 368, number of events= 288 (354 df = 354 variables + time and status => 356 variables) Here is the pec function and the error: pec.f <- as.formula(Hist(time,status) ~ 1) brierGlmnet <- pec(list(glmnet.cox), data = reformat_Dataset, splitMethod="BootCV", B=50, formula = pec.f)> Error in predictSurvProb.coxph(object = structure(list(coefficients structure(c(-4.27787223119601, :Prediction matrix has wrong dimensions: 368 rows and 318 columns. But requested are predicted probabilities for 118 subjects (rows) in newdata and 356 time points (columns) This may happen when some covariate values are missing in newdata!? Here are the relevant sections of the code: trainSet <- na.omit (dataset) #creat Y (survival matrix) for glmnet surv_obj <- Surv(trainSet$time,trainSet$status) ## tranform categorical variables into binary variables with dummy for trainSet predict_matrix <- model.matrix(~ ., data=trainSet, contrasts.arg = lapply (trainSet[,sapply(trainSet, is.factor)], contrasts, contrasts=FALSE)) ## remove the statu/time variables from the predictor matrix (x) for glmnet predict_matrix <- subset (predict_matrix, select=c(-time,-status)) ## create a glmnet cox object using lasso regularization glmnet.obj <- glmnet (predict_matrix, surv_obj, family="cox") # find lambda for which dev.ratio is max max.dev.index <- which.max(glmnet.obj$dev.ratio) optimal.lambda <- glmnet.obj$lambda[max.dev.index] # take beta for optimal lambda optimal.beta <- glmnet.obj$beta[,max.dev.index] # find non zero beta coef nonzero.coef <- abs(optimal.beta)>0 selectedBeta <- optimal.beta[nonzero.coef] # take only covariates for which beta is not zero selectedVar <- predict_matrix[,nonzero.coef] # create a dataframe for trainSet with time, status and selected variables in binary representation for evaluation in pec reformat_dataSet <- as.data.frame(cbind(surv_obj,selectedVar)) # create coxph object with pre-defined coefficients glmnet.cox <- coxph(surv_obj ~ selectedVar, init=selectedBeta,iter=0) ## Brier score for the cox-glmnet model brierGlmnet <- pec(list(glmnet.cox), data = reformat_Dataset, splitMethod="BootCV", B=50, formula = pec.f) Thank you !!!
Seemingly Similar Threads
- Creating dummy vars with contrasts - why does the returned identity matrix contain all levels (and not n-1 levels) ?
- Pec function in R
- rspec on rails: undefined method ''controller_name''
- [PATCH] Notify caching_thread()s to give up on extent_commit_sem when needed.
- R: Re: R: Re: Dovecot 2.0 shared mailbox using ACLs problem