thr3ads.net - R help - [R] coxph won't converge when including categorical (factor) variables [Jul 2013]

If this information is useful, please help other people find it:
Share via:

E Joffe

2013-Jul-06 12:04 UTC

[R] coxph won't converge when including categorical (factor) variables

Hello,

 

[rephrasing and reposting  of a previous question (that was not answered)
with new information]

 

I have a dataset of 371 observations.

When I run coxph with numeric variables it works fine.

However, when I try to add factor (categorical) variables it returns "Ran
out of iterations and the model did not converge"

 

Of note, when I restructure all factors to binary variables with dummy and
use glmnet-lasso the model converges.

 

Here are examples of the code and output (including summary description of
the variables):
> maxSTree.cox <- coxph (Surv(time,status)~Chemo_Simple, data=dataset)
 

Warning message:

In fitter(X, Y, strats, offset, init, control, weights = weights,  :

  Ran out of iterations and did not converge

 
> summary (dataset$Chemo_Simple)
                        Anthra_HDAC       Anthra_Plus       ArsenicAtra
ATRA           ATRA_GO 

                0               163                 2                12
0                 2 

         ATRA_IDA Demeth_HistoneDAC          Flu_HDAC     Flu_HDAC_plus
HDAC_Clof         HDAC_only 

                0                34                37                 4
24                 1 

        HDAC_Plus        LowArac       LowDAC_Clof         MYLO_IL11
Phase1 

                4                 8                30                 5
5 

              SCT    StdARAC_Anthra      StdAraC_Plus          Targeted
VNP40101M 

                0                 0                 0                13
23 

 

 

HELP !!!!


	[[alternative HTML version deleted]]

Bert Gunter

2013-Jul-06 13:42 UTC

head link

[R] coxph won't converge when including categorical (factor) variables

Hint: Why do you think no one replied?

1. That it converged with one model/fitting algorithm and not with
another is useless.

2. This cannot possibly be answered without your data.

3. A guess: you are overfitting -- glmnet regularizes and will
therefore fit (highly) correlated regressors coxph won't. .

-- Bert

On Sat, Jul 6, 2013 at 5:04 AM, E Joffe <ejoffe at hotmail.com>
wrote:> Hello,
>
>
>
> [rephrasing and reposting  of a previous question (that was not answered)
> with new information]
>
>
>
> I have a dataset of 371 observations.
>
> When I run coxph with numeric variables it works fine.
>
> However, when I try to add factor (categorical) variables it returns
"Ran
> out of iterations and the model did not converge"
>
>
>
> Of note, when I restructure all factors to binary variables with dummy and
> use glmnet-lasso the model converges.
>
>
>
> Here are examples of the code and output (including summary description of
> the variables):
>
>> maxSTree.cox <- coxph (Surv(time,status)~Chemo_Simple, data=dataset)
>
>
>
> Warning message:
>
> In fitter(X, Y, strats, offset, init, control, weights = weights,  :
>
>   Ran out of iterations and did not converge
>
>
>
>> summary (dataset$Chemo_Simple)
>
>                         Anthra_HDAC       Anthra_Plus       ArsenicAtra
> ATRA           ATRA_GO
>
>                 0               163                 2                12
> 0                 2
>
>          ATRA_IDA Demeth_HistoneDAC          Flu_HDAC     Flu_HDAC_plus
> HDAC_Clof         HDAC_only
>
>                 0                34                37                 4
> 24                 1
>
>         HDAC_Plus        LowArac       LowDAC_Clof         MYLO_IL11
> Phase1
>
>                 4                 8                30                 5
> 5
>
>               SCT    StdARAC_Anthra      StdAraC_Plus          Targeted
> VNP40101M
>
>                 0                 0                 0                13
> 23
>
>
>
>
>
> HELP !!!!
>
>
>         [[alternative HTML version deleted]]
>
> ______________________________________________
> R-help at r-project.org mailing list
> https://stat.ethz.ch/mailman/listinfo/r-help
> PLEASE do read the posting guide
http://www.R-project.org/posting-guide.html
> and provide commented, minimal, self-contained, reproducible code.


-- 

Bert Gunter
Genentech Nonclinical Biostatistics

Internal Contact Info:
Phone: 467-7374
Website:
http://pharmadevelopment.roche.com/index/pdb/pdb-functional-groups/pdb-biostatistics/pdb-ncb-home.htm

Marc Schwartz

2013-Jul-06 13:46 UTC

head link

[R] coxph won't converge when including categorical (factor) variables

On Jul 6, 2013, at 7:04 AM, E Joffe <ejoffe at hotmail.com> wrote:
> Hello,
> 
> 
> 
> [rephrasing and reposting  of a previous question (that was not answered)
> with new information]
> 
> 
> 
> I have a dataset of 371 observations.
> 
> When I run coxph with numeric variables it works fine.
> 
> However, when I try to add factor (categorical) variables it returns
"Ran
> out of iterations and the model did not converge"
> 
> 
> 
> Of note, when I restructure all factors to binary variables with dummy and
> use glmnet-lasso the model converges.
> 
> 
> 
> Here are examples of the code and output (including summary description of
> the variables):
> 
>> maxSTree.cox <- coxph (Surv(time,status)~Chemo_Simple, data=dataset)
> 
> 
> 
> Warning message:
> 
> In fitter(X, Y, strats, offset, init, control, weights = weights,  :
> 
>  Ran out of iterations and did not converge
> 
> 
> 
>> summary (dataset$Chemo_Simple)
> 
>                        Anthra_HDAC       Anthra_Plus       ArsenicAtra
> ATRA           ATRA_GO 
> 
>                0               163                 2                12
> 0                 2 
> 
>         ATRA_IDA Demeth_HistoneDAC          Flu_HDAC     Flu_HDAC_plus
> HDAC_Clof         HDAC_only 
> 
>                0                34                37                 4
> 24                 1 
> 
>        HDAC_Plus        LowArac       LowDAC_Clof         MYLO_IL11
> Phase1 
> 
>                4                 8                30                 5
> 5 
> 
>              SCT    StdARAC_Anthra      StdAraC_Plus          Targeted
> VNP40101M 
> 
>                0                 0                 0                13
> 23 
> 
> 
> 
> 
> 
> HELP !!!!
> 

You have 371 observations, but did not indicate how many events you have in that
dataset. A cross tabulation of the 'status' factor with Chemo_Simple is
likely to be enlightening to get a sense of the distribution of events for each
level of Chemo_Simple.

Chemo_Simple has a number of levels with rather low counts (ignoring the levels
with 0's for the moment), which is likely to be a part of the problem. There
is likely to be an issue fitting the model for some of these levels, bearing in
mind that your "reference level" of Anthra_HDAC has a large proportion
of the observations and with the default treatment contrasts, each of the other
levels will be compared to it.

You should also run:

  dataset$Chemo_Simple <- factor(dataset$Chemo_Simple)

to get rid of the unused factor levels. A quick check of the coxph() code
versus, for example, lm()/glm(), reveals that while the latter will drop unused
factor levels when creating the internal model dataframe, the former will not. A
check of some test output here with coxph() suggests that the unused factor
levels will simply appear as NA's in the coxph() output without affecting
the levels that are present, but it will be cleaner to remove them a priori.

It is also likely that you don't have enough events to handle the effective
covariate degrees of freedom that this single factor model will have. By my
count (the formatting above is corrupted), you have 16 non-zero factor levels,
which would be 15 covariate degrees of freedom consumed by this single factor. A
general rule of thumb for Cox models (and logistic regression) is to have 20
events per covariate degree of freedom to avoid overfitting, which means that
you would really need 300 "events". In this context, events are the
smaller count of the two possible levels of "status". Since you only
have 371 total observations, there is no way that you could have 300 events,
since you would need to have at least 600 observations. Thus, it is likely that
you are attempting to overfit the model to the data as well.

The issue of using glmnet on a matrix of separate dummy variables and it working
is likely to be an outcome of the LASSO method penalizing/shrinking the factor
levels that are irrelevant to have coefficients of 0. Thus, I would envision
that the effective number of your factor levels is reduced as a consequence,
allowing the model to be fit.

You likely need to consider collapsing some of the low count factor levels into
an "Other" category, if it makes contextual sense to do so for your
data.

Regards,

Marc Schwartz

Jeff Newmiller

2013-Jul-06 16:26 UTC

head link

[R] coxph won't converge when including categorical (factor) variables

I don't know much about your problem, but if you want help you are going to
need to learn to communicate effectively. This may help:

http://stackoverflow.com/questions/5963269/how-to-make-a-great-r-reproducible-example
---------------------------------------------------------------------------
Jeff Newmiller                        The     .....       .....  Go Live...
DCN:<jdnewmil at dcn.davis.ca.us>        Basics: ##.#.       ##.#.  Live
Go...
                                      Live:   OO#.. Dead: OO#..  Playing
Research Engineer (Solar/Batteries            O.O#.       #.O#.  with
/Software/Embedded Controllers)               .OO#.       .OO#.  rocks...1k
--------------------------------------------------------------------------- 
Sent from my phone. Please excuse my brevity.

E Joffe <ejoffe at hotmail.com> wrote:
>Hello,
>
> 
>
>[rephrasing and reposting  of a previous question (that was not
>answered)
>with new information]
>
> 
>
>I have a dataset of 371 observations.
>
>When I run coxph with numeric variables it works fine.
>
>However, when I try to add factor (categorical) variables it returns
>"Ran
>out of iterations and the model did not converge"
>
> 
>
>Of note, when I restructure all factors to binary variables with dummy
>and
>use glmnet-lasso the model converges.
>
> 
>
>Here are examples of the code and output (including summary description
>of
>the variables):
>
>> maxSTree.cox <- coxph (Surv(time,status)~Chemo_Simple, data=dataset)
>
> 
>
>Warning message:
>
>In fitter(X, Y, strats, offset, init, control, weights = weights,  :
>
>  Ran out of iterations and did not converge
>
> 
>
>> summary (dataset$Chemo_Simple)
>
>                        Anthra_HDAC       Anthra_Plus       ArsenicAtra
>ATRA           ATRA_GO 
>
>                0               163                 2                12
>0                 2 
>
>         ATRA_IDA Demeth_HistoneDAC          Flu_HDAC     Flu_HDAC_plus
>HDAC_Clof         HDAC_only 
>
>                0                34                37                 4
>24                 1 
>
>        HDAC_Plus        LowArac       LowDAC_Clof         MYLO_IL11
>Phase1 
>
>                4                 8                30                 5
>5 
>
>              SCT    StdARAC_Anthra      StdAraC_Plus          Targeted
>VNP40101M 
>
>                0                 0                 0                13
>23 
>
> 
>
> 
>
>HELP !!!!
>
>
>	[[alternative HTML version deleted]]
>
>______________________________________________
>R-help at r-project.org mailing list
>https://stat.ethz.ch/mailman/listinfo/r-help
>PLEASE do read the posting guide
>http://www.R-project.org/posting-guide.html
>and provide commented, minimal, self-contained, reproducible code.

E Joffe

2013-Jul-11 13:16 UTC

head link

[R] coxph won't converge when including categorical (factor) variables

I have found the answer for my second question and am posting here for the
benefit of the community:

pec accepts only R-objects for which a predictSurvProb method exists and glmnet
is not such an object.

    Currently, predictSurvProb methods are available for the following
R-objects:
    matrix
    aalen, cox.aalen from library(timereg)
    mfp from library(mfp)
    phnnet, survnnet from library(survnnet)
    rpart (from library(rpart))
    coxph, survfit from library(survival)
    cph, psm from library(rms)
    prodlim from library(prodlim)
    glm from library(stats)

For calculating the brier score for glmnet one needs to use the peperr package
with the c060 library that wraps glmnet as an object suitable for peperr.

    peperr_glmnet_noerror <- peperr(response=Surv(time, status), x=x, 
                        fit.fun=fit.glmnet,
args.fit=list(family="cox"),

complexity=complexity.glmnet,args.complexity=list(family="cox"),
                        indices=resample.indices(n=length(time),
method="boot", sample.n=10))

To get the integrated brier score for the entire model it seems one needs to use
the ipec function but I still need to research that.

Many thanks to Thomas Hielscher who authored the c060 package and was extremely
kind to help me with this.

-----Original Message-----
From: E Joffe [mailto:ejoffe at hotmail.com] 
Sent: Sunday, July 07, 2013 10:02 AM
To: 'Marc Schwartz'
Cc: 'r-help at r-project.org'
Subject: RE: [R] coxph won't converge when including categorical (factor)
variables

Hello all,

I have posted a reply to Mark and Jeff out of my own fault sent it to them
personally and not the group.
I will re-post the question and Mark's answer so that the community can
benefit from his comments.
It seems that my posts are not in accordance with the guidelines and I will try
to remedy that as well.
I apologize for the mess I have made !!!
I am afraid I can't post the actual data (as it is Protected Health
Information).

In any case here are the question, code, output and Mark's comment (which
can be summarized that there is very little information about the pec package
within the community and that the best solution would be to try and contact the
author of the package - which I will and then repost the answer):

I am attempting to evaluate the prediction error of a coxph model that was built
after feature selection with glmnet.

In the preprocessing stage I used na.omit (dataset) to remove NAs. I
reconstructed all my factor variables into binary variables with dummies (using
model.matrix) I then used glmnet lasso to fit a cox model and select the best
performing features. Then I fit a coxph model using only these feature.

When I try to evaluate the model using pec and a bootstrap I get an error that
the prediction matrix has wrong dimensions.

In the original dataset there are 313 variables. In the coxph model (glmnet.cox
in my code) there are 313 df. However when I run pec the prediction matrix is
noted as having 318 variables and therefore doesn't fit the testset which
has 313 variables (+the status and time variables).

It seems that pec does something to the variables while retraining the coxph
model on the bootstrap samples. I tried setting singular.ok=FALSE in the coxph
model without avail.

Here is a summary of key variables (prior to restructuring for glmnet) Followed
by the code I ran and the errors:
> summary (trainSet)     time              status         Age_at_Dx       CREATININE   
 Min.   :  0.1429   Min.   :0.0000   Min.   :14.81   Min.   :0.400  
 1st Qu.: 22.0714   1st Qu.:1.0000   1st Qu.:52.51   1st Qu.:0.800  
 Median : 52.9286   Median :1.0000   Median :63.79   Median :0.900  
 Mean   : 96.5415   Mean   :0.7826   Mean   :61.01   Mean   :1.042  
 3rd Qu.:154.6071   3rd Qu.:1.0000   3rd Qu.:73.01   3rd Qu.:1.125  
 Max.   :437.7143   Max.   :1.0000   Max.   :87.23   Max.   :6.000  
 Performance_Status    ALBUMIN          Cyto             WBC        
 Min.   :0.0000     Min.   :0.70   Min.   : 1.000   Min.   :  0.60  
 1st Qu.:1.0000     1st Qu.:3.00   1st Qu.: 4.000   1st Qu.:  3.20  
     Median :1.0000     Median :3.60   Median : 4.000   Median :  9.20  
 Mean   :0.9728     Mean   :3.48   Mean   : 6.802   Mean   : 28.35  
 3rd Qu.:1.0000     3rd Qu.:4.00   3rd Qu.:11.000   3rd Qu.: 33.10  
 Max.   :4.0000     Max.   :5.20   Max.   :17.000   Max.   :373.00  
  PRKCD_pT507             HGB            Maxblast          CD19       
 Min.   :-2.429605   Min.   : 5.400   Min.   : 0.00   Min.   : 0.000  
  1st Qu.:-0.627005   1st Qu.: 8.500   1st Qu.:24.00   1st Qu.: 1.000  
 Median :-0.013117   Median : 9.550   Median :40.00   Median : 2.000  
 Mean   : 0.006432   Mean   : 9.689   Mean   :43.74   Mean   : 6.312  
 3rd Qu.: 0.559782   3rd Qu.:10.800   3rd Qu.:65.25   3rd Qu.: 5.000  
 Max.   : 2.963658   Max.   :14.300   Max.   :98.00   Max.   :98.000  
     GAPDH                CD74              TP53               Fli1          
 Min.   :-2.391021   Min.   :-1.7902   Min.   :-1.64244   Min.   :-3.723143  
 1st Qu.:-0.662164   1st Qu.:-0.5502   1st Qu.:-0.53685   1st Qu.:-0.559783  
 Median : 0.003236   Median :-0.1546   Median :-0.10928   Median :-0.002154  
 Mean   : 0.015759   Mean   : 0.0235   Mean   :-0.02285   Mean   :-0.016555  
 3rd Qu.: 0.632472   3rd Qu.: 0.4759   3rd Qu.: 0.29869   3rd Qu.: 0.554521  
 Max.   : 3.512741   Max.   : 3.7226   Max.   : 5.39242   Max.   : 4.415324  
      LDH              SQSTM0            BM_BLAST         CCND3         
 Min.   :    8.4   Min.   :-1.56639   Min.   : 0.00   Min.   :-2.44772  
 1st Qu.:  562.8   1st Qu.:-0.48762   1st Qu.:31.00   1st Qu.:-0.59042  
 Median :  952.5   Median :-0.05746   Median :46.50   Median :-0.08412  
 Mean   : 1617.9   Mean   :-0.02272   Mean   :50.64   Mean   :-0.02506  
 3rd Qu.: 1596.8   3rd Qu.: 0.33718   3rd Qu.:72.00   3rd Qu.: 0.48353  
 Max.   :36708.0   Max.   : 3.59456   Max.   :98.00   Max.   : 3.58761  
      GAB2                BAX               ABS_BLST           PRKAA1_2       
 Min.   :-3.201564   Min.   :-3.230737   Min.   :     0.0   Min.   :-1.90671  
 1st Qu.:-0.640279   1st Qu.:-0.593174   1st Qu.:    99.8   1st Qu.:-0.66896  
 Median : 0.007018   Median :-0.106097   Median :  1836.5   Median :-0.10586  
 Mean   :-0.013284   Mean   :-0.007051   Mean   : 15024.1   Mean   :-0.03531  
  3rd Qu.: 0.635609   3rd Qu.: 0.588098   3rd Qu.: 11178.0   3rd Qu.: 0.44906  
 Max.   : 2.396419   Max.   : 3.439942   Max.   :358080.0   Max.   : 3.60037  
    RPS6KB1              SPP1               MAPT                ARC          
 Min.   :-2.05490   Min.   :-1.39728   Min.   :-1.826101   Min.   :-2.67081  
 1st Qu.:-0.65736   1st Qu.:-0.55249   1st Qu.:-0.573470   1st Qu.:-0.63908  
 Median :-0.12007   Median :-0.15156   Median :-0.046306   Median :-0.07845  
 Mean   :-0.06115   Mean   : 0.02759   Mean   : 0.009604   Mean   :-0.01082  
 3rd Qu.: 0.49055   3rd Qu.: 0.35195   3rd Qu.: 0.495406   3rd Qu.: 0.63638  
  Max.   : 3.54709   Max.   : 4.44919   Max.   : 4.063568   Max.   : 2.52810  
  FOXO1_pT24_FOXO3_pT32      ATG7               SMAD5           RPS6_pS235_236
  Min.   :-2.2506       Min.   :-2.916018   Min.   :-2.321250   Min.  
:-3.229642
  1st Qu.:-0.5590       1st Qu.:-0.621408   1st Qu.:-0.480924   1st
Qu.:-0.633143
  Median :-0.1281       Median : 0.003795   Median : 0.003186   Median
:-0.003345
  Mean   :-0.0121       Mean   :-0.029199   Mean   : 0.010663   Mean  
:-0.015028
  3rd Qu.: 0.4415       3rd Qu.: 0.590721   3rd Qu.: 0.502824   3rd Qu.:
0.660224
  Max.   : 3.1239       Max.   : 2.757525   Max.   : 2.238310   Max.   :
2.816105
      LSD1             HDAC2               SFN           
  Min.   :-4.5434   Min.   :-2.40682   Min.   :-2.082592  
  1st Qu.:-0.5654   1st Qu.:-0.49165   1st Qu.:-0.554862  
  Median : 0.0477   Median : 0.02828   Median :-0.116248  
  Mean   :-0.0165   Mean   : 0.04641   Mean   : 0.006255  
  3rd Qu.: 0.6044   3rd Qu.: 0.47826   3rd Qu.: 0.506527  
  Max.   : 1.8736   Max.   : 4.22724   Max.   : 3.733445

library (survival)
library (pec)
library (glmnet)

##-----------------------------------------------------------------------------------
-------------------
   ##                                    FEATURE SELECTION BY GLMNET-LASSO

##----------------------------------------------------------------------------------
---------------------
   ## 
   ## This function takes a predictor matrix and a Surv obj and uses the glmnet
lasso      regularization
   ## method to select the most predictive features and create a coxph object
   ## predict_matrix is the 'dataset' reformated to replace categorical
variables to     binary with dummy

#creat Y (survival matrix) for glmnet
  surv_obj <- Surv(trainSet$time,trainSet$status) 

 ## tranform categorical variables into binary variables with dummy for trainSet
  predict_matrix <- model.matrix(~ ., data=trainSet, 
                                 contrasts.arg = lapply
(trainSet[,sapply(trainSet,     is.factor)], contrasts, contrasts=FALSE))

  ## remove the status/time variables from the predictor matrix (x) for glmnet
  predict_matrix <- subset (predict_matrix, select=c(-time,-status))

  ## create a glmnet cox object using lasso regularization
  glmnet.obj <- glmnet (predict_matrix, surv_obj, family="cox")

  # find lambda for which dev.ratio is max
  max.dev.index <- which.max(glmnet.obj$dev.ratio) 
      optimal.lambda <- glmnet.obj$lambda[max.dev.index] 

  # take beta for optimal lambda
  optimal.beta  <- glmnet.obj$beta[,max.dev.index] 

  # find non zero beta coef
  nonzero.coef <- abs(optimal.beta)>0
  selectedBeta <- optimal.beta[nonzero.coef] 

  # take only covariates for which beta is not zero 
  selectedVar   <- predict_matrix[,nonzero.coef] 

 # create a dataframe for trainSet with time, status and selected variables in
binary      representation for evaluation in pec
 reformat_dataSet <- as.data.frame(cbind(surv_obj,selectedVar))

  # create coxph object with pre-defined coefficients 
   glmnet.cox <- coxph(surv_obj ~ selectedVar,data =
reformat_dataSet,model=TRUE,      singular.ok=FALSE,init=selectedBeta,iter=0)

 ##------------------------------------------------------------------------------------
-------------------
  ##                                    MODEL PERFORMANCE

##-----------------------------------------------------------------------------------
--------------------
  ##
  set.seed(17743)
  pec.f <- as.formula(Hist(time,status) ~ 1)

   ## Brier score for the cox-glmnet model
   brierGlmnet <- pec(list(glmnet.cox), data= reformat_dataSet ,
splitMethod="BootCV",  B=50)

Here is the result:

Prediction matrix has wrong dimensions:
368 rows and 318 columns.           --->(but there are only 313 predicting
vars     in the dataset)
 But requested are predicted probabilities for
 138 subjects (rows) in newdata and 315 time points (columns) ---? (315=313
predicting     vars+status+time)
This may happen when some covariate values are missing in newdata!?
Warning in FUN(1:2[[2L]], ...) :

Oddly, when I call print on the glmnet.cox object I get that it has 313
variables and not as pec outputs (318) print(glmnet.cox) Likelihood ratio test=0
on 313 df, p=1 n= 368, number of events= 288

Here is Mark's response:

I suspect that you are going to need to contact the author/maintainer of the pec
package for detailed assistance. From a search of the R list archives, there are
not that many posts pertaining to the package, which is generally an indication
of the breadth of usage within the community. The lack of relevant posts would
likely correlate to the low number of users who may be in a position to assist
you. Thus, the package author/maintainer in this case, will likely be the best
and most expedient resource.

I hope that you find the above useful in some manner.

Regards,

Marc

-----Original Message-----
From: Marc Schwartz [mailto:marc_schwartz at me.com]
Sent: Saturday, July 06, 2013 8:46 AM
To: E Joffe
Cc: r-help at r-project.org
Subject: Re: [R] coxph won't converge when including categorical (factor)
variables

On Jul 6, 2013, at 7:04 AM, E Joffe <ejoffe at hotmail.com> wrote:
> Hello,
> 
> 
> 
> [rephrasing and reposting  of a previous question (that was not
> answered) with new information]
> 
> 
> 
> I have a dataset of 371 observations.
> 
> When I run coxph with numeric variables it works fine.
> 
> However, when I try to add factor (categorical) variables it returns 
> "Ran out of iterations and the model did not converge"
> 
> 
> 
> Of note, when I restructure all factors to binary variables with dummy 
> and use glmnet-lasso the model converges.
> 
> 
> 
> Here are examples of the code and output (including summary 
> description of the variables):
> 
>> maxSTree.cox <- coxph (Surv(time,status)~Chemo_Simple, data=dataset)
> 
> 
> 
> Warning message:
> 
> In fitter(X, Y, strats, offset, init, control, weights = weights,  :
> 
>  Ran out of iterations and did not converge
> 
> 
> 
>> summary (dataset$Chemo_Simple)
> 
>                        Anthra_HDAC       Anthra_Plus       ArsenicAtra
> ATRA           ATRA_GO 
> 
>                0               163                 2                12
> 0                 2 
> 
>         ATRA_IDA Demeth_HistoneDAC          Flu_HDAC     Flu_HDAC_plus
> HDAC_Clof         HDAC_only 
> 
>                0                34                37                 4
> 24                 1 
> 
>        HDAC_Plus        LowArac       LowDAC_Clof         MYLO_IL11
> Phase1
> 
>                4                 8                30                 5
> 5
> 
>              SCT    StdARAC_Anthra      StdAraC_Plus          Targeted
> VNP40101M
> 
>                0                 0                 0                13
> 23
> 
> 
> 
> 
> 
> HELP !!!!
> 

You have 371 observations, but did not indicate how many events you have in that
dataset. A cross tabulation of the 'status' factor with Chemo_Simple is
likely to be enlightening to get a sense of the distribution of events for each
level of Chemo_Simple.

Chemo_Simple has a number of levels with rather low counts (ignoring the levels
with 0's for the moment), which is likely to be a part of the problem. There
is likely to be an issue fitting the model for some of these levels, bearing in
mind that your "reference level" of Anthra_HDAC has a large proportion
of the observations and with the default treatment contrasts, each of the other
levels will be compared to it.

You should also run:

  dataset$Chemo_Simple <- factor(dataset$Chemo_Simple)

to get rid of the unused factor levels. A quick check of the coxph() code
versus, for example, lm()/glm(), reveals that while the latter will drop unused
factor levels when creating the internal model dataframe, the former will not. A
check of some test output here with coxph() suggests that the unused factor
levels will simply appear as NA's in the coxph() output without affecting
the levels that are present, but it will be cleaner to remove them a priori.

It is also likely that you don't have enough events to handle the effective
covariate degrees of freedom that this single factor model will have. By my
count (the formatting above is corrupted), you have 16 non-zero factor levels,
which would be 15 covariate degrees of freedom consumed by this single factor. A
general rule of thumb for Cox models (and logistic regression) is to have 20
events per covariate degree of freedom to avoid overfitting, which means that
you would really need 300 "events". In this context, events are the
smaller count of the two possible levels of "status". Since you only
have 371 total observations, there is no way that you could have 300 events,
since you would need to have at least 600 observations. Thus, it is likely that
you are attempting to overfit the model to the data as well.

The issue of using glmnet on a matrix of separate dummy variables and it working
is likely to be an outcome of the LASSO method penalizing/shrinking the factor
levels that are irrelevant to have coefficients of 0. Thus, I would envision
that the effective number of your factor levels is reduced as a consequence,
allowing the model to be fit.

You likely need to consider collapsing some of the low count factor levels into
an "Other" category, if it makes contextual sense to do so for your
data.

Regards,

Marc Schwartz

R help - Jul 2013 - coxph won't converge when including categorical (factor) variables

[R] coxph won't converge when including categorical (factor) variables

[R] coxph won't converge when including categorical (factor) variables

[R] coxph won't converge when including categorical (factor) variables

[R] coxph won't converge when including categorical (factor) variables

[R] coxph won't converge when including categorical (factor) variables