Mohiuddin, Jahan
2012-Jul-05  18:12 UTC
[R] Confused about multiple imputation with rms or Hmisc packages
Hello,

I'm working on a Cox Proportional Hazards model for a cancer data set that has missing values for the categorical variable "Grade" in less than 10% of the observations. I'm not a statistician, but based on my readings of Frank Harrell's book it seems to be a candidate for using multiple imputation technique(s). I understand the concepts behind imputation, but using the functions in rms and Hmisc is confounding me. For instance, whether to use transcan or aregImpute.

Here is a sample of my data:
https://dl.dropbox.com/u/1852742/sample.csv

Drawing from Chapter 8 of Harrell's book, this is what I've been toying with:

# recurfree_survival_fromsx is survival time, rf_obs_sx codes for events as a binary variable.
# The CPH model I would like to fit, using Ograde_dx as the variable for overall grade at
# diagnosis, ord_nodes as an ordinal variable for the number of lymph nodes involved.
obj=with(mydata, Surv(recurfree_survival_fromsx,rf_obs_sx))
mod=cph(obj~ord_nodes+Ograde_dx+ERorPR+HER2_Sum,data=mydata,x=T,y=T)

# Impute missing data
mydata.transcan=transcan(~Ograde_dx+tumorsize+ord_nodes+simp_stage_path+afam+
                         Menopause+Age,imputed=T,n.impute=10)
summary(mydata.transcan)

The issues I have are:

a) In your opinion(s), should I even be imputing this data? Is it appropriate here?

b) Even after reading the help pages and Harrell's book, I'm not sure I used the correct imputation method, and whether I should be using transcan or aregImpute.

c) In the output of summary(transcan), is R-squared the best value to describe how reliably the function could predict Ograde_dx? What is an acceptable level?

d) Do I use the function fit.mult.impute to fit my final cph model?

I appreciate your help with this as it is a somewhat confusing topic. I hope I gave you all the information you need to answer my questions.

Sincerely,
Jahan
