yongchuan
2006-Oct-23 16:38 UTC
[R] Construction of Dataset for time varying COXPH analysis
Question: When survfit() is used on a coxph object, the 'n' returned (n=6) is vastly smaller than the number of distinct loans in the dataset used.

I am trying to estimate a Cox proportional hazards model for a set of loans (over 6000) using time-varying covariates. For these 6000+ loans, I have some 62,000 different rows, each representing a loan during one period of time. I did the following:

    resultsOpt <- coxph(Surv(Start,Stop,PrepayDate) ~ closingCoupon + loanPurposeId, data=latest)

which returned:

    Call:
    coxph(formula = Surv(Start, Stop, PrepayDate) ~ closingCoupon + loanPurposeId, data = latest)

                   coef exp(coef) se(coef)    z       p
    closingCoupon 0.101      1.11   0.0271 3.73 1.9e-04
    loanPurposeId 0.434      1.54   0.0624 6.96 3.3e-12

    Likelihood ratio test=50.3 on 2 df, p=1.18e-11  n= 62297

which seems fair. However, when I do:

    > survfit(resultsOpt)
    Call: survfit.coxph(object = resultsOpt)

          n  events  median 0.95LCL 0.95UCL
          6     489     Inf     Inf     Inf

the n is 6, when the number of distinct loans in the dataset is more like 6554.

My dataset looks like the following when I call it from within R:

    > latest[1:5, 1:5]
      Start Stop PrepayDate modBalance closingCoupon
    1     6    7          0   811.2769          8.35
    2     7    8          0   811.2769          8.35
    3     8    9          1   811.2769          8.35
    4     4    5          0  2226.0825          8.70
    5     5    6          0  2226.0825          8.70

where the first 3 rows represent one loan and the next 2 rows a new one.

Am I putting the data in an incorrect format, and if so, how should I correct it?

Thanks much.
Pan