On Wed, 25 Oct 2006, yongchuan wrote:
> I've a data set with 60000 rows of data representing 6000+ distinct
> loans. I did a coxph() regression on it (see call below), but a subsequent
> survfit() call on the coxph object is almost certainly wrong. It gives n=6
> when it should be more like 6000+ (I think)
>
>> survfit(resultag)
> Call: survfit.coxph(object = resultag)
>
>       n  events  median 0.95LCL 0.95UCL
>       6     489     Inf       2     Inf
>
> When I reduced the dataset to just 1000 rows, the survfit()
> call on the coxph object looks more correct.
>
>> survfit(resulting)
> Call: survfit.coxph(object = resulting)
>
>       n  events  median 0.95LCL 0.95UCL
>     115      15     Inf     Inf     Inf
>
> Is there a limit to the size of the data set that I read in?
> Or am I just doing something silly above?
>
> (this is the coxph regression:
> resultag <- coxph(Surv(Start,Stop,PrepayDate)~modBalance +
>     closingCoupon+lienPosition +originalFICO,table)
>
You may be misunderstanding the `n` column in the output. If you read the
help for print.survfit you will find:
The "number of observations" is not well-defined for counting
process data. Previous versions of this code used the number at
risk at the first time point. This is misleading if many
individuals enter late or change strata. The original S code for
the current version uses the number of records, which is
misleading when the counting process data actually represent a
fixed cohort with time-dependent covariates.
Four possibilities are provided, controlled by 'print.n' or by
'options(survfit.print.n)': '"none"' prints 'NA', '"records"' prints
the number of records, '"start"' prints the number at the first time
point and '"max"' prints the maximum number at risk.
The initial default is '"start"'.
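
To make those definitions concrete, here is a small base-R sketch on made-up
counting-process data. The data frame `loans`, its column names and the helper
`at_risk` are hypothetical and not taken from the original post, and the sketch
assumes that "the first time point" means the first event time. It also counts
distinct ids, which is the figure the poster expected but is not one of the
four options described above.

```r
# Hypothetical data: 4 loans stored as 6 (Start, Stop] records.
# Loans 3 and 4 enter late, so the risk set grows after the first event.
loans <- data.frame(
  id    = c(1, 2, 2, 3, 4, 4),
  Start = c(0, 0, 6, 4, 5, 8),
  Stop  = c(2, 6, 12, 10, 8, 11),
  event = c(1, 0, 0, 1, 0, 1)
)

# Number at risk at time t: records whose interval (Start, Stop] contains t.
at_risk <- function(t) sum(loans$Start < t & t <= loans$Stop)

# Distinct event times, in increasing order.
event_times <- sort(unique(loans$Stop[loans$event == 1]))

n_records  <- nrow(loans)                        # analogue of "records"
n_start    <- at_risk(event_times[1])            # analogue of "start"
n_max      <- max(sapply(event_times, at_risk))  # analogue of "max"
n_subjects <- length(unique(loans$id))           # distinct loans

print(c(records = n_records, start = n_start,
        max = n_max, subjects = n_subjects))
```

With late entry the four numbers all differ here (6 records, 2 at risk at the
first event time, at most 3 at risk, 4 loans), which is the same kind of gap
as between n=6 and the 6000+ loans in the question.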
-thomas
Thomas Lumley Assoc. Professor, Biostatistics
tlumley at u.washington.edu University of Washington, Seattle