on 07/31/2008 04:29 PM Alan Cox wrote:
> Hello. I am hoping someone will be willing to help me understand
> something about hazard plots created with muhaz(...). I have some
> background in statistics (minor in grad school), but I haven't been
> able to figure one thing about hazard plots. I am using hazard plots
> to track customer cancellations. I figure I can treat a cancellation
> as a "death", and if someone is still a customer today, they're right
> censored. I know that a hazard plot shows the probability that
> someone will cancel in month n given that they're a customer in
> month n-1 .
>
>
> If a customer signs up on January 1st and cancels on January 2nd,
> we've had what I thought was an intellectual but pointless debate
> about whether we count that as being a customer for 1 month or 0
> months. I thought the two plots would be identical, except for a
> different X axis.
>
>
> However, when I create the two plots, they are very different ...
> very, very different. I've posted the two plots to Flickr:
>
>
> http://flickr.com/photos/alancox/2720915878/in/photostream/ shows the
> plot where the lifetime of a customer who signs up on Jan 1 and
> cancels on Jan 2 is 0.
>
> http://flickr.com/photos/alancox/2720915904/in/photostream/ shows the
> plot where the lifetime of a customer who signs up on Jan 1 and
> cancels on Jan 2 is 1.
>
> My question is: Why are these two so different? How do I know which
> is right?
>
> The call that I'm making to produce the model is:
>
> hazardV08 <- muhaz(nmc,s,max.time=max(nmc))

I suspect that there is more here than meets the eye.

Lacking your data and the actual code that you are using to generate the
two different curves, this could be anything from the way in which you
have coded/collapsed/truncated the event intervals, to the way in which
muhaz() is fitting the smoothed curve to each of the two data sets.

The "correct" way to track the intervals would be to use a resolution of
days, which could be transformed into months and fractions thereof (e.g.
by dividing days by 30.44) if you prefer. The day of sign-up would be
Day 0 and each subsequent calendar day would increment the interval by
one day.

So based upon your example above (sign-up on Jan 1, cancel on Jan 2),
the customer would have an "event" on day 1, or at 1/30.44 = 0.03285151
months.
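
As a rough sketch of the interval calculation described above, the
following uses made-up sign-up and cancellation dates (they are not
your data) and the 30.44 days-per-month divisor mentioned earlier:

```r
# Hypothetical sign-up and cancellation dates, for illustration only.
signup <- as.Date(c("2008-01-01", "2008-01-15", "2008-03-10"))
cancel <- as.Date(c("2008-01-02", "2008-04-15", "2008-03-25"))

# The sign-up day is Day 0, so the interval is the plain date difference.
days <- as.numeric(difftime(cancel, signup, units = "days"))

# Optional rescaling to months, using 30.44 days as an average month.
months <- days / 30.44

print(data.frame(signup, cancel, days, months))
```

The first row is the Jan 1 / Jan 2 customer, who ends up with an
interval of 1 day rather than a forced choice between 0 and 1 months.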

All of your censored events (clients that have not yet canceled) should
have their intervals based upon their own Time 0 (sign-up day) to
whatever date you are using as your end point. I am guessing that you
might have some form of paid membership, such that as long as the
customer is paying, they are considered active, as opposed to a customer
who simply stops doing business with you and you don't know. If the
customer is paying some type of monthly fee, for example, then you
should really censor them based upon their last payment date, not
today's date, since the last payment date is when you know that they are
still a paying client.

This would be akin to a patient coming in for a follow-up contact, at
which point you know they are still alive. Once they leave the office,
you don't know if they are alive until the next actual contact as they
might be hit by a car walking to the parking lot.
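
A minimal sketch of that censoring rule, assuming a table with one row
per customer. The column names (signup, cancel, last_payment) and the
values are hypothetical; time_days and status simply play the roles
that nmc and s play in your muhaz() call:

```r
# Hypothetical customer records, for illustration only.
# NA in `cancel` means no cancellation has been observed.
customers <- data.frame(
  signup       = as.Date(c("2008-01-01", "2008-02-10", "2008-03-05")),
  cancel       = as.Date(c("2008-01-02", NA, NA)),
  last_payment = as.Date(c("2008-01-01", "2008-07-10", "2008-06-05"))
)

# Status indicator: 1 = cancelled (event), 0 = right censored.
censored <- is.na(customers$cancel)
customers$status <- as.integer(!censored)

# End of follow-up: the cancellation date for events, otherwise the
# last payment date (the last time the customer was known to be active).
end_date <- customers$cancel
end_date[censored] <- customers$last_payment[censored]

# Interval in days from each customer's own Day 0 (sign-up).
customers$time_days <- as.numeric(
  difftime(end_date, customers$signup, units = "days")
)

print(customers)
```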

Based upon your comments above, you appear to have information on a
daily basis, so if you are collapsing time into integer months, you are
losing information. The kernel-based approach that is used by muhaz(),
as I understand it, is highly sensitive to small datasets and the
granularity of the data, among other things.
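
To get a feel for how much information is lost, here is a small
simulation in base R. The data are simulated (exponential with a mean
of about 90 days, an arbitrary choice), so this only illustrates the
granularity issue and does not reproduce your two plots:

```r
set.seed(1)
n <- 500

# Simulated whole days until cancellation (at least 1 day each).
days <- ceiling(rexp(n, rate = 1 / 90))

months_exact   <- days / 30.44          # fractional months, nothing lost
months_floor   <- floor(months_exact)   # Jan 1 -> Jan 2 counts as 0 months
months_ceiling <- months_floor + 1      # same customer counts as 1 month

# Number of distinct time values available to a smoother.
length(unique(months_exact))
length(unique(months_floor))

# Share of customers whose interval collapses to exactly zero.
mean(months_floor == 0)
```

After collapsing, the events sit on a handful of integer values, with a
noticeable share at exactly zero under the first coding, which is the
kind of input a kernel smoother may handle quite differently from the
daily data.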

You might want to review the online complement to MASS4 by Venables and
Ripley here:
http://www.stats.ox.ac.uk/pub/MASS4/#Complements
and review the section on survival analysis, which covers smoothing
functions for survival.

You might also want to simply consider using a standard Kaplan-Meier
non-parametric estimator using survfit() in the survival package. The
function calls for your data should be something like:

  library(survival)
  summary(survfit(Surv(nmc, s) ~ 1))

and

  plot(survfit(Surv(nmc, s) ~ 1))
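
If you want to try the Kaplan-Meier calls before running them on your
own data, here is a self-contained sketch on simulated data. The nmc
and s vectors below are generated for illustration (cancellation times
with a mean of 6 months, follow-up of up to 12 months, both arbitrary)
and are not your variables:

```r
library(survival)

set.seed(42)
n <- 200

# Simulated months until cancellation and months of available follow-up.
event_time  <- rexp(n, rate = 1 / 6)
censor_time <- runif(n, min = 0, max = 12)

# Observed time is whichever comes first; s = 1 marks an observed
# cancellation, s = 0 a right-censored customer.
nmc <- pmin(event_time, censor_time)
s   <- as.integer(event_time <= censor_time)

# Kaplan-Meier estimate for a single group (hence the "~ 1").
fit <- survfit(Surv(nmc, s) ~ 1)

# Estimated proportion still customers at selected months.
print(summary(fit, times = c(1, 3, 6, 9)))
```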

HTH,
Marc Schwartz