thr3ads.net - R help - [R] Help with hazard plots [Jul 2008]

Alan Cox

2008-Jul-31 21:29 UTC

[R] Help with hazard plots

Hello. I am hoping someone will be willing to help me understand something
about hazard plots created with muhaz(...). I have some background in
statistics (minor in grad school), but I haven't been able to figure out one
thing about hazard plots. I am using hazard plots to track customer
cancellations. I figure I can treat a cancellation as a "death", and
if someone is still a customer today, they're right censored. I know that a
hazard plot shows the probability that someone will cancel in month n given
that they're a customer in month n-1.


If a customer signs up on January 1st and cancels on January 2nd, we've had
what I thought was an intellectual but pointless debate about whether we count
that as being a customer for 1 month or 0 months. I thought the two plots would
be identical, except for a different X axis.


However, when I create the two plots, they are very different ... very, very
different. I've posted the two plots to Flickr:


http://flickr.com/photos/alancox/2720915878/in/photostream/ shows the plot where
the lifetime of a customer who signs up on Jan 1 and cancels on Jan 2 is 0.

http://flickr.com/photos/alancox/2720915904/in/photostream/ shows the plot where
the lifetime of a customer who signs up on Jan 1 and cancels on Jan 2 is 1.

My question is: Why are these two so different?  How do I know which is right?

The call that I'm making to produce the model is:

hazardV08 <- muhaz(nmc,s,max.time=max(nmc))


-- 
Alan Cox 
Director, User Experience 
iContact, Corp. 
p 919.459.1038 f 919.287.2475

Alan Cox

2008-Jul-31 21:37 UTC

I clicked "Send" before making sure I thanked anyone who took the time
to help me out.  Sorry about that.  To all who read or respond: thanks!


Marc Schwartz

2008-Aug-01 04:09 UTC

on 07/31/2008 04:29 PM Alan Cox wrote:
> Hello.  I am hoping someone will be willing to help me understand
> something about hazard plots created with muhaz(...).  I have some
> background in statistics (minor in grad school), but I haven't been
> able to figure one thing about hazard plots.  I am using hazard plots
> to track customer cancellations.  I figure I can treat a cancellation
> as a "death", and if someone is still a customer today, they're right
> censored.  I know that a hazard plot shows the probability that
> someone will cancel in month  n  given that they're a customer in
> month n-1 .
> 
> 
> If a customer signs up on January 1st and cancels on January 2nd,
> we've had what I thought was an intellectual but pointless debate
> about whether we count that as being a customer for 1 month or 0
> months.  I thought the two plots would be identical, except for a
> different X axis.
> 
> 
> However, when I create the two plots, they are very different ...
> very, very different.  I've posted the two plots to Flickr:
> 
> 
> http://flickr.com/photos/alancox/2720915878/in/photostream/ shows the
> plot where the lifetime of a customer who signs up on Jan 1 and
> cancels on Jan 2 is 0.
> 
> http://flickr.com/photos/alancox/2720915904/in/photostream/ shows the
> plot where the lifetime of a customer who signs up on Jan 1 and
> cancels on Jan 2 is 1.
> 
> My question is: Why are these two so different?  How do I know which
> is right?
> 
> The call that I'm making to produce the model is:
> 
> hazardV08 <- muhaz(nmc,s,max.time=max(nmc))

I suspect that there is more here than meets the eye.

Lacking your data and the actual code that you are using to generate the 
two different curves, this could be anything from the way in which you 
have coded/collapsed/truncated the event intervals, to the way in which 
muhaz() is fitting the smoothed curve to each of the two data sets.

The "correct" way to track the intervals would be to use a resolution of
days, which could be transformed into months and fractions thereof (e.g. 
by dividing days by 30.44) if you prefer. The day of sign-up would be 
Day 0 and each subsequent calendar day would increment the interval by 
one day.

So based upon your example above (sign-up on Jan 1, cancel on Jan 2), 
the customer would have an "event" on day 1 or 0.03285151 months.
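
As a rough sketch of that bookkeeping (the dates below are made up for 
illustration and are not from the original data), the intervals can be 
computed directly from Date objects. The 30.44 divisor is the one 
mentioned above; it is close to 365.25 / 12.

```r
# Hypothetical sign-up and cancellation dates (illustrative only)
signup <- as.Date(c("2008-01-01", "2008-01-15", "2008-03-01"))
cancel <- as.Date(c("2008-01-02", "2008-04-15", "2008-03-31"))

# The day of sign-up is day 0; each later calendar day adds one day
days <- as.numeric(cancel - signup)

# Convert days to months and fractions thereof
months <- days / 30.44

print(data.frame(signup, cancel, days, months))
```

The first row reproduces the Jan 1 / Jan 2 example: 1 day, or roughly 
0.0329 months.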

All of your censored events (clients that have not yet canceled) should 
have their intervals based upon their own Time 0 (sign-up day) to 
whatever date you are using as your end point. I am guessing that you 
might have some form of paid membership, such that as long as the 
customer is paying, they are considered active, as opposed to a customer 
who simply stops doing business with you and you don't know. If the 
customer is paying some type of monthly fee, for example, then you 
should really censor them based upon their last payment date, not 
today's date, since the last payment date is when you know that they are 
still a paying client.
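
A minimal sketch of that censoring rule, assuming a hypothetical table 
with one row per customer and columns signup, cancel (NA if not yet 
canceled) and last_payment. None of these names come from the original 
post.

```r
# Hypothetical customer records (illustrative only)
customers <- data.frame(
  signup       = as.Date(c("2008-01-01", "2008-02-10", "2008-03-05")),
  cancel       = as.Date(c("2008-01-02", NA, NA)),
  last_payment = as.Date(c("2008-01-01", "2008-07-10", "2008-06-05"))
)

# status: 1 = canceled (event), 0 = still active (right censored)
censored <- is.na(customers$cancel)
customers$status <- as.integer(!censored)

# End of observation: the cancellation date for events, otherwise the
# last payment date (the last time the customer was known to be active)
end_date <- customers$cancel
end_date[censored] <- customers$last_payment[censored]

# Interval in days from each customer's own Time 0
customers$time_days <- as.numeric(end_date - customers$signup)

print(customers)
```

The resulting time_days and status columns are the kind of pair that 
Surv() or muhaz() expects as its time and event-indicator arguments.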

This would be akin to a patient coming in for a follow-up contact, at 
which point you know they are still alive. Once they leave the office, 
you don't know if they are alive until the next actual contact, as they 
might be hit by a car walking to the parking lot.

Based upon your comments above, you appear to have information on a 
daily basis. If you are collapsing time into integer months, you are 
losing information. The kernel-based approach used by muhaz(), as I 
understand it, is highly sensitive to small datasets and to the 
granularity of the data, among other things.
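
One way to check whether the month-0 versus month-1 coding matters by 
itself is to compute the raw (unsmoothed) discrete hazard under both 
conventions. The sketch below uses simulated data and a hypothetical 
helper, discrete_hazard(); it is not the poster's data or code. In this 
simulation the two sets of hazards are identical apart from a shift of 
one month on the time axis, which would suggest that a large visual 
difference comes either from the smoothing step or from some other 
difference in how the two data sets were built.

```r
set.seed(1)

# Simulated lifetimes and follow-up lengths in days (illustrative only)
n <- 500
true_days <- rexp(n, rate = 1 / 200)
follow_up <- runif(n, min = 0, max = 720)
time_days <- pmin(true_days, follow_up)
status    <- as.integer(true_days <= follow_up)

# Two conventions for collapsing days into whole months
m0 <- floor(time_days / 30.44)  # Jan 1 -> Jan 2 counts as month 0
m1 <- m0 + 1                    # Jan 1 -> Jan 2 counts as month 1

# Hypothetical helper: events at time t divided by the number at risk at t
discrete_hazard <- function(time, status) {
  t_vals  <- sort(unique(time[status == 1]))
  at_risk <- sapply(t_vals, function(t) sum(time >= t))
  events  <- sapply(t_vals, function(t) sum(time == t & status == 1))
  data.frame(time = t_vals, hazard = events / at_risk)
}

h0 <- discrete_hazard(m0, status)
h1 <- discrete_hazard(m1, status)

# Same hazards, with the time axis shifted by one month
print(all.equal(h0$hazard, h1$hazard))
print(all(h1$time == h0$time + 1))
```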

You might want to review the online complement to MASS4 by Venables and 
Ripley here:

   http://www.stats.ox.ac.uk/pub/MASS4/#Complements

and review the section on survival analysis, which covers smoothing 
functions for survival.

You might also want to simply consider using a standard Kaplan-Meier 
non-parametric estimator using survfit() in the survival package. The 
function calls for your data should be something like:

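   library(survival)

   # Current versions of survfit() expect a formula; "~ 1" fits a
   # single curve for the whole data set
   summary(survfit(Surv(nmc, s) ~ 1))

and

   plot(survfit(Surv(nmc, s) ~ 1))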
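
For a self-contained illustration, here is a sketch on simulated data. 
The names nmc and s are reused from the call in the question, on the 
assumption that nmc holds the time as a customer in months and s is the 
event indicator (1 = canceled, 0 = censored); the data themselves are 
made up. The n.event and n.risk components of the survfit object give 
an unsmoothed hazard at each observed time, which can be compared 
against the smoothed muhaz() curve.

```r
library(survival)

set.seed(42)

# Simulated data (illustrative only)
n <- 200
true_days <- rexp(n, rate = 1 / 180)
follow_up <- runif(n, min = 30, max = 365)
nmc <- pmin(true_days, follow_up) / 30.44   # time in months
s   <- as.integer(true_days <= follow_up)   # 1 = event, 0 = censored

# Kaplan-Meier fit for the whole sample
fit <- survfit(Surv(nmc, s) ~ 1)

# Unsmoothed hazard at each observed time: events / number at risk
raw_hazard <- fit$n.event / fit$n.risk

# Running sum of the raw hazards (Nelson-Aalen cumulative hazard)
cum_hazard <- cumsum(raw_hazard)

plot(fit$time, cum_hazard, type = "s",
     xlab = "Months as customer", ylab = "Cumulative hazard")
```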

HTH,

Marc Schwartz
