Geraldine Henningsen
2008-Dec-23 19:08 UTC
[R] Interval censored Data in survreg() with zero values!
Hello,

I have interval censored data, censored between (0, 100). I used the tobit function in the AER package, which in turn relies on survreg. I'm struggling with the choice of distribution. The data are asymmetrically distributed, so my first choice would be a Weibull distribution. Unfortunately the Weibull doesn't allow zero values in time data, as it requires x > 0. So I tried the exponential distribution, which allows x >= 0, and the log-normal, which sets x <= 0 to 0. Still I get the same error:

    Error in survreg(formula = Surv(ifelse(A16_1_1 >= 100, 100, ifelse(A16_1_1 <= :
      Invalid survival times for this distribution

The only distributions that seem to work are gaussian and logistic, but they don't really fit the data. I searched the archive for this problem and found a suggestion by Terry Therneau to set all 0 to NA and apply the Weibull afterwards. But this solution is not very satisfying, as it eliminates the left censored data from the dataset.

So I have three questions:

1. Does anybody know why the lognormal and exponential distributions don't work in survreg?

2. What else could I do to find a distribution that fits the data well?

3. What about the non-parametric approach in survfit(), could that be a solution?

I hope my questions aren't too stupid, as I'm not much of a statistician.

Regards,

Geraldine
Don MacQueen
2008-Dec-23 19:35 UTC
[R] Interval censored Data in survreg() with zero values!
Surv() allows left, right, or interval censoring. Try left censoring instead of interval censoring. For the weibull or lognormal, think of your data as <=100 instead of [0,100].

-Don

At 8:08 PM +0100 12/23/08, Geraldine Henningsen wrote:
>I have interval censored data, censored between (0, 100). I used the
>tobit function in the AER package which in turn backs on survreg.
>[...]

--
Don MacQueen
Environmental Protection Department
Lawrence Livermore National Laboratory
Livermore, CA, USA
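As a rough illustration of the left-censoring coding mentioned above, here is a minimal sketch. The data values and the reporting limit of 5 are made up for the example and are not taken from the thread; only Surv() and survreg() from the survival package are used.

```r
library(survival)

# Hypothetical measurements with an assumed lower reporting limit of 5:
# a recorded 5 means "the true value is somewhere at or below 5".
y        <- c(5, 5, 12.3, 40.1, 7.8, 5, 66.0, 23.4)
observed <- as.numeric(y > 5)   # 1 = exact value, 0 = left censored at 5

# type = "left" tells Surv() that event == 0 marks a left-censored value
s <- Surv(y, event = observed, type = "left")
print(s)

# Intercept-only Weibull fit; every recorded value is > 0, so the
# log transform used internally for the Weibull is well defined
fit_left <- survreg(s ~ 1, dist = "weibull")
print(coef(fit_left))
```

The point of the sketch is only that the censoring limit itself must be a positive number for the Weibull or lognormal; the censored observations stay in the data.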
Achim Zeileis
2008-Dec-23 21:38 UTC
[R] Interval censored Data in survreg() with zero values!
On Tue, 23 Dec 2008, Geraldine Henningsen wrote:

> [...]
> 1. Does anybody know why the lognormal and exponential distribution
> don't work in survreg?

For these distributions, observations left-censored at zero are rather unlikely to occur: pexp(0) = plnorm(0) = 0, i.e. such observations have probability zero.

> 2. What else could I do to find a distribution that fits the data well?
>
> 3. What about the non-parametric approach in survfit(), could that be a
> solution?

Both probably depend on the questions you want to ask about your data. For the tools implemented in "survival", the "Modeling Survival Data" book by Therneau and Grambsch is the natural reference.

hth,
Z
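The point about pexp(0) and plnorm(0) can be checked directly in base R. The sketch below evaluates the probability of a value at or below zero under several distributions; the parameter values are arbitrary choices for illustration.

```r
# Probability of observing a value <= 0 under each candidate distribution.
# Parameter values are arbitrary and chosen only for illustration.
p_zero <- c(
  exponential = pexp(0, rate = 1),
  lognormal   = plnorm(0, meanlog = 0, sdlog = 1),
  weibull     = pweibull(0, shape = 1.5, scale = 1),
  gaussian    = pnorm(0, mean = 1, sd = 1),
  logistic    = plogis(0, location = 1, scale = 1)
)
print(p_zero)
```

The three distributions with support on the positive half-line put no mass at or below zero, so an observation left-censored at zero contributes a likelihood of zero; the gaussian and logistic put positive mass there, which is consistent with those two being the only ones that ran.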
Terry Therneau
2008-Dec-24 16:50 UTC
[R] Interval censored Data in survreg() with zero values!
The problem is that you are not coding your data the way that I would; program authors do not always anticipate what others will do!

The Weibull distribution has support on (0, infinity). Using Surv(t1, t2, type='interval2'), you can have

  a left censored observation, where time of event < t: represented as (NA, t)
  a right censored observation, where time of event > t: represented as (t, NA)
  an interval censored observation, where t1 <= time <= t2: represented as (t1, t2)

Notice that the NA is just a trick of representation; it does not cause something to be omitted from the data.

The survreg code assumes that the third form will only be used when t1 and t2 are strictly within the range of the data. (Which is how I would do it, so is of course the OBVIOUS thing anyone would do :-) For a Weibull with event between 0 and t, the left-censored representation (NA, t) and the interval censored representation (0, t) are mathematically equivalent, but the program doesn't like the second form.

This is on my list of enhancements to add. It may even get to the top of the list someday.

Terry Therneau
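The three representations described above can be sketched as follows. The simulated data, the limits lo and hi, and all variable names are hypothetical and only serve to show the (NA, t), (t, NA) and exact (t, t) coding with type = "interval2".

```r
library(survival)

set.seed(42)
n <- 300
d <- data.frame(x = rnorm(n))
# Hypothetical latent outcome: Weibull with shape 2 and log-scale 3 + 0.4 * x
z <- rweibull(n, shape = 2, scale = exp(3 + 0.4 * d$x))

lo <- 5    # assumed lower reporting limit
hi <- 40   # assumed upper reporting limit

# interval2 coding:
#   (NA, lo) -> left censored,  true value somewhere below lo
#   (hi, NA) -> right censored, true value somewhere above hi
#   (z,  z)  -> exactly observed
d$lower <- ifelse(z < lo, NA_real_, pmin(z, hi))
d$upper <- ifelse(z > hi, NA_real_, pmax(z, lo))

fit_wb <- survreg(Surv(lower, upper, type = "interval2") ~ x,
                  data = d, dist = "weibull")
print(coef(fit_wb))
```

No row is dropped by this coding: the NA only marks which side of the interval is open, so the censored observations still contribute to the likelihood.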
Terry Therneau
2008-Dec-29 14:17 UTC
[R] Interval censored Data in survreg() with zero values!
--- begin included ---
My endogenous variable is not a time-dependent variable but percentages, which are naturally censored in the interval [0,100]. Unfortunately many data points are exactly 0 or 100. The rest of the data is asymmetrically distributed. So I would like to apply a two-limit tobit, regressing the percentage (endogenous variable) on several explanatory variables.
--- end included ---

Censoring is a limit in the observation process: right censored at 100 means that "the true y value is > 100, but we did not observe the exact value". You have binomial data with 0 <= y <= 100, which is not a constraint on the observation process. You should be using glm with a binomial family.

Terry T
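A minimal sketch of what the glm suggestion could look like, under the assumption that each percentage arises as k successes out of a known number of trials. The simulated data, the trial count of 20, and the coefficient values are all hypothetical.

```r
set.seed(7)
n_obs  <- 400
x      <- rnorm(n_obs)
trials <- rep(20, n_obs)                 # assumed number of trials per subject
p      <- plogis(-0.5 + 1.0 * x)         # hypothetical success probability
k      <- rbinom(n_obs, size = trials, prob = p)

# Binomial GLM with logit link; the response is the two-column matrix
# (successes, failures), so values of 0% and 100% need no special handling
fit_glm <- glm(cbind(k, trials - k) ~ x, family = binomial)
print(coef(fit_glm))
```

This formulation only applies when the percentages really are counts out of a known total; for continuous shares the trial counts do not exist and the model is not directly usable.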
Geraldine Henningsen
2009-Jan-06 16:37 UTC
[R] Interval censored Data in survreg() with zero values!
Terry Therneau wrote:
> Apologies -- you are being more subtle than I thought. Nevertheless, I think
> that the censoring language isn't quite right.
>
> You are thinking of a hierarchical model:
>
> z ~ N(Xb, sigma), where Xb is the linear predictor, whatever covariates you
> think belong in the model. Whether the distribution should be Gaussian or
> something else depends not on the overall distribution of z, but on the
> distribution of (z | Xb). We could have a skewed predictor leading to skewed z,
> even if the distribution about any given expectation is symmetric.
>
> y = F(z) is what you observe. The classic tobit model is y = max(0, z), which
> does lead to censored data.
>
> In your case y_i = Binomial(n_i, p_i = H(z)). Note a binomial is k heads
> out of n tries with a coin of probability p; a "Bernoulli" is a binomial
> restricted to a single coin flip. From the way you wrote the problem I assumed
> that there is some number of n "looks" at the subject and then you count them
> up. Note that var(y) = n p (1-p).
>
> H describes how the probability changes with z. In biology we very rarely
> use H(z) = max(min(z, 1), 0) because it gives a hard threshold, and the
> probability of nearly anything doesn't go all the way to zero or one.
>
> If
>   1. H were as above, and
>   2. var(y) = constant, and
>   3. n is sufficiently large so that the Binomial dist is approx Gaussian, and
>   4. var(y | p) << var(z | Xb)
> then your y will fit a censored Gaussian. Since at least the second is false,
> it doesn't.
>
> A censored model may still be an ok first cut at fitting the data, but I
> would be suspicious of variance estimates and particularly of any p-values. The
> bootstrap could help that.
>
> Terry T.

@ Terry: thank you very much for the extended explanation. I will try out your suggestion.

Geraldine
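For reference, a two-limit tobit with a Gaussian latent variable can be sketched directly with survreg, together with a simple nonparametric bootstrap of the slope as a check on the model-based standard error. Everything below (the simulated data, the coefficients, the 200 bootstrap replicates) is an assumption made for illustration, not the poster's actual data or model.

```r
library(survival)

set.seed(11)
n <- 300
d <- data.frame(x = rnorm(n))
z   <- 50 + 40 * d$x + rnorm(n, sd = 30)   # hypothetical latent variable
d$y <- pmin(pmax(z, 0), 100)               # observed value, clamped to [0, 100]

# interval2 coding: (NA, 0) left censored, (100, NA) right censored, (y, y) exact
d$lower <- ifelse(d$y <= 0,   NA_real_, d$y)
d$upper <- ifelse(d$y >= 100, NA_real_, d$y)

fit_tobit <- survreg(Surv(lower, upper, type = "interval2") ~ x,
                     data = d, dist = "gaussian")

# Nonparametric bootstrap: resample rows, refit, keep the slope
boot_slope <- replicate(200, {
  idx <- sample(nrow(d), replace = TRUE)
  coef(survreg(Surv(lower, upper, type = "interval2") ~ x,
               data = d[idx, ], dist = "gaussian"))[["x"]]
})

print(c(estimate = coef(fit_tobit)[["x"]], boot_se = sd(boot_slope)))
```

With dist = "gaussian" no log transform is applied, so a censoring limit of exactly 0 causes no trouble. Comparing boot_se with the standard error reported by summary(fit_tobit) gives a rough indication of whether the model-based variance can be trusted.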
Geraldine Henningsen
2009-Jan-12 16:29 UTC
[R] Interval censored Data in survreg() with zero values!
Hello again,

I studied your suggestion, but I still disagree. You wrote: "From the way you wrote the problem I assumed that there is some number of n "looks" at the subject and then you count them up." But this is not the case. My data are clearly continuous quantities, not discrete choices. I know nothing about the underlying choice process; the only thing I know is the final share of one of three regimes. So sorry for the bad description of the problem.

So I will stick with my censored data model. Still, the hint about the p-values is very helpful, because I actually ran into this problem. So thank you for the hint.

Best,

Geraldine