anthonywaldron
2012-May-30 14:50 UTC
[R] Survival with different probabilities of censoring
Dear all,

I have a fairly funky problem that I think demands some sort of survival analysis. There are two Red List assessments for mammals: 1986 and 2008. Some mammals changed their Red List status between those dates. Those changes can be regarded as "events" and are "interval censored" in the sense that we don't know at what point between 1986 and 2008 each species declined far enough to move into another category of extinction risk.

We then allocate fractional responsibility for each decline among the countries of the world and attempt to model factors in each country that might cause the species declines. For example, if a declining species is found in two countries and we decide that the countries share 50:50 responsibility for the decline, then the blame score of each of those countries is increased by 0.50 "species fractions".

The data set therefore looks like this:

Y variable: a set of non-integer values representing "blame scores", i.e. the sum of the fractions of status-changing species for which each country should be blamed. Note that these are mostly changes to a worse status, but some are changes for the better, i.e. negative values. There are also a lot of zeroes, for countries where no species changed status.

X variables: various things like governance, population density etc. The multivariate analysis also includes the total number of species fractions in each country (not just the species fractions that changed status). This term controls for the influence of total species richness on the number of species experiencing the event. Please note that total species richness also influences some of the other x variables, so it is included as an x term and should not be used as a simple scaling factor for the y term.

The fun part: the data are double censored. As I said, they might be considered interval censored. In addition, the countries with zero species fractions changing status by 2008 are right censored (all species do indeed eventually die).
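To make the blame-score bookkeeping described above concrete, here is a minimal sketch in Python. The function name `blame_scores`, the country labels and the example shares are all hypothetical and invented for illustration; the +1/-1 coding for changes to a worse or better status is likewise an assumption about how the sign is recorded, not the actual data format.

```python
from collections import defaultdict


def blame_scores(status_changes, all_countries=()):
    """Sum fractional responsibility for status changes per country.

    status_changes: iterable of (direction, shares) pairs, where direction
        is +1 for a change to a worse status and -1 for a change to a
        better one (assumed coding), and shares maps country -> fraction
        of responsibility (the fractions for one species sum to 1).
    all_countries: countries that should appear in the result even if no
        species changed status there (these get a score of 0.0).
    """
    scores = defaultdict(float)
    for country in all_countries:
        scores[country] += 0.0  # make sure zero-score countries are listed
    for direction, shares in status_changes:
        for country, fraction in shares.items():
            scores[country] += direction * fraction
    return dict(scores)


# Hypothetical example: three species that changed status, four countries.
changes = [
    (+1, {"A": 0.5, "B": 0.5}),    # decline shared 50:50 by A and B
    (+1, {"A": 1.0}),              # decline entirely within A
    (-1, {"B": 0.25, "C": 0.75}),  # improvement shared by B and C
]
scores = blame_scores(changes, all_countries=["A", "B", "C", "D"])
print(scores)
```

Country "D" illustrates the zero case: it has species but none changed status, so it enters the data set with a blame score of exactly zero.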
However, and this is the big problem, the probability of a zero is very different for each country. Countries with very few species fractions are far more likely to have zeroes, i.e. to be right censored. They are therefore far less informative regarding the influence of the x variables.

Indeed, there is a very clear pattern: if I run a normal regression on data that excludes the zeroes, I get a statistical expectation at y=0. If I then include the empirical y=0 values, the values that depart furthest from this expectation are exactly the ones that have the highest probability of being a zero at random (big residuals patterned in an S shape on the fitted values, as you would instinctively imagine will occur when plotting a range of y=0 points on a sloping regression line). Zeroes which are rather UNLIKELY to be zero at random represent countries where you would expect a species decline AND NONE HAPPENED, and those countries sit very close to the expectation.

Countries effectively fall further and further below the ability of my "instrument" to detect an effect as they become less and less species rich, until there are so few species that the probability of observing any event becomes tiny. Countries where the probability of an event being observed is tiny sit furthest from the expectation. If all countries are given equal weight, therefore, the noise from species-poor countries all but obscures the signal.

We've tried various approaches for zero-heavy data, but this increasingly looks like a survival analysis to me. The question is: how can I adapt a survival analysis so that it takes into account the different probability of censoring (the different random probability of being a zero) and downweights the uninformative zeroes? (With the added fun of double censoring, non-integer values and a small number of negative values.)
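To illustrate what "the probability of being a zero at random" means here, the following is a rough sketch under a strong simplifying assumption that is not part of the actual analysis: every species in a country independently changes status with the same probability `p_change` over the interval, so that a country with `n_species` species shows no change at all with probability (1 - p_change)^n_species. The function name and the numbers are hypothetical, and applying the formula to non-integer species fractions is only an approximation.

```python
def prob_zero_at_random(n_species, p_change):
    """Probability that a country shows no status change purely by chance.

    Assumes (for illustration only) that each of n_species species changes
    status independently with the same probability p_change.
    """
    if not 0.0 <= p_change <= 1.0:
        raise ValueError("p_change must lie between 0 and 1")
    if n_species < 0:
        raise ValueError("n_species must be non-negative")
    return (1.0 - p_change) ** n_species


# Hypothetical rate: 5% of species change status between assessments.
p = 0.05
for n in (1.5, 10, 50, 200):
    print(f"richness {n:>5}: P(zero at random) = {prob_zero_at_random(n, p):.4f}")
```

Under this assumption a species-poor country is almost certain to show a zero whatever its x variables are, while a zero in a species-rich country is very unlikely to arise by chance, which is the difference in informativeness described above.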
Remember that what we have is the number of species fractions changing status in a single time period, not the more usual "time to event".

Please also remember: although we can calculate the percentage of species fractions in each country that changed status, we can't really use percentage as the y variable, because the denominator (total fractional species richness) also affects the x variables. We therefore need to use the raw number of species fractions changing status.

I'm hoping that somebody experienced in survival analysis might have come across something like this before (including how to deal with the multiple censoring, the non-integer values and, importantly, the different informativeness of each censored data point).

Best regards,
Anthony Waldron
Universidade de Santa Cruz, Bahia, Brazil

--
View this message in context: http://r.789695.n4.nabble.com/Survival-with-different-probabilities-of-censoring-tp4631838.html
Sent from the R help mailing list archive at Nabble.com.