HelponR
2006-Apr-03 10:58 UTC
[R] Does logistic regression require the independence of samples?
Dear list:

Thanks a lot for your help. I have a question to which I could not easily find a clear answer.

When we use logistic regression to model one type of event as a proportion of a broader class of events, does the model assume that the total number of events is independent of the number of events of interest?

For example, if the counts of the event of interest and of all events are two time series that vary in the same way (both increase or decrease with time), can we still use logistic regression to estimate the effect of time on the proportion? If not, what is the right thing to do?

Many thanks,
Ben Bolker
2006-Apr-03 18:17 UTC
[R] Does logistic regression require the independence of samples?
HelponR <suncertain <at> gmail.com> writes:

> Dear list:
>
> Thanks a lot for help. I have a question and I could not find clear answers
> easily.
>
> When we do logistic regression for one type of events of interest as a
> proportion of a broader types of events, does the logistic regression assume
> that the number of whole types of events should be independent with the
> number of type of interest?
>
> For example, if one type of events and the whole type of events are two time
> series of count number, but they vary in a same fashion (both increase or
> decrease with time), can we still use logistic regression to figure out the
> time's effect on proportion? If not, what is right thing to do?

The answer to the general question in the subject is "yes" (logistic regression assumes independent observations, and its inference will be unreliable if the observations are correlated), but I think that in this particular case it's OK: for a multinomial sample, the number of each type is binomial conditional on the total number of all types.

I ran a numerical experiment to see whether the standard errors were appropriate for a simple example of this type (if the correlation were going to screw something up, it would more likely be the standard errors/confidence intervals than the point estimates):

  dosim <- function() {
    time <- sort(runif(200))
    ## total number of events increases with time
    nevents <- rpois(200, 10*time)
    ## events of interest: binomial given the total, with true slope 10
    type <- rbinom(200, size=nevents, prob=plogis(10*(time-0.5)))
    evmat <- cbind(type, nevents-type)
    m1 <- glm(evmat~time, family="binomial")
    coef(summary(m1))["time",]
  }
  set.seed(1001)
  r1 <- replicate(1000, dosim())
  rownames(r1)
  true <- 10
  ## does the nominal 95% Wald interval cover the true slope?
  cover <- (r1["Estimate",] < true+1.96*r1["Std. Error",] &
            r1["Estimate",] > true-1.96*r1["Std. Error",])
  sum(cover)/1000

The answer came out to 0.941, which seems reasonable for a nominal 95% confidence interval ...

I'm hoping/figuring that someone more knowledgeable will jump in with corrections if I've said something terribly wrong ...

Ben Bolker
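
A further small check of the conditional-binomial argument above, offered as a sketch rather than a definitive demonstration. It assumes the simplest possible setup, which is not taken from the original question: the events of interest and all other events are independent Poisson counts, with made-up rates lambda_interest and lambda_other. The total and the count of interest are then positively correlated, yet for a fixed total the count of interest should still look binomial with probability lambda_interest/(lambda_interest + lambda_other).

```r
## Hypothetical rates, chosen only for illustration
set.seed(1)
lambda_interest <- 3    # mean count of the event type of interest
lambda_other    <- 7    # mean count of all other event types
n_sim <- 1e5

interest <- rpois(n_sim, lambda_interest)
other    <- rpois(n_sim, lambda_other)
total    <- interest + other

## the count of interest and the total move together
print(cor(interest, total))

## condition on one particular total and look at the count of interest
n_fixed <- 10
x <- interest[total == n_fixed]

## success probability implied by the two rates
p <- lambda_interest / (lambda_interest + lambda_other)

## empirical mean and variance versus the binomial(n_fixed, p) values
result <- c(emp_mean = mean(x), binom_mean = n_fixed * p,
            emp_var  = var(x),  binom_var  = n_fixed * p * (1 - p))
print(round(result, 3))
```

If the empirical and binomial columns agree closely, that supports the point that a correlation between the total and the count of interest does not by itself violate the binomial assumption. It says nothing about other sources of dependence, such as serial correlation in the proportion itself.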