Patrick Foley
2004-Dec-05 00:49 UTC
[R] What is the most useful way to detect nonlinearity in logistic regression?
It is easy to spot response nonlinearity in normal linear models using plot(something.lm).
However plot(something.glm) produces artifactual peculiarities since the diagnostic residuals are constrained by the fact that y can only take values 0 or 1.
What do R users find most useful in checking the linearity assumption of logistic regression (i.e. log-odds =a+bx)?

Patrick Foley
patfoley at csus.edu
Peter Dalgaard
2004-Dec-05 02:50 UTC
[R] What is the most useful way to detect nonlinearity in logistic regression?
Patrick Foley <patfoley at csus.edu> writes:

> It is easy to spot response nonlinearity in normal linear models using
> plot(something.lm).
> However plot(something.glm) produces artifactual peculiarities since
> the diagnostic residuals are constrained by the fact that y can only
> take values 0 or 1.
> What do R users find most useful in checking the linearity assumption
> of logistic regression (i.e. log-odds =a+bx)?

Well, there's basically

- grouping
- higher-order terms
- smoothed residuals

A simple technique is to include a variable _both_ as a continuous term and cut up into a factor (as in ~ age + cut(age,seq(30,70,10))). The model that you are fitting is a bit weird but it gives you a clean test for omitting the grouped term. A somewhat nicer variant of the same theme is to do a linear spline (or a higher order one for that matter) with selected knots.

Re. the smoothed residuals, you do need to be careful about the smoother. Some of the "robust" ones will do precisely the wrong thing in this context: You really are interested in the mean, not some trimmed mean (which can easily amount to throwing away all your cases...). Here's an idea:

    x <- runif(500)
    y <- rbinom(500, size=1, prob=plogis(x))
    xx <- predict(loess(resid(glm(y~x, binomial)) ~ x), se=TRUE)
    matplot(x, cbind(xx$fit, 2*xx$se.fit, -2*xx$se.fit), pch=20)

Not sure my money isn't still on the splines, though.

--
   O__  ---- Peter Dalgaard             Blegdamsvej 3
  c/ /'_ --- Dept. of Biostatistics     2200 Cph. N
 (*) \(*) -- University of Copenhagen   Denmark      Ph: (+45) 35327918
~~~~~~~~~~ - (p.dalgaard at biostat.ku.dk)             FAX: (+45) 35327907
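The grouped-term check and the linear-spline variant described above might be sketched as follows. This is only an illustration on simulated data: the sample size, the quadratic "true" model, the knots at 40, 50 and 60, and the object names (fit_lin, fit_grp, fit_spl) are all assumptions, not something taken from the post.

```r
# Simulated data; names and the true model are assumptions for illustration.
set.seed(1)
n   <- 2000
age <- runif(n, 30, 70)
# True log-odds are quadratic in age, so a straight line is misspecified.
y   <- rbinom(n, size = 1, prob = plogis(-1 + 0.004 * (age - 50)^2))

# Linear-only fit versus linear term plus the same variable cut into a factor.
fit_lin <- glm(y ~ age, family = binomial)
fit_grp <- glm(y ~ age + cut(age, seq(30, 70, 10), include.lowest = TRUE),
               family = binomial)
# Likelihood-ratio test for omitting the grouped term.
lrt_grp <- anova(fit_lin, fit_grp, test = "Chisq")
print(lrt_grp)

# Linear-spline variant: piecewise-linear terms with knots at 40, 50, 60.
fit_spl <- glm(y ~ age + pmax(age - 40, 0) + pmax(age - 50, 0) +
                 pmax(age - 60, 0), family = binomial)
lrt_spl <- anova(fit_lin, fit_spl, test = "Chisq")
print(lrt_spl)
```

In both comparisons the extra terms cost three degrees of freedom; a small p-value in the "Pr(>Chi)" column suggests that the straight-line log-odds model is inadequate.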
John Fox
2004-Dec-05 04:41 UTC
[R] What is the most useful way to detect nonlinearity in logisticregression?
Dear Patrick,

Component+residual plots can be defined for generalized linear models (including logistic regression) as for linear models, but they may require smoothing for interpretation. See, e.g., the cr.plots() function in the car package, which works with glm objects.

I hope this helps,
 John

--------------------------------
John Fox
Department of Sociology
McMaster University
Hamilton, Ontario
Canada L8S 4M4
905-525-9140x23604
http://socserv.mcmaster.ca/jfox
--------------------------------
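For readers without the car package to hand, a component+residual plot for a logistic regression can be approximated in base R. The sketch below is not the car implementation; it assumes simulated data with a quadratic term deliberately left out of the fit, and it uses residuals(fit, type = "partial") from base R. The smoother is lowess() with iter = 0, so that no robustness iterations are applied, in line with the caution about "robust" smoothers earlier in the thread.

```r
# Simulated data (an assumption for illustration): the true log-odds contain
# a quadratic term that the fitted model omits.
set.seed(2)
n <- 1000
x <- runif(n, -2, 2)
y <- rbinom(n, size = 1, prob = plogis(0.5 * x + 0.5 * x^2))
fit <- glm(y ~ x, family = binomial)

# Partial residuals: working residuals plus each term's centred contribution.
# The result is a matrix with one column per model term (here just "x").
pres <- residuals(fit, type = "partial")

plot(x, pres[, "x"], pch = 20, col = "grey",
     ylab = "Component + residual")
abline(lm(pres[, "x"] ~ x), lty = 2)               # fitted linear component
lines(lowess(x, pres[, "x"], iter = 0), lwd = 2)   # non-robust smooth
```

If the smooth departs systematically from the dashed line, that points to nonlinearity in the log-odds for that covariate.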
(Ted Harding)
2004-Dec-05 12:59 UTC
[R] What is the most useful way to detect nonlinearity in lo
On 05-Dec-04 Patrick Foley wrote:

> It is easy to spot response nonlinearity in normal linear
> models using plot(something.lm).
> However plot(something.glm) produces artifactual peculiarities
> since the diagnostic residuals are constrained by the fact
> that y can only take values 0 or 1.
> What do R users find most useful in checking the linearity
> assumption of logistic regression (i.e. log-odds =a+bx)?
>
> Patrick Foley
> patfoley at csus.edu

The "most useful way to detect nonlinearity in logistic regression" is:

a) have an awful lot of data
b) have the x (covariate) values judiciously placed.

Don't be optimistic about this problem. The amount of information, especially about non-linearity, in the binary responses is often a lot less than people intuitively expect.

This is an area where R can be especially useful for self-education by exploring possibilities and simulation.

For example, define the function (for quadratic nonlinearity):

    testlin2 <- function(a, b, N) {
      x  <- c(-1.0, -0.5, 0.0, 0.5, 1.0)    # design points
      lp <- a*x + b*x^2                     # true linear predictor
      p  <- exp(lp)/(1 + exp(lp))           # true success probabilities
      n  <- N*c(1, 1, 1, 1, 1)              # N observations at each x
      r  <- rbinom(5, n, p)                 # successes at each x
      resp <- cbind(r, n - r)               # (successes, failures)
      X <- cbind(x, x^2); colnames(X) <- c("x", "x2")
      summary(glm(formula = resp ~ X - 1, family = binomial),
              correlation = TRUE)
    }

This places N observations at each of (-1.0, -0.5, 0.0, 0.5, 1.0), generates the N binary responses at each point with probability p(x) where log(p/(1-p)) = a*x + b*x^2, fits a logistic regression forcing the "intercept" term to be 0 (so that you're not diluting the info by estimating a parameter you know to be 0), and returns the summary(glm(...)) from which the p-values can be extracted. The p-value for x^2 is

    testlin2(a,b,N)$coefficients[2,4]

You can run this function as a one-off for various values of a, b, N to get a feel for what happens.
You can run a simulation on the lines of

    pvals <- numeric(1000)
    for (i in (1:1000)) {
      pvals[i] <- testlin2(1, 0.1, 500)$coefficients[2,4]
    }

so that you can test how often you get a "significant" result.

For example, adopting the ritual "significant == P<0.05, power = 80%", you can see a histogram of the p-values over the conventional "significance breaks" with

    hist(pvals,breaks=c(0,0.01,0.03,0.1,0.5,0.9,0.95,0.99,1),freq=TRUE)

and you can see your probability of getting a "significant" result as e.g.

    sum(pvals < 0.05)/1000

I found that, with testlin2(1,0.1,N), i.e. a = 1.0, b = 0.1 corresponding to log(p/(1-p)) = x + 0.1*x^2 (a possibly interesting degree of nonlinearity), I had to go up to N=2000 before I was getting more than 80% of the p-values < 0.05. That corresponds to 2000 observations at each value of x, or 10,000 observations in all.

Compare this with a similar test for non-linearity with normally-distributed responses [exercise for the reader].

You can write functions similar to testlin2 for higher-order nonlinearities, e.g. testlin3 for a*x + b*x^3, testlin23 for a*x + b*x^2 + c*x^3, etc. (the modifications required are obvious), and see how you get on. As I say, don't be optimistic!

In particular, run testlin3 a few times and see the sort of mess that can come out -- in particular gruesome correlations, which is why "correlation=TRUE" is set in the call to summary(glm(...),correlation=TRUE).

Best wishes,
Ted.

--------------------------------------------------------------------
E-Mail: (Ted Harding) <Ted.Harding at nessie.mcc.ac.uk>
Fax-to-email: +44 (0)870 094 0861  [NB: New number!]
Date: 05-Dec-04                                       Time: 12:59:08
------------------------------ XFMail ------------------------------
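One possible reading of the testlin3 modification mentioned above is sketched here. It is not Ted's own code: it simply swaps x^2 for x^3 in the same design, and the column name "x3", the seed and the example arguments are assumptions.

```r
# Cubic analogue of testlin2, assuming the same five design points and a
# no-intercept fit; plogis(lp) is the same as exp(lp)/(1 + exp(lp)).
testlin3 <- function(a, b, N) {
  x  <- c(-1.0, -0.5, 0.0, 0.5, 1.0)
  lp <- a * x + b * x^3                 # true linear predictor
  p  <- plogis(lp)                      # true success probabilities
  n  <- rep(N, length(x))               # N observations at each x
  r  <- rbinom(length(x), n, p)         # successes at each x
  resp <- cbind(r, n - r)               # (successes, failures)
  X <- cbind(x = x, x3 = x^3)
  summary(glm(resp ~ X - 1, family = binomial), correlation = TRUE)
}

set.seed(3)
s <- testlin3(1, 0.1, 500)
print(s$coefficients[2, 4])   # p-value for the x^3 term
print(s$correlation)          # correlation between the two estimates
```

Because x and x^3 are close to collinear over these design points, the off-diagonal entry of s$correlation tends to be large in magnitude, which is the kind of correlation the post warns about.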
(Ted Harding)
2004-Dec-05 13:54 UTC
[R] What is the most useful way to detect nonlinearity in lo
On 05-Dec-04 Ted Harding wrote:

> [...]
> For example, adopting the ritual "significant == P<0.05,
> power = 80%", you can see a histogram of the p-values
> over the conventional "significance breaks" with
>
> hist(pvals,breaks=c(0,0.01,0.03,0.1,0.5,0.9,0.95,0.99,1),freq=TRUE)

Sorry for the typo! That should of course (if you were looking closely) have been

    hist(pvals,breaks=c(0,0.01,0.05,0.1,0.5,0.9,0.95,0.99,1),freq=TRUE)

Apologies to anyone who was misled by pasting in the bad version.

> and you can see your probability of getting a "significant" result
> as e.g. sum(pvals < 0.05)/1000
> [...]

Best wishes to all,
Ted.

--------------------------------------------------------------------
E-Mail: (Ted Harding) <Ted.Harding at nessie.mcc.ac.uk>
Fax-to-email: +44 (0)870 094 0861  [NB: New number!]
Date: 05-Dec-04                                       Time: 13:54:11
------------------------------ XFMail ------------------------------
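The simulation discussed in these posts can be wrapped up to estimate the proportion of "significant" results at several values of N. The sketch below restates testlin2 in compact form so that it runs on its own. The helper power_at(), the grid of N values, the seed and the reduced number of replicates (200 rather than 1000, to keep the run short) are assumptions for illustration.

```r
# Compact restatement of testlin2 from the earlier post (same design and model).
testlin2 <- function(a, b, N) {
  x <- c(-1.0, -0.5, 0.0, 0.5, 1.0)
  p <- plogis(a * x + b * x^2)          # true success probabilities
  n <- rep(N, length(x))
  r <- rbinom(length(x), n, p)
  X <- cbind(x = x, x2 = x^2)
  summary(glm(cbind(r, n - r) ~ X - 1, family = binomial))
}

# Hypothetical helper: share of simulated p-values for x^2 below alpha.
power_at <- function(N, a = 1, b = 0.1, nsim = 200, alpha = 0.05) {
  pvals <- replicate(nsim, testlin2(a, b, N)$coefficients[2, 4])
  mean(pvals < alpha)
}

set.seed(4)
Ns  <- c(250, 500, 1000, 2000)
pow <- sapply(Ns, power_at)
print(data.frame(N = Ns, power = pow))
```

With only 200 replicates per N the estimates are noisy; raising nsim to 1000, as in the original loop, gives a steadier picture at the cost of run time.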