thr3ads.net - R help - [R] What is the most useful way to detect nonlinearity in logistic regression? [Dec 2004]


Patrick Foley

2004-Dec-05 00:49 UTC

[R] What is the most useful way to detect nonlinearity in logistic regression?

It is easy to spot response nonlinearity in normal linear models using 
plot(something.lm).
However, plot(something.glm) produces artifactual peculiarities, since the
diagnostic residuals are constrained by the fact that y can only take
values 0 or 1.
What do R users find most useful in checking the linearity assumption of 
logistic regression (i.e. log-odds =a+bx)?

Patrick Foley
patfoley at csus.edu

Peter Dalgaard

2004-Dec-05 02:50 UTC


[R] What is the most useful way to detect nonlinearity in logistic regression?

Patrick Foley <patfoley at csus.edu> writes:
> It is easy to spot response nonlinearity in normal linear models using
> plot(something.lm).
> However plot(something.glm) produces artifactual peculiarities since
> the diagnostic residuals are constrained  by the fact that y can only
> take values 0 or 1.
> What do R users find most useful in checking the linearity assumption
> of logistic regression (i.e. log-odds =a+bx)?
Well, there's basically

        - grouping
        - higher-order terms
        - smoothed residuals

A simple technique is to include a variable _both_ as a continuous
term and cut up into a factor (as in ~ age + cut(age,seq(30,70,10))).
The model that you are fitting is a bit weird but it gives you a clean
test for omitting the grouped term. A somewhat nicer variant of the
same theme is to do a linear spline (or a higher order one for that
matter) with selected knots.
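
A minimal sketch of the grouping and linear-spline checks described above, run on simulated data. The data-generating model (log-odds quadratic in age), the sample size, the knots at 40, 50 and 60, and all object names are assumptions made for illustration only; the cut points are the ones from the formula above.

```r
set.seed(1)
n   <- 2000
age <- runif(n, 30, 70)
# Assumed truth for illustration: log-odds are quadratic, not linear, in age
y   <- rbinom(n, size = 1, prob = plogis(-1 + 0.005 * (age - 50)^2))
dat <- data.frame(y = y, age = age)

# Baseline: log-odds linear in age
fit_lin <- glm(y ~ age, family = binomial, data = dat)

# Grouping: age as a continuous term plus the same variable cut into a factor
fit_grp <- glm(y ~ age + cut(age, seq(30, 70, 10), include.lowest = TRUE),
               family = binomial, data = dat)

# Linear spline: the slope is allowed to change at the knots 40, 50 and 60
fit_spl <- glm(y ~ age + pmax(age - 40, 0) + pmax(age - 50, 0) + pmax(age - 60, 0),
               family = binomial, data = dat)

# Likelihood-ratio tests for dropping the extra terms
test_grp <- anova(fit_lin, fit_grp, test = "Chisq")
test_spl <- anova(fit_lin, fit_spl, test = "Chisq")
print(test_grp)
print(test_spl)
```

Each comparison has 3 degrees of freedom here; a small p-value says the extra terms are picking up something the straight line in age misses.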

Re. the smoothed residuals, you do need to be careful about the
smoother. Some of the "robust" ones will do precisely the wrong thing
in this context: You really are interested in the mean, not some
trimmed mean (which can easily amount to throwing away all your
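cases...). Here's an idea:

x <- runif(500)
y <- rbinom(500, size = 1, prob = plogis(x))
fit <- glm(y ~ x, family = binomial)
# loess smooth of the (deviance) residuals against x, with standard errors
xx <- predict(loess(resid(fit) ~ x), se = TRUE)
# smoothed residuals, plus a +/- 2 SE band drawn around zero
matplot(x, cbind(xx$fit, 2*xx$se.fit, -2*xx$se.fit), pch = 20)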

Not sure my money isn't still on the splines, though.

-- 
   O__  ---- Peter Dalgaard             Blegdamsvej 3  
  c/ /'_ --- Dept. of Biostatistics     2200 Cph. N   
 (*) \(*) -- University of Copenhagen   Denmark      Ph: (+45) 35327918
~~~~~~~~~~ - (p.dalgaard at biostat.ku.dk)             FAX: (+45) 35327907

John Fox

2004-Dec-05 04:41 UTC


[R] What is the most useful way to detect nonlinearity in logistic regression?

Dear Patrick,

Component+residual plots can be defined for generalized linear models
(including logistic regression) as for linear models, but they may require
smoothing for interpretation. See, e.g., the cr.plots() function in the car
package, which works with glm objects.
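
A rough base-R sketch of a component+residual plot for a logistic fit, assuming simulated data with a genuine x^2 term in the log-odds. It uses residuals(fit, type = "partial") rather than the car package, so it is not the car implementation; as far as I know, recent releases of car provide this plot under the name crPlots(). The smoother is run with iter = 0 so that it estimates a mean rather than a robust centre.

```r
set.seed(2)
x <- runif(1000, -2, 2)
# Assumed truth for illustration: the log-odds contain an x^2 term
y <- rbinom(1000, size = 1, prob = plogis(-1 + 0.5 * x + 0.8 * x^2))

fit <- glm(y ~ x, family = binomial)

# Partial residuals = working residuals + (centred) fitted term for x
pr <- residuals(fit, type = "partial")[, "x"]

plot(x, pr, pch = 20, col = "grey", ylab = "component + residual")
abline(lm(pr ~ x), lty = 2)               # least-squares line, for reference
lines(lowess(x, pr, iter = 0), lwd = 2)   # non-robust smooth (iter = 0)
```

If the smooth bends away from the dashed line, the linear term in x is suspect. The individual points are not informative for binary data; only the smooth is.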

I hope this helps,
 John

--------------------------------
John Fox
Department of Sociology
McMaster University
Hamilton, Ontario
Canada L8S 4M4
905-525-9140x23604
http://socserv.mcmaster.ca/jfox 
--------------------------------

(Ted Harding)

2004-Dec-05 12:59 UTC


[R] What is the most useful way to detect nonlinearity in lo

On 05-Dec-04 Patrick Foley wrote:
> It is easy to spot response nonlinearity in normal linear
> models using plot(something.lm).
> However plot(something.glm) produces artifactual peculiarities
> since the diagnostic residuals are constrained  by the fact
> that y can only take values 0 or 1.
> What do R users find most useful in checking the linearity
> assumption of logistic regression (i.e. log-odds =a+bx)?
> 
> Patrick Foley
> patfoley at csus.edu
The "most useful way to detect nonlinearity in logistic
regression" is:

a) have an awful lot of data
b) have the x (covariate) values judiciously placed.

Don't be optimistic about this problem. The amount of
information, especially about non-linearity, in the binary
responses is often a lot less than people intuitively expect.

This is an area where R can be especially useful for
self-education by exploring possibilities and simulation.

For example, define the function (for quadratic nonlinearity):

  testlin2 <- function(a, b, N) {
    x <- c(-1.0, -0.5, 0.0, 0.5, 1.0)   # design points
    lp <- a*x + b*x^2                    # true linear predictor, no intercept
    p <- plogis(lp)                      # same as exp(lp)/(1+exp(lp))
    n <- rep(N, length(x))               # N observations at each design point
    r <- rbinom(length(x), n, p)         # number of successes at each point
    resp <- cbind(r, n - r)              # (successes, failures) response for glm
    X <- cbind(x = x, x2 = x^2)
    summary(glm(formula = resp ~ X - 1,
            family = binomial), correlation = TRUE)
  }

This places N observations at each of (-1.0,-0.5,0.0,0.5,1.0),
generates the N binary responses at each point with probability p(x)
where log(p/(1-p)) = a*x + b*x^2, fits a logistic regression
forcing the "intercept" term to be 0 (so that you're not
diluting the info by estimating a parameter you know to be 0),
and returns the summary(glm(...)) from which the p-values
can be extracted:

The p-value for x^2 is testlin2(a,b,N)$coefficients[2,4]

You can run this function as a one-off for various values of
a, b, N to get a feel for what happens. You can run a simulation
on the lines of

  pvals<-numeric(1000);
  for(i in (1:1000)){
    pvals[i]<-testlin2(1,0.1,500)$coefficients[2,4]
  }

so that you can test how often you get a "significant" result.

For example, adopting the ritual "significant == P<0.05,
power = 80%", you can see a histogram of the p-values
over the conventional "significance breaks" with

hist(pvals,breaks=c(0,0.01,0.03,0.1,0.5,0.9,0.95,0.99,1),freq=TRUE)

and you can see your probability of getting a "significant" result
as e.g. sum(pvals < 0.05)/1000

I found that, with testlin2(1,0.1,N), i.e. a = 1.0, b = 0.1
corresponding to log(p/(1-p)) = x + 0.1*x^2 (a possibly
interesting degree of nonlinearity), I had to go up to N=2000
before I was getting more than 80% of the p-values < 0.05.
That corresponds to 2000 observations at each value of x, or
10,000 observations in all.
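
The simulation above can be wrapped up to show how the estimated power changes with N. This is only a sketch: pval_x2 and power_x2 are hypothetical helper names, the design and model are copied from testlin2, and 500 replicates per setting is an arbitrary choice, so the numbers will wobble from run to run.

```r
# p-value for the x^2 coefficient in one simulated data set
# (same design points and no-intercept model as testlin2)
pval_x2 <- function(a, b, N) {
  x <- c(-1.0, -0.5, 0.0, 0.5, 1.0)
  n <- rep(N, length(x))
  r <- rbinom(length(x), n, plogis(a * x + b * x^2))
  fit <- glm(cbind(r, n - r) ~ x + I(x^2) - 1, family = binomial)
  summary(fit)$coefficients[2, 4]
}

# Estimated power: share of simulated p-values below alpha
power_x2 <- function(a, b, N, nsim = 1000, alpha = 0.05) {
  mean(replicate(nsim, pval_x2(a, b, N)) < alpha)
}

set.seed(3)
Ns  <- c(250, 500, 1000, 2000)
pow <- sapply(Ns, function(N) power_x2(1, 0.1, N, nsim = 500))
print(data.frame(N = Ns, power = pow))
```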

Compare this with a similar test for non-linearity with
normally-distributed responses [exercise for the reader].
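
One possible way to set up that comparison, as a sketch only. The function name testlin2_normal is hypothetical, and the error standard deviation sigma = 1 is an assumption: the answer depends entirely on sigma, and there is no single "right" value that puts the normal and binary cases on a common scale.

```r
# Analogue of testlin2 for normal responses; sigma is an assumed error SD
testlin2_normal <- function(a, b, N, sigma = 1) {
  x <- rep(c(-1.0, -0.5, 0.0, 0.5, 1.0), each = N)   # N observations per point
  y <- rnorm(length(x), mean = a * x + b * x^2, sd = sigma)
  summary(lm(y ~ x + I(x^2) - 1))                    # no intercept, as before
}

set.seed(4)
pvals_normal <- replicate(500, testlin2_normal(1, 0.1, 500)$coefficients[2, 4])
power_normal <- mean(pvals_normal < 0.05)
print(power_normal)
```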

You can write functions similar to testlin2 for higher-order
nonlinearities, e.g. testlin3 for a*x + b*x^3, testlin23 for
a*x + b*x^2 + c*x^3, etc., (the modifications required are
obvious) and see how you get on. As I say, don't be optimistic!

In particular, run testlin3 a few times and see the sort of
mess that can come out -- in particular gruesome correlations,
which is why "correlation=TRUE" is set in the call to
summary(glm(...),correlation=TRUE).
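
A sketch of what testlin3 might look like, assuming the same five design points and no-intercept model as testlin2 with x^2 replaced by x^3:

```r
# Cubic analogue of testlin2: log(p/(1-p)) = a*x + b*x^3
testlin3 <- function(a, b, N) {
  x <- c(-1.0, -0.5, 0.0, 0.5, 1.0)
  n <- rep(N, length(x))
  r <- rbinom(length(x), n, plogis(a * x + b * x^3))
  X <- cbind(x = x, x3 = x^3)
  summary(glm(cbind(r, n - r) ~ X - 1, family = binomial), correlation = TRUE)
}

set.seed(5)
s3 <- testlin3(1, 0.1, 500)
print(s3$coefficients)   # estimates, standard errors, z values, p-values
print(s3$correlation)    # correlation between the two coefficient estimates
```

On this design x and x^3 are nearly collinear, so the two estimates come out strongly negatively correlated and neither is pinned down well on its own.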

Best wishes,
Ted.

--------------------------------------------------------------------
E-Mail: (Ted Harding) <Ted.Harding at nessie.mcc.ac.uk>
Fax-to-email: +44 (0)870 094 0861  [NB: New number!]
Date: 05-Dec-04                                       Time: 12:59:08
------------------------------ XFMail ------------------------------

(Ted Harding)

2004-Dec-05 13:54 UTC


[R] What is the most useful way to detect nonlinearity in lo

On 05-Dec-04 Ted Harding wrote:
> [...]
> For example, adopting the ritual "significant == P<0.05,
> power = 80%", you can see a histogram of the p-values
> over the conventional "significance breaks" with
> 
> hist(pvals,breaks=c(0,0.01,0.03,0.1,0.5,0.9,0.95,0.99,1),freq=TRUE)
Sorry for the typo! That should of course (if you were looking
closely) have been

hist(pvals,breaks=c(0,0.01,0.05,0.1,0.5,0.9,0.95,0.99,1),freq=TRUE)

Apologies to anyone who was misled by pasting in the bad version.
> and you can see your probability of getting a "significant" result
> as e.g. sum(pvals < 0.05)/1000
> [...]
Best wishes to all,
Ted.


--------------------------------------------------------------------
E-Mail: (Ted Harding) <Ted.Harding at nessie.mcc.ac.uk>
Fax-to-email: +44 (0)870 094 0861  [NB: New number!]
Date: 05-Dec-04                                       Time: 13:54:11
------------------------------ XFMail ------------------------------
