Dear List,

Apologies for this off-topic post, but it is R-related in the sense that I am trying to understand what R is telling me with the data to hand.

ROC curves have recently been used to determine a dissimilarity threshold for identifying whether two samples are from the same "type" or not. Given the bashing that ROC curves get whenever anyone asks about them on this list (and having implemented the ROC methodology in my analogue package), I wanted to try directly modelling the probability that two sites are analogues of one another for a given dissimilarity, using glm().

The data I have, then, are a logical vector ('analogs') indicating whether the two sites come from the same vegetation, and a vector of the dissimilarity between the two sites ('Dij'). These are in a csv file currently in my university web space. Each 'row' in this file corresponds to a single comparison between two sites.

When I analyse these data using glm() I get the familiar "fitted probabilities numerically 0 or 1 occurred" warning. The data do not look linearly separable when plotted (code for which is below). I have read Venables and Ripley's discussion of this in MASS4 and other sources that discuss this warning and R (Faraway's Extending the Linear Model with R and John Fox's new Applied Regression, Generalized Linear Models, and Related Methods, 2nd Ed), as well as some of the literature on Firth's bias-reduction method. But I am still somewhat unsure what (quasi-)separation is and whether it is the reason for the warnings in this case.

My question, then, is: is this a separation issue with my data, or is it the quasi-separation that I have read a bit about whilst researching this problem? Or is this something completely different?

Code to reproduce my problem with the actual data is given below. I'd appreciate any comments or thoughts on this.
#### Begin code snippet ################################################

## note data file is ~93Kb in size
dat <- read.csv(url("http://www.homepages.ucl.ac.uk/~ucfagls/dat.csv"))
head(dat)
## fit model --- produces warning
mod <- glm(analogs ~ Dij, data = dat, family = binomial)
## plot the data
plot(analogs ~ Dij, data = dat)
fit.mod <- fitted(mod)
ord <- with(dat, order(Dij))
with(dat, lines(Dij[ord], fit.mod[ord], col = "red", lwd = 2))

#### End code snippet ##################################################

Thanks in advance

Gavin
--
%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%
Dr. Gavin Simpson             [t] +44 (0)20 7679 0522
ECRC, UCL Geography,          [f] +44 (0)20 7679 0565
Pearson Building,             [e] gavin.simpsonATNOSPAMucl.ac.uk
Gower Street, London          [w] http://www.ucl.ac.uk/~ucfagls/
UK. WC1E 6BT.                 [w] http://www.freshwaters.org.uk
%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%
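For contrast with the thread's data, here is a tiny simulated example (not Gavin's dat.csv) in which the predictor completely separates the response. It triggers the same "fitted probabilities numerically 0 or 1 occurred" warning, but here the maximum likelihood estimate genuinely does not exist:

```r
## Simulated complete separation: y == 1 exactly when x > 0.5,
## and no x values fall between 0.4 and 0.6
set.seed(42)
x <- c(runif(25, 0, 0.4), runif(25, 0.6, 1))
y <- rep(0:1, each = 25)
m.sep <- glm(y ~ x, family = binomial)  # warns about fitted probs 0 or 1
coef(m.sep)                             # huge slope: the MLE is infinite
```

With separated data like this the reported slope is an artefact of where the iterations stopped, which is why the diagnostics discussed later in the thread matter.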
If you look at the distribution of Dij for those with analogs == TRUE versus the whole group, it is not surprising that the upper range of Dij results in very low probability estimates:

plot(density(dat$Dij))
lines(density(dat[dat$analogs == TRUE, 2]))

It appears that more than 25% of the predict(mod, type = "response") elements will satisfy all.equal(0, <predict>), which on my machine occurs when <predict> is (roughly) below 1e-9.

Best;

David Winsemius

On Dec 15, 2008, at 1:03 PM, Gavin Simpson wrote:
> [original message quoted in full --- snipped]
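David's point about the proportion of numerically-zero fitted probabilities can be sketched in a self-contained way. Since dat.csv lives only on Gavin's web space, the snippet below simulates a stand-in using roughly the coefficients (about 4.19 and -29.4) that the glm fit reports elsewhere in this thread:

```r
## Simulated stand-in for dat: Dij uniform, analogs Bernoulli with the
## logistic probability implied by the coefficients reported in the thread
set.seed(1)
Dij     <- runif(2000)
analogs <- rbinom(2000, 1, plogis(4.19 - 29.4 * Dij))
m <- glm(analogs ~ Dij, family = binomial)
p <- fitted(m)
mean(p < 1e-9)  # proportion of fitted probabilities "numerically 0"
```

The exact proportion depends on the distribution of Dij, so it will differ from the >25% David sees with the real data; the point is only that a steep but perfectly well-identified logistic curve produces many such values.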
Dear Gavin,

I do not know whether this comment is still useful. Why are you unsure about quasi-separation? I think it is quite evident in the plot

plot(analogs ~ Dij, data = dat)

It may also be useful to look at the plot of the monotone (profile) deviance (or the log-likelihood) for the coefficient of Dij:

xval <- seq(-20, 0, length = 50)
ll <- numeric(length(xval))
for (i in seq_along(xval)) {
    mod <- glm(analogs ~ offset(xval[i] * Dij), data = dat,
               family = binomial)
    ll[i] <- mod$deviance
}
plot(xval, ll)

Hope this helps you,
vito

Gavin Simpson ha scritto:
> [original message quoted in full --- snipped]

--
===================================
Vito M.R. Muggeo
Dip.to Sc Statist e Matem `Vianelli'
Università di Palermo
viale delle Scienze, edificio 13
90128 Palermo - ITALY
tel: 091 6626240
fax: 091 485726/485612
http://dssm.unipa.it/vmuggeo
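Vito's offset-profile idea can be tried self-contained on simulated data (standing in for dat.csv, with roughly the coefficients the thread's fit reports), and over a range wide enough to see whether the deviance turns back up:

```r
## Profile the deviance over the slope for Dij by fixing it via offset();
## a clear interior minimum means a finite MLE, i.e. no separation
set.seed(1)
Dij     <- runif(2000)
analogs <- rbinom(2000, 1, plogis(4.19 - 29.4 * Dij))
xval <- seq(-45, -15, length.out = 50)
dev <- sapply(xval, function(b)
    deviance(glm(analogs ~ offset(b * Dij), family = binomial)))
plot(xval, dev, type = "l")  # U-shape, minimum near b = -29
```

Under separation the deviance would instead keep decreasing towards an asymptote as the slope heads to -Inf, which is why a range that brackets the minimum matters.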
Dear Gavin,

glm reported exactly what it noticed, giving a warning that some very small fitted probabilities were found. However, your data are **not** quasi-separated. The maximum likelihood estimates really are those reported by glm.

A first, elementary check is to change the tolerance and maximum number of iterations in glm and see whether you get the same result:

> mod1 <- glm(analogs ~ Dij, family = binomial, data = dat,
+             control = glm.control(epsilon = 1e-16, maxit = 1000))
> mod1

Call:  glm(formula = analogs ~ Dij, family = binomial, data = dat,
    control = glm.control(epsilon = 1e-16, maxit = 1000))

Coefficients:
(Intercept)          Dij
      4.191      -29.388

Degrees of Freedom: 4033 Total (i.e. Null);  4032 Residual
Null Deviance:     1929
Residual Deviance: 613.5        AIC: 617.5

This is exactly the same fit as the one you have. If separation had occurred, the effects would usually diverge as glm is allowed more iterations.

**************

Secondly, an inspection of the estimated asymptotic standard errors reveals nothing to worry about:

> summary(mod1)

Call:
glm(formula = analogs ~ Dij, family = binomial, data = dat,
    control = glm.control(epsilon = 1e-16, maxit = 1000))

Deviance Residuals:
       Min          1Q      Median          3Q         Max
-1.676e+00  -1.319e-02  -1.250e-04  -1.958e-06   4.104e+00

Coefficients:
            Estimate Std. Error z value Pr(>|z|)
(Intercept)   4.1912     0.3248   12.90   <2e-16 ***
Dij         -29.3875     1.9345  -15.19   <2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 1928.62  on 4033  degrees of freedom
Residual deviance:  613.53  on 4032  degrees of freedom
AIC: 617.53

Number of Fisher Scoring iterations: 11

If separation had occurred, the estimated asymptotic standard errors would be unnaturally large. This is because, in the case of separation (quasi or not), glm calculates the standard errors by taking the square root of the diagonal elements of the inverse of minus the Hessian of the log-likelihood, at a point where the log-likelihood appears flat for the given tolerance.
**************

To be more certain, you could also try fitting with brglm, which is guaranteed to give finite estimates that have bias of smaller order than the MLE, and compare the results:

> library(brglm)
> mod.br <- brglm(analogs ~ Dij, data = dat, family = binomial)
> mod.br

Call:  brglm(formula = analogs ~ Dij, family = binomial, data = dat)

Coefficients:
(Intercept)          Dij
      4.161      -29.188

Degrees of Freedom: 4033 Total (i.e. Null);  4032 Residual
Deviance:           613.5448
Penalized Deviance: 610.2794        AIC: 617.5448

The estimates are similar, a bit shrunk towards the origin, which is natural for bias removal. If separation had occurred then, given the previous discussion, the bias-reduced estimates would be considerably different from the estimates that glm reports.

**************

Lastly, the surest way to check for separation is to inspect the profiles of the log-likelihood. Vito suggested this, but the chosen limits for xval are not appropriate: if separation had occurred the estimate would be -Inf, so the profiling as done in his email should start from, for example, -40 rather than -20. This would reveal that the profile deviance starts increasing again, whereas under separation there would be an asymptote on the left. Below I give the correct profiles, as reported by profileModel:

> library(profileModel)
> pp <- profileModel(mod1, quantile = qchisq(0.95, 1),
+                    objective = "ordinaryDeviance")
Preliminary iteration .. Done
Profiling for parameter (Intercept) ... Done
Profiling for parameter Dij ... Done
> plot(pp)

The profiles are quite quadratic. In the case of separation you would have seen asymptotes on the left or on the right (see help(profileModel) for an example).
**************

It appears that the fitted logistic curve, while steep, still has a finite gradient; for example, at the LD50 point:

> library(MASS)
> dose.p(mod)
             Dose          SE
p = 0.5: 0.1426167 0.003646903

When separation occurs the LD50 point cannot be identified (computer software would return something with an enormous estimated standard error).

In conclusion, if you get data sets that result in large estimated effects on the log-odds scale, the above checks can be used to convince yourself whether or not separation has occurred. If there is separation (not the case in the current example), then you could use an alternative to maximum likelihood for estimation ---such as the penalized maximum likelihood in brglm--- which always returns finite estimates. In that case, though, I suggest you reflect the uncertainty about how large the estimated effects are by reporting confidence intervals with one infinite endpoint, for example confidence intervals as in help(profile.brglm).

Hope this helps,
Best wishes,
Ioannis

On 15 Dec 2008, at 18:03, Gavin Simpson wrote:
> [original message quoted in full --- snipped]

------------------------------------------
Ioannis Kosmidis
D0.05, Dept. of Statistics, University of Warwick
Coventry, CV4 7AL, UK
Webpage: http://go.warwick.ac.uk/kosmidis
Voice: +44(0)2476 150778
Fax: +44(0)2476 524532
------------------------------------------
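The dose.p() figure Ioannis quotes is simply -(intercept)/(slope) for p = 0.5, so it can be sanity-checked by hand from the reported coefficients:

```r
## LD50 by hand from the coefficients reported for the fit:
## logit(p) = beta0 + beta1 * Dij = 0  =>  Dij = -beta0 / beta1
beta0 <-   4.1912
beta1 <- -29.3875
ld50 <- -beta0 / beta1
ld50  # ~0.14262, matching the dose.p(mod) output quoted in the thread
```

A finite, precisely estimated LD50 like this is exactly what a non-separated fit looks like; under separation the ratio is ill-determined because both coefficients diverge.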
Gavin:

I think it's important to point out two probably obvious things.

(1) The dataset is very imbalanced: we have an overabundance of analogs == FALSE, roughly 94% of the data. If we think of a purely non-parametric test of equality of the underlying CDFs, then we have a lot of confidence in F0 and not much in F1.

(2) This aside, it does appear that the Dij value 0.209 is the optimum from the standpoint of maximizing Youden's index, Se + Sp - 1, which is the expected utility assigning utilities +/- 1/pi and +/- 1/(1-pi) to true/false positives and negatives. In this case a true/false positive is worth 0.94/0.06 the value of a true/false negative, which seems reasonable given the imbalance of the dataset and the expectation that the measurements are equally precise in the two populations.

> x <- 0.209
> with(dat, c(sp <- mean(Dij[!analogs] > x),
+             se <- mean(Dij[analogs] <= x),
+             sp + se - 1))
[1] 0.9443561 0.9269231 0.8712792

So it appears that the dataset is quite well separated into two samples at the cutpoint 0.209.

Grant Izmirlian
NCI

On 15 Dec 2008, at 18:03, Gavin Simpson wrote:
> [original message quoted in full --- snipped]

______________________________________________
R-help at r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.
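Grant's single-cutpoint check generalizes to scanning a grid of cutpoints for the Youden-index maximum. A self-contained sketch, with simulated data standing in for dat.csv (using roughly the logistic coefficients reported in the thread):

```r
## Scan cutpoints and pick the one maximizing Youden's index Se + Sp - 1
set.seed(1)
Dij     <- runif(2000)
analogs <- rbinom(2000, 1, plogis(4.19 - 29.4 * Dij)) == 1
youden <- function(x)
    mean(Dij[analogs] <= x) + mean(Dij[!analogs] > x) - 1
grid <- seq(min(Dij), max(Dij), length.out = 200)
best <- grid[which.max(sapply(grid, youden))]
best  # cutpoint maximizing Se + Sp - 1
```

With these simulated coefficients the optimum lands near the LD50 (about 0.14) rather than Grant's 0.209, because the simulated Dij distribution differs from the real data; the procedure, not the number, is the point.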