Hi:
I think you created a problem for yourself in the way you generated your
data.
y <- rbinom(2000, 1, 0.7)
euro <- rnorm(2000, mean = 300 * y + 50 * (1 - y), sd = 20 * y + 12 * (1 - y))
# Create a 2000 x 2 matrix of probabilities
prmat <- cbind(0.8 * y + 0.2 * (1 - y), 0.2 * y + 0.8 * (1 - y))
# Sample sex from each row of prmat, with the rows supplying the
# sampling probabilities
sex <- apply(prmat, 1, function(x) sample(c('m', 'f'), 1, prob = x))
df <- data.frame(euro, sex, y)
# Histogram of euro: notice the separation in distributions
hist(euro, nclass = 50)
# Generate an indicator between the two clusters of euro
spl <- euro > 150
# Now show a table of that split vs. response
table(spl, y)
This is what I get for my simulation:

table(spl, y)
       y
spl        0    1
  FALSE  572    0
  TRUE     0 1428
which in turn leads to
m <- glm(y ~ euro + sex, data = df, family = binomial)
Warning messages:
1: glm.fit: algorithm did not converge
2: glm.fit: fitted probabilities numerically 0 or 1 occurred
This is what is known as 'complete data separation' in the logistic
regression literature. Basically, you've generated data so that all the
successes are associated with a N(300, 20) distribution and all the
failures with a N(50, 12) distribution, so a single cutpoint on euro
separates the 0s from the 1s perfectly. If this is a classification
problem, congratulations - the margin on the support vector will be huge
:) OTOH, if you're trying to fit a logistic model for purposes of
explanation, you've created a problem, especially with respect to
prediction.
...and it does matter whether this is a regression problem or a
classification problem. In the latter, separation is a good thing; in the
former, it creates convergence problems.
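If the goal is still to get usable logistic estimates in the presence of
separation, one standard remedy is Firth's bias-reduced (penalized)
likelihood. A minimal sketch, assuming the df generated above and the
logistf package (my suggestion, not something asked about in the original
post):

library(logistf)
# Firth-penalized fit: finite coefficient estimates exist even under
# complete separation, unlike the ordinary ML fit from glm()
mf <- logistf(y ~ euro + sex, data = df)
summary(mf)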
Since you have a continuous predictor in the model, there is an additional
complication: the logistic regression null deviance does not have an
asymptotic chi-square distribution, so tests involving reductions of
deviance from the null model are not guaranteed to have asymptotic
chi-square distributions *when the predictor x is truly continuous*.
More below.
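To see the continuous-covariate point numerically, here is a small
simulation sketch of my own (the parameters are arbitrary): with a
continuous x, every observation is its own covariate pattern, so the data
are fully ungrouped and the residual deviance sits nowhere near its
nominal chi-square reference.

set.seed(42)
devs <- replicate(1000, {
  x <- rnorm(100)
  y <- rbinom(100, 1, plogis(0.5 * x))
  deviance(glm(y ~ x, family = binomial))
})
# The nominal reference would be chi-square on 98 df (mean 98)
mean(devs)
hist(devs, nclass = 30)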
On Wed, Dec 29, 2010 at 9:48 AM, Federico Bonofiglio
<bonoricus@gmail.com> wrote:
> Dear Masters,
> first I'd like to wish you all a great 2011 and happy holidays by now,
>
> second (here comes the boring stuff) I have a question which I hope you
> will answer:
>
> I run a logistic regression by glm(), on the following data type
> (y1=1,x1=x1); (y2=0,x2=x2); ... (yn=0,xn=xn), where the response (y) is
> a binary outcome on 0,1 and x is any explanatory variable (continuous or
> not) observed with the i-th outcome.
>
> This is indeed one of the most frequent cases when challenged with binary
> responses, though I know R manages such responses slightly differently (a
> vector for the success counts and one for the failures) and I'm not sure
> whether my summary.glm gives me any sensible answer at all....
>
> for the purpose I have tried to emphasize the numbers so as to obtain
> significant results
>
> y <- rbinom(2000, 1, .7)  # response
>
> for(i in 1:2000){
>   euro[i] <- if(y[i] == 1){ rnorm(1, 300, 20)  # explanatory 1
>   } else { rnorm(1, 50, 12) }
> }
>
> for(i in 1:2000){
>   sex[i] <- if(y[i] == 1){ sample(c("m","f"), 1, prob = c(.8, .2))  # explanatory 2
>   } else { sample(c("m","f"), 1, prob = c(.2, .8)) }
> }
>
>
>
> m<-glm(y~euro+factor(sex),family=binomial)
>
> summary(m)
>
>
>
>
> My worries:
>
> - are the estimates correct?
>
The people who wrote the glm() routine were smart enough to anticipate the
data separation case and are warning you of potential instability in the
model estimates/predictions when fitted probabilities of exactly 0 or 1
occur. This is a warning to take seriously - your generated data produce
such fits from euro alone.
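A quick way to see what the warning is flagging (my addition): look at the
fitted values from the separated fit.

# Fitted probabilities pile up at (numerically) 0 and 1
range(fitted(m))
summary(fitted(m))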
> - degrees of freedom grow dramatically (one per cell), so may I
> risk never obtaining a significant result?
>
Comparisons between nested models will be the same whether the data are
grouped or ungrouped, at least in non-pathological situations.
>
> I also take the chance to ask whether you know of any implemented method
> to plot logistic curves directly out of a glm() model
>
The following is an example to illustrate some of the questions you raised.
# Example to illustrate the difference between grouped and ungrouped
# logistic regression analyses
library(reshape2)
library(lattice)
# Sample 300 x values from 1:50 with replacement
x <- sample(1:50, 300, replace = TRUE)
# P(Y = 1 | X) increases with x
y <- rbinom(300, 1, (10 + x)/80)
ind <- x > 25
# males sampled more heavily when x > 25
p <- cbind(0.7 * ind + 0.3 * (1 - ind), 0.3 * ind + 0.7 * (1 - ind))
sex <- apply(p, 1, function(x) sample(c('m', 'f'), 1, prob = x))
df <- data.frame(sex, x, y)
# Ungrouped logistic regression
# treat x as a continuous covariate
m1 <- glm(y ~ sex + x, data = df, family = binomial)
# Group the data by x * sex combinations
u <- as.data.frame(xtabs(~ y + sex + x, data = df))
# dcast() reshapes the data so that the 0/1 frequencies become separate columns
# dcast() comes from package reshape2
# In this case, sex and x are ID variables, and y is reshaped from long to
# wide form
v <- dcast(u, sex + x ~ y, value.var = 'Freq')
# rename new columns
names(v)[3:4] <- c('f', 's')
# x converted to factor in the process; undo and return to numeric
v$x <- as.numeric(as.character(v$x))
# Grouped logistic regression
# still treat x as a continuous covariate - same model as above
m2 <- glm(cbind(s, f) ~ sex + x, data = v, family = binomial)
summary(m1) # ungrouped data
summary(m2) # grouped data
# If all sex * x combinations are sampled, there should be 97 df for
# residual deviance; anything less than that means some combinations
# had zero frequency
# Prediction and plot of the estimated response probabilities:
pframe <- data.frame(sex = rep(c('m', 'f'), each = 50), x = rep(1:50, 2))
pframe$pp <- predict(m2, newdata = pframe, type = 'response')
xyplot(pp ~ x, data = pframe, groups = sex, type = 'l')  # from package lattice
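If lattice isn't handy, a base-graphics version of the same plot is easy
enough (my sketch):

# Plot the two estimated sex-specific logistic curves with base graphics
with(subset(pframe, sex == 'm'),
     plot(x, pp, type = 'l', ylim = c(0, 1), ylab = 'P(Y = 1)'))
with(subset(pframe, sex == 'f'), lines(x, pp, lty = 2))
legend('topleft', legend = c('m', 'f'), lty = 1:2)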
Note that the coefficients and tests are the same in the two models, but
the deviances (null and residual) are different, with different degrees of
freedom; because of this, the AICs will also differ. Try this again with a
sample of 3000: the predicted probabilities in the two sex groups will be
closer, since sex is not a significant factor, and the total and residual
df will be close to the n = 300 case for the *grouped* data but not for
the ungrouped data. Also observe that in the n = 3000 case, the estimates
and tests of the model effects are again the same between the grouped and
ungrouped data. In other words, increasing the sample size doesn't, in
itself, create inferential problems.
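To check these claims directly (my addition):

# Effect tests agree between the two fits; deviances and AICs do not
anova(m1, test = 'Chisq')   # ungrouped
anova(m2, test = 'Chisq')   # grouped
AIC(m1); AIC(m2)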
Observe, however, that the same 50 x values are being sampled in both cases,
so grouping has a positive effect in this case, particularly wrt residual
analysis. There is redundant information among the inputs. Even though we
are acting as though x is continuous, it is (at least conditionally) an
ordered factor. By behaving as though it is continuous, we are in effect
saying that the linear contrast dominates the overall effect of x on the
response.
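One way to check whether the linear contrast really does dominate (my
sketch): refit the grouped model with x as a factor and run a lack-of-fit
test of the linear-in-x model against it.

m2f <- glm(cbind(s, f) ~ sex + factor(x), data = v, family = binomial)
# The linear model is nested in the factor model, so this is a valid
# deviance comparison
anova(m2, m2f, test = 'Chisq')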
[Note that in the grouped data case, there are 50 x 2 = 100 possible x * sex
combinations. With three parameters to estimate, the maximum residual df for
grouped data is 100 - 3 = 97, assuming all combinations have frequency >= 1.]
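This is easy to verify from the fit (my addition):

nrow(v)           # number of observed sex * x combinations (at most 100)
df.residual(m2)   # equals nrow(v) - 3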
In contrast, if you sample x from a normal distribution, say, then grouping
the data by x provides no real reduction unless you artificially bin the x's
into intervals, as in a histogram. In that case, however, the estimates and
tests will not be exactly the same between the ungrouped and grouped data
because, in this situation, binning coarsens the information in x.
For yet another perspective, tweak my code above as follows: sample x from a
normal distribution instead of from 1:50 and use the same code to group the
data. How much difference is there between the ungrouped and grouped data?
What would happen if we repeated the exercise with a sample of size 3000
from a normal distribution and then grouped it? Would the same sized table
of grouped sex * x occur in both cases?
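Here is one way to carry out that tweak (a sketch; the normal mean/sd and
the rounding-to-integers binning are my own arbitrary choices):

set.seed(101)
x2 <- rnorm(300, 25, 10)
y2 <- rbinom(300, 1, plogis((x2 - 25) / 10))
m3 <- glm(y2 ~ x2, family = binomial)   # ungrouped, x2 continuous
xbin <- round(x2)                       # artificial binning, as in a histogram
u2 <- as.data.frame(xtabs(~ y2 + xbin))
v2 <- dcast(u2, xbin ~ y2, value.var = 'Freq')   # reshape2 again
names(v2)[2:3] <- c('f', 's')
v2$xbin <- as.numeric(as.character(v2$xbin))
m4 <- glm(cbind(s, f) ~ xbin, data = v2, family = binomial)
coef(m3); coef(m4)   # close, but not identical: binning coarsens x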
Some related topics to search on in regard to the issues you raised include:
(i) separability of data (as mentioned earlier); (ii) the Hauck-Donner
effect; and (iii) the Neyman-Scott problem (which comes into play when
considering the asymptotic distribution of the null deviance when a
continuous covariate is present in a logistic regression model).
Since I'm still learning this subject myself, corrections/amplifications
would be welcome.
HTH,
Dennis