hi all,
I am fitting a logistic regression model on binary data. I care about
the fitted probabilities, so I am not worried about infinite
(or non-existent) MLEs. I use:
> glm(Y~., data=X, weights=wgt, family=binomial(link=logit), maxit=250)
I understand the three ways to specify the response in a binomial glm,
and in my case Y is a factor in a single column:
> Y <- factor(c(rep("A",679), rep("B",38)))
My question is about the weights. I can use integer weights, which
make more mathematical sense:
> wgt <- c(rep(1,679), rep(17,38))
or I can use non-integer weights:
> wgt <- c(rep(38/679,679), rep(1,38))
which makes more sense for my problem, but the mathematics is weak, as I am
using non-integer successes in a Bernoulli. Still, the non-integer weights
make more sense to me, and the predictions of my model actually get better
on the rare class. I estimated the accuracy 'out of the bag' over 10000
experiments and got:
         | integer wgt          | non-int wgt
---------+----------------------+---------------------
accuracy | A = 94.9%  B = 82.3% | A = 94.7%  B = 83.3%
std.dev. |      2.3%      15.4% |      2.6%      13.2%
avg. AIC | 707                  | 124
As I understand it, instead of augmenting the successes on the rare class,
which I did not observe, I am simply down-weighting the successes on the
populous class. The populations can be thought of as equal; only the
sample sizes are unbalanced.
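To make the down-weighting idea concrete, here is a minimal sketch on
simulated data. The predictor x and the data frame dat are made up for
illustration and are not the real data. It also assumes a weight of 679/38
(rather than 17) on the rare class, so that the two weight vectors are exact
rescalings of each other. Under that assumption the coefficients and fitted
probabilities should agree, while the deviance changes with the scale of the
weights.

```r
set.seed(1)
n_a <- 679; n_b <- 38

# Hypothetical data: one predictor, shifted for the rare class "B"
dat <- data.frame(
  Y = factor(c(rep("A", n_a), rep("B", n_b))),
  x = c(rnorm(n_a, mean = 0), rnorm(n_b, mean = 1.5))
)

# Up-weight the rare class vs. down-weight the populous class
w_up   <- c(rep(1, n_a), rep(n_a / n_b, n_b))
w_down <- w_up * n_b / n_a   # i.e. 38/679 on class A, 1 on class B

# glm() warns about non-integer #successes; silenced here on purpose
fit_up   <- suppressWarnings(glm(Y ~ x, data = dat, weights = w_up,
                                 family = binomial(link = "logit")))
fit_down <- suppressWarnings(glm(Y ~ x, data = dat, weights = w_down,
                                 family = binomial(link = "logit")))

# Compare coefficients and fitted probabilities between the two fits
print(all.equal(coef(fit_up), coef(fit_down), tolerance = 1e-4))
print(all.equal(fitted(fit_up), fitted(fit_down), tolerance = 1e-4))

# The deviance is not invariant to the scale of the weights
print(c(up = deviance(fit_up), down = deviance(fit_down)))
```

Since the deviance depends on the overall scale of the weights, the AIC
values reported for the two weightings are probably not directly comparable.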
I was hoping that the continuity of the Binomial for N in [0,1] and X in
[0,1] would guarantee that my results still make sense, but I am not
sure. Any thoughts? Thanks
Edo