On Tue, 11 May 2010, Jos Elkink wrote:
> Hi all,
>
> I am running a set of logistic regressions, where we want to use some
> weights, and I am not sure whether what I am doing is reasonable or
> not.
>
> The dependent variable is turnout in an election - i.e. survey
> respondents were asked whether or not they voted. The percentage of
> those who say they voted is much higher than the actual turnout,
> probably due both to non-response bias and social desirability issues.
> So now the suggestion is to weight the cases, to weight down the
> respondents who say they voted and weight more heavily those who
> say they did not vote. So the questions that arise from this are:
>
> 1) Is it reasonable to use the distribution of the dependent variable
> to calculate the weights used in a logistic regression? It feels
> wrong, but I cannot find, so far, any sources on this.

Yes and no. There's nothing special about it being the dependent variable.
As with any other method for handling missing data and measurement error, it
won't actually work, but it might reduce the bias.

However, there is something special about it being a logistic regression model
with biased sampling only on the dependent variable. This is better known as
case-control sampling, and there isn't any bias for the coefficients of the
predictors (only the intercept is shifted), so reweighting won't help.
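
The case-control point can be checked with a small simulation. This is only a sketch under stated assumptions: the population, the coefficients (0.5 and 0.8), and the response rates `p_voter` and `p_nonvoter` are all made up for illustration, and selection is assumed to depend on the outcome alone. Over-reporting of turnout (social desirability) is a measurement problem and is not covered by it. Under those assumptions the slope should be recovered without weights, while the intercept moves by about log(p_voter / p_nonvoter).

```r
set.seed(2010)

# Hypothetical population: true intercept 0.5, true slope 0.8
N <- 200000
x <- rnorm(N)
voted <- rbinom(N, 1, plogis(0.5 + 0.8 * x))

# Assumed response rates that depend only on the outcome, not on x
p_voter <- 0.6
p_nonvoter <- 0.2
responded <- rbinom(N, 1, ifelse(voted == 1, p_voter, p_nonvoter)) == 1

dat <- data.frame(x = x, voted = voted, responded = responded)

# Unweighted logistic regression on everyone, then on respondents only
fit_all    <- glm(voted ~ x, data = dat, family = binomial())
fit_sample <- glm(voted ~ x, data = dat, family = binomial(),
                  subset = responded)

# Compare with the intercept shift predicted for outcome-dependent sampling
round(rbind(population = coef(fit_all),
            sample     = coef(fit_sample),
            predicted  = c(0.5 + log(p_voter / p_nonvoter), 0.8)), 3)
```

If selection also depended on the predictors, the slope would no longer be protected in this way.
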
> 2) How to implement this in R? I tried the weights option in glm(),
> but I think that is meant for when you have one row in your data for
> multiple observations, not for this kind of weight. Although I have
> the McCullagh and Nelder book explaining in detail how glm() operates,
> I cannot find a similar book for svyglm(). Is svyglm() better for this
> type of weighting?
In general svyglm() is better for this type of weighting. The point estimates
are the same (and in fact are obtained from glm()), but the standard errors are
more appropriate. Under the unreasonable assumption that the weighting does
correct the bias, the standard errors will also be correct.
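
A minimal sketch of the comparison, assuming the survey package is installed. The data are simulated, the official turnout figure `target` is hypothetical, and the weights are a simple ratio of target share to sample share within each outcome group, which is one way such weights might be built rather than a recommended recipe.

```r
library(survey)

set.seed(1)

# Simulated survey in which reported turnout is higher than the target
n <- 2000
x <- rnorm(n)
voted <- rbinom(n, 1, plogis(1 + 0.8 * x))
dat <- data.frame(x = x, voted = voted)

# Hypothetical official turnout, and weights that match it
target <- 0.55
p_hat <- mean(dat$voted)
dat$w <- ifelse(dat$voted == 1, target / p_hat, (1 - target) / (1 - p_hat))

# Design-based fit: weights are treated as sampling weights
des <- svydesign(ids = ~1, weights = ~w, data = dat)
fit_svy <- svyglm(voted ~ x, design = des, family = quasibinomial())

# Same weights passed to glm(): same point estimates, model-based SEs
fit_glm <- glm(voted ~ x, data = dat, weights = w, family = quasibinomial())

cbind(svyglm = coef(fit_svy), glm = coef(fit_glm))
cbind(svyglm = sqrt(diag(vcov(fit_svy))), glm = sqrt(diag(vcov(fit_glm))))
```

`quasibinomial()` is used in both calls so that non-integer weights do not trigger a warning; the coefficients are the same as with `binomial()`.
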
> 3) Where would I find a good source describing the estimation
> procedure, including weighting, applied in svyglm()?
Well, one source is the book of the package (see
http://faculty.washington.edu/tlumley/svybook/ for its web page). I'm
perhaps not the best person to say whether it's a good source. Chapters 5
and 6 on regression and 7 on post-stratification, raking and calibration would
be relevant.
There is much more detail about the general weighting approach in Sarndal,
Swensson, Wretman "Model Assisted Survey Sampling". Or you can search
for papers on "calibration" and "non-response". The survey
literature generally will not say that much about applying these methods to
regression modelling, but the principles are the same.
-thomas
Thomas Lumley                   Assoc. Professor, Biostatistics
tlumley at u.washington.edu     University of Washington, Seattle