Adam C Sales
2016-Nov-02 20:00 UTC
[R] svyglm in the survey package: large regression models with replicate weights give infinite standard errors
I'm running some regressions on data from the American Community Survey, with replicate weights for variance estimation, using the svyglm function from the survey package. For my larger regressions, all of the standard errors were reported as Inf. I did a bit of digging through the code, and figure out why that was: the svrepglm function (see here: https://github.com/cran/survey/blob/9aea3247ea1966667aba3efeeaf4a574e819f924/R/surveyrep.R) calculates standard errors using the formula I would expect, but then adjusts the "degrees of freedom" for the fit with the following line: full$df.residual<-degf(design)+1-length(coef(full)[!is.na(coef(full))]) (where 'full' is the name of the model and 'design' is the specified survey design) It turns out that degf(design) returns the rank of the weight matrix, which, in the case of the ACS at least, is the number of replicate weights, 80. If there are more than 80 columns in the regression design matrix, the df.residual entry in the model is negative, which causes the summary.glm() function to report infinite standard errors. My question is this: Is this a bug in the survey package, or are the standard errors really unidentified for large regression models? Or am I using the wrong settings? If it's just a bug fixing it is easy enough, and that's what I've been doing, but I'd like to make sure I'm not glossing over a serious statistical limitation of the data. Here's a small working example. It's artificial in that I chose an arbitrary regression that had enough regressors. It takes a long time to run. ### step 1: download from http://www2.census.gov/programs-surveys/acs/data/pums/2014/1-Year/csv_pus.zip ### step 2: unzip, put file ss14pusa.csv in working directory library(survey) pums <- read.csv('ss14pusa.csv') ## some of the replication weights are negative. you might want to set them to zero: weightcols <- grep('pwgtp[1-9]',names(pums)) pums[,weightcols] <- apply(pums[,weightcols],2,function(x) ifelse(x<0,0,x)) ## there's some lack of clarity online on how to set these things up. Either way you get infinite standard errors ## Anthony Damico recommends (see: https://stat.ethz.ch/pipermail/r-help/2014-September/422133.html): options( "survey.replicates.mse" = TRUE ) des <- svrepdesign(data=pums,weights=~PWGTP,repweights="pwgtp[1-9]",scale=4/80,rscales=rep(1,80)) ## However, I think the following is correct: des <- svrepdesign(data=pums,weights=~PWGTP,repweights="pwgtp[1-9]",scale=4/80,rscales=rep(1,80),type='Fay',rho=0.5) ## Anyway, here's the problem: summary(mod <- svyglm(AGEP~as.factor(ST)+as.factor(CIT)+as.factor(MAR)+as.factor(RELP)+as.factor(HISP)+as.factor(RAC2P),design=des)) -- UT College of Education SMARTER Consulting