I have survey data that I am working on, and I need to produce some multi-way tables and regression analyses from it. After attaching the data, this is the code I use to tabulate four variables (sweight is the weight variable):

> a <- xtabs(sweight ~ research.area + gender + a2n2 + age)
> tmp <- ftable(a)

Is this correct? I don't think I need to use the strata and cluster variables, right?

Below is the logistic regression code that I use for randomly sampled, or unweighted, data:

> logit.1 <- glm(var4 ~ var3 + var2 + var1, family = binomial(link = "logit"))
> summary(logit.1)

But how can I do the same analyses for the weighted data? Here is some additional info. There are four variables in the dataset that reflect the sampling structure:

strat: stratum (urban or (sub-county) rural)
clust: batch of interviews that were part of the same random walk
vill_neigh_code: village or neighbourhood code
sweight: weights

--
View this message in context: http://r.789695.n4.nabble.com/crosstable-and-regression-for-survey-data-weighted-tp4634083.html
Sent from the R help mailing list archive at Nabble.com.
Pablo Domínguez Vaselli
2012-Jun-22 16:10 UTC
[R] crosstable and regression for survey data (weighted)
Regarding regression models, there's some debate about whether it is necessary to take the sample design into account (SPSS, for instance, doesn't), so you can run them normally without much remorse. Or you can complicate your life (see below).

Your xtabs call seems OK to me. However, regarding tables and totals, you can expand cases the way SPSS and most other software does (frequency weights) with this code:

> mydata.x <- mydata[rep(1:nrow(mydata), mydata$sweight), ]

Once your dataframe is expanded this way, any totals and crosstabulations will be right without setting a count variable in xtabs or other functions, using just about any normal call you want (e.g. aggregate(), table(), etc.). This approach is memory-intensive: the dataframe will be as large as the target population.

However, to deal properly with complex sample data you need the survey package (I think this is the only sound approach to your modelling problem). This package lets you calculate design effects and variance estimators and fit regression models that take the survey design into account, without hitting the RAM as above. In that case, you must first feed the design variables to a survey design object, using something like:

> library(survey)
> mydesign <- svydesign(ids=~vill_neigh_code+clust, strata=~strat, weights=~sweight, data=mydata)

Do check the survey package's vignette and help files; this is tricky. It will also help to have the neighbourhoods' population figures. You must also check the nesting (that is, whether the cluster ids reuse names across strata; if they do, pass nest=TRUE to svydesign).

Note that the survey package has special functions for just about anything (including getting your frequencies). All of them start with "svy", as in "svytable", and they return variance estimators (note that your estimation errors will vary from table to table in such a complex design).
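To illustrate the case-expansion idea mentioned above, here is a minimal sketch on made-up data (the values and the small integer weights are assumptions for illustration only). It checks that counting rows in the expanded dataframe gives the same table as putting the weight on the left-hand side of xtabs. With fractional weights the expansion would need rounding first, which changes the totals, so the xtabs form is the safer of the two.

```r
# Toy data; the values and integer weights are invented for illustration.
mydata <- data.frame(
  gender  = c("f", "m", "f", "m", "f"),
  area    = c("urban", "urban", "rural", "rural", "urban"),
  sweight = c(2, 3, 1, 4, 2)
)

# Weighted cross-tabulation: the weight goes on the left-hand side.
weighted_tab <- xtabs(sweight ~ gender + area, data = mydata)

# Case expansion: repeat each row sweight times, then simply count rows.
mydata.x <- mydata[rep(seq_len(nrow(mydata)), mydata$sweight), ]
expanded_tab <- xtabs(~ gender + area, data = mydata.x)

print(weighted_tab)
print(expanded_tab)
```

Both tables give point estimates only; neither says anything about sampling error.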
Survey example:

> data(api)
> xtabs(~sch.wide+stype, data=apipop)
> dclus1 <- svydesign(id=~dnum, weights=~pw, data=apiclus1, fpc=~fpc)
> summary(dclus1)
> (tbl <- svytable(~sch.wide+stype, dclus1))

Once you've specified your survey design, you can fit a design-conscious glm model using:

> mymodel <- svyglm(var1~var2+var3, design=mydesign, family=quasibinomial())

If you're out of time just use normal xtabs and glm!
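As a rough end-to-end sketch of the workflow described above, the following uses synthetic data that only borrows the original poster's variable names (strat, clust, vill_neigh_code, sweight, response, var1, gender). The data, the weights, and the model are all invented. Village codes are deliberately reused across strata, so nest = TRUE is passed to svydesign().

```r
library(survey)

set.seed(1)
# Synthetic stand-in: 2 strata x 4 villages x 2 clusters x 5 people = 80 rows.
mydata <- expand.grid(person = 1:5, clust = 1:2,
                      vill_neigh_code = 1:4,
                      strat = c("urban", "rural"))
n <- nrow(mydata)
mydata$sweight  <- ifelse(mydata$strat == "urban", 10, 25)  # made-up weights
mydata$var1     <- rnorm(n)
mydata$gender   <- sample(c("f", "m"), n, replace = TRUE)
mydata$response <- rbinom(n, 1, 0.4)

# Village codes repeat across strata here, hence nest = TRUE.
mydesign <- svydesign(ids = ~vill_neigh_code + clust, strata = ~strat,
                      weights = ~sweight, data = mydata, nest = TRUE)

# Design-based table and logistic regression; variables are named directly.
tab <- svytable(~gender + strat, mydesign)
fit <- svyglm(response ~ var1 + gender, design = mydesign,
              family = quasibinomial())

print(tab)
print(summary(fit))
```

Note that svyglm() takes only the formula, the design, and the family here; the data and the nesting information already live inside the design object.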
Thanks Pablo for your answer, it was very insightful, but I guess I got something wrong. I formed a survey design as:

> library(survey)
> mydesign <- svydesign(ids=~vill_neigh_code+clust, strata=~strat, weights=~sweight, data=mydata)

where

strat: stratum (urban or (sub-county) rural)
clust: batch of interviews that were part of the same random walk
vill_neigh_code: village or neighbourhood code
sweight: probability weights

Then I run a logistic regression as

> logit.1 <- svyglm(response~var1+var2+var3+var4+var5+var6, design=mydesign, data=mydata, nest=TRUE, family=quasibinomial())

And I get this error message:

Error in svyglm.survey.design(response ~ var1 + var2 + var3 + var4 + :
  all variables must be in design= argument

What should I change in the syntax in this case?
Pablo Domínguez Vaselli
2012-Jun-29 11:16 UTC
[R] crosstable and regression for survey data (weighted)
It seems the variable names you've used are not the same as those in the design object. "all variables must be in design= argument" refers to the object you assigned with

> mydesign <- svydesign(ids=~vill_neigh_code+clust, strata=~strat, weights=~sweight, data=mydata)

Check the spelling. Note that "mydesign" is *not* a dataframe. That means that mydesign[,5] or mydesign$myvar won't work (and of course neither will naming the original dataframe "mydata"). You must use the variable names alone. For instance,

> svyglm(api00~ell+meals+mobility, design=dstrat)

is correct, using only the variable names, not dstrat[smth]~dstrat[smth]+dstrat[smth].

If you write the names correctly it should work.

regards
pablo
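One way to track down which name is the culprit is to compare the variables in the formula against the columns stored in the design object. The sketch below uses the api data from the survey package; high_meals is a hypothetical variable created outside the dataframe, which is one plausible way to end up with this error (for example after attach()). It is not a diagnosis of the original poster's data.

```r
library(survey)
data(api)

dclus1 <- svydesign(id = ~dnum, weights = ~pw, data = apiclus1, fpc = ~fpc)

# Hypothetical variable that lives in the workspace, not in the design.
high_meals <- apiclus1$meals > 50

f <- sch.wide ~ ell + high_meals

# Formula variables that are absent from the design's stored data.
missing_vars <- setdiff(all.vars(f), names(dclus1$variables))
print(missing_vars)

# One possible fix: add the variable to the design itself, then fit.
dclus1 <- update(dclus1, high_meals = meals > 50)
fit <- svyglm(f, design = dclus1, family = quasibinomial())
print(coef(fit))
```

If missing_vars comes back empty for your own formula and design, the names are fine and the problem lies elsewhere, for instance in extra arguments such as data= or nest= that belong in svydesign() rather than svyglm().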
Thanks Pablo,

There must be a spelling issue then, although I can get the tables and other stuff on the same variables. In this case, I will go for the glm below, and hopefully this will not make the results too bad.

> mylogit <- glm(response ~ var1 + var2 + var3 + var4 + var5 + var6, weights = sweight, family = quasibinomial(link = "logit"))
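For readers weighing this fallback against svyglm(), here is a small comparison sketch on the stratified api sample that ships with the survey package (not the poster's data; the formula is chosen only for illustration). The expectation is that the two fits agree on the coefficients, since both use the same weights, while the standard errors can differ because plain glm() knows nothing about strata, clusters, or the finite population correction.

```r
library(survey)
data(api)

# Stratified sample from the survey package; pw is the sampling weight.
dstrat <- svydesign(id = ~1, strata = ~stype, weights = ~pw,
                    data = apistrat, fpc = ~fpc)

# Plain glm() with weights: ignores the sampling design.
naive <- glm(sch.wide ~ ell + meals, data = apistrat, weights = pw,
             family = quasibinomial(link = "logit"))

# Design-based fit of the same formula.
design_based <- svyglm(sch.wide ~ ell + meals, design = dstrat,
                       family = quasibinomial(link = "logit"))

# Side-by-side coefficients and standard errors.
comparison <- cbind(
  coef_glm    = coef(naive),
  coef_svyglm = coef(design_based),
  se_glm      = sqrt(diag(vcov(naive))),
  se_svyglm   = sqrt(diag(vcov(design_based)))
)
print(round(comparison, 4))
```

How much the standard errors move depends on the design; with strong clustering, as in a random-walk sample, the gap may be larger than in this stratified example.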