Kristof
2012-Sep-25 12:45 UTC
[R] Three Stage Sampling of categorical variable using 'survey' in R
For a sanitation project in Bangladesh I need to design a three-stage sample survey that is representative of around 40 million people. I suddenly find myself facing several challenges and would be grateful for any help. As the questions are linked, I have kept them together rather than creating multiple posts.

1) SURVEY DESIGN
So far I have mainly designed two-stage cluster surveys and have never done a three-stage cluster survey design. It seems that the analysis only takes into account the PSU and the enumeration area, so whatever happens at the second stage seems irrelevant to the analysis, which seems odd to me.

Our intention was to do a PPM at the first and the second stage and have same-size takes in each enumeration area. The design would be to select 50 out of 150 Upazilas (sub-districts) as PSUs using probability proportionate to size. The second stage would be 6 village-groups out of an average of 250 village-groups per Upazila, using PPS. The third stage would use SRS to select 26 households in each of the 6 selected villages per Upazila. Total sample size: 7800.

The household is the BSU, and where we need to calculate information at the individual level we are confident that we can correct the sample weights for that. In the two-stage designs I worked on in other projects I could optimise the sample design based on cost, but this seems more difficult with three-stage sampling.

2) CATEGORICAL VARIABLES
So far I have worked mainly with binary data, but now we are collecting ranked categorical variables and I'm not sure how to treat these. The categorical variables form a scale of adherence to a certain level of sanitation, but the scale is not linear.

3) Using "R" instead of STATA
While I have always wanted to learn R, I have always found it hard to get my head around it, even with RStudio and R Commander installed. I installed the "survey" package and tried to read up on how to use it, but without success. Is there anybody willing to help?

While I can get my head around the basic probability principles in survey sampling, I'm not a statistician, so I'll swallow my pride and ask you to explain it as if I were a 10-year-old, just to be sure I get it. Any good reference material is always welcome, but direct answers would help me the most due to time constraints, as we have to finish the design in the coming days.

Thanks a lot in advance for all your help.

Kristof

--
View this message in context: http://r.789695.n4.nabble.com/Three-Stage-Sampling-of-categorical-variable-using-survey-in-R-tp4644110.html
Sent from the R help mailing list archive at Nabble.com.
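
The arithmetic of the design described above can be sketched in base R. This is only an illustration, not part of the original post: the household counts (hh_total, hh_upazila, hh_vgroup) and the helper household_weight() are hypothetical, and the sketch assumes that the PPS size measure at both stages is the number of households, that no unit is large enough to be selected with certainty, and that the inclusion probability of a unit is its sample size times its share of the size measure.

```r
# Sample sizes taken from the design described in the post.
n_upazila <- 50   # Upazilas selected at stage 1 (PPS)
n_vgroup  <- 6    # village-groups per selected Upazila at stage 2 (PPS)
n_hh      <- 26   # households per selected village-group at stage 3 (SRS)

total_sample <- n_upazila * n_vgroup * n_hh   # 50 * 6 * 26 households

# Hypothetical household counts, invented for illustration only.
hh_total   <- 9e6     # households in the whole sampling frame
hh_upazila <- 60000   # households in one selected Upazila
hh_vgroup  <- 240     # households in one selected village-group

# Hypothetical helper: design weight of a household as the inverse of the
# product of the three stage-wise inclusion probabilities.
household_weight <- function(hh_total, hh_upazila, hh_vgroup) {
  p1 <- n_upazila * hh_upazila / hh_total    # stage 1
  p2 <- n_vgroup  * hh_vgroup  / hh_upazila  # stage 2, given stage 1
  p3 <- n_hh / hh_vgroup                     # stage 3, given stage 2
  1 / (p1 * p2 * p3)
}

w       <- household_weight(hh_total, hh_upazila, hh_vgroup)
w_other <- household_weight(hh_total, 35000, 410)  # a differently sized unit

print(total_sample)
print(c(w, w_other, hh_total / total_sample))
```

Under these assumptions the unit sizes cancel in the product, so every household ends up with the same weight, hh_total / total_sample. In practice the size measures are rarely exact, so the real weights would vary somewhat.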
Thomas Lumley
2012-Sep-25 20:47 UTC
[R] Three Stage Sampling of categorical variable using 'survey' in R
On Wed, Sep 26, 2012 at 12:45 AM, Kristof <bostoen at irc.nl> wrote:
> 1) SURVEY DESIGN
> So far I designed mainly two stage cluster surveys but never did a three
> stage cluster survey design. It seems that in the analysis only the PSU is
> taken into account and enumeration area. So whatever happens at the second
> stage seems irrelevant to the analysis which seem odd to me.

There are two issues here that aren't the same.

If you don't provide population size information, the analysis depends only on the PSU, strata, weights, and measurements. If the sample is much smaller than the population, then even if you do provide population size information the analysis essentially depends only on the PSU, strata, weights, and measurements.

This doesn't mean that the design doesn't matter after stage 1; it just means that the weights and the distribution of the measurements tell you everything about the subsequent stages that you need to know. In particular, the variability in weight*data is important, and different designs can give very different standard errors.

The same design principles apply at later stages of design as at stage 1: stratifying on a variable correlated with the variable of interest will increase precision, and the Neyman allocation formula still tells you how to choose stratum sizes based on what you know about variance and cost. It's harder to optimize a multistage design because there are many more options, and which design is best will depend on a lot of things you don't know, but it's not intrinsically different from optimising a single-stage design.

> Our intention was to do a PPM at the first and the second stage and have
> same size takes in each enumeration area.
> The design would be to select 50 out of 150 Upazila's (sub-districts) as
> PSU using probability proportionate to size.
> The second stage would be 6 village-groups out of an average of 250
> village-groups per Upazila using PPS
> use SRS to select 26 households in each of the 6 selected villages per
> Upazila. Total sample size 7800
> Household is the BSU and where we need to calculate information on the
> individual level we are confident to be able to correct the sample weights
> for that.

That sounds plausible

-thomas

--
Thomas Lumley
Professor of Biostatistics
University of Auckland
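
The first point in the reply (without population size information, the analysis depends only on the PSU, strata, weights, and measurements) can be checked with a small sketch using svydesign() and svymean() from the "survey" package. This is an illustration under stated assumptions rather than anything from the thread itself: the data are simulated, the column names (upazila, vgroup_id, hh_id, wt, sanitation), the category labels, and the constant weight of 1150 are all invented, and the package has to be installed first. The sanitation variable is kept as a plain (unordered) factor, so svymean() simply reports the estimated proportion in each category.

```r
library(survey)

set.seed(42)

# Simulated stand-in for real data: 50 Upazilas x 6 village-groups x
# 26 households. All values are invented for illustration.
dat <- expand.grid(hh = 1:26, vgroup = 1:6, upazila = 1:50)
dat$vgroup_id <- interaction(dat$upazila, dat$vgroup, drop = TRUE)
dat$hh_id     <- seq_len(nrow(dat))
dat$wt        <- 1150   # hypothetical constant design weight

# Four-level sanitation category with an Upazila-level effect, so that
# households in the same Upazila resemble each other (clustering).
upazila_effect <- rnorm(50)[dat$upazila]
latent         <- upazila_effect + rnorm(nrow(dat))
dat$sanitation <- cut(latent, breaks = c(-Inf, -1, 0, 1, Inf),
                      labels = c("none", "basic", "improved", "safe"))

# Design 1: only the PSU is declared.
des_psu <- svydesign(ids = ~upazila, weights = ~wt, data = dat)

# Design 2: all three stages are declared, but no population sizes (fpc).
des_full <- svydesign(ids = ~upazila + vgroup_id + hh_id,
                      weights = ~wt, data = dat)

# Estimated proportion of households in each sanitation category.
est_psu  <- svymean(~sanitation, des_psu)
est_full <- svymean(~sanitation, des_full)

print(est_full)
print(cbind(SE_psu_only = SE(est_psu), SE_three_stage = SE(est_full)))
```

Because no fpc is supplied, both designs should give the same estimates and standard errors, which is the sense in which the later stages appear to be "ignored". Their effect still enters through the weights and through how much the measurements vary between PSUs.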