Anthony Damico
2012-Sep-21 14:51 UTC
[R] Exactly Replicating Stata's Survey Data Confidence Intervals in R
Hi everyone, apologies if the answer to this is in an obvious place. I've been searching for about a day and haven't found anything.. I'm trying to replicate Stata's confidence intervals in R with the survey package, and the numbers are very very close but not exact. My ultimate goal is to replicate Berkeley's SDA website with R (http://sda.berkeley.edu/), which seems to match Stata precisely. Everything lines up, except for the confidence intervals, and I'm wondering if there's a relatively straightforward reason why the numbers are different. To review the two major cross-package comparison documents on Dr. Lumley's homepage of the survey package -- http://faculty.washington.edu/tlumley/survey/ucla-examples.pdf -- doesn't contain confidence interval output. http://faculty.washington.edu/tlumley/survey/YRBS-report-extension.pdf -- confirms that the confidence interval differences between SUDAAN and R in Table 3 exist, but doesn't provide much detail about why. I tried running the analysis example below in SUDAAN as well, which calculated confidence intervals not matching either R or Stata -- I'm confused why all three would be different. I understand (as the report above quotes) these differences are statistically inconsequential, but I aim to convince people to switch to R, so hitting numbers dead-on might provide some reassurance to skeptics.. I've pasted some ultra-simple Stata code below and (more importantly) the output it generates. Below that, I've pasted some commented R code that shows how everything matches exactly except for the confidence intervals, and then displays a number of my failed attempts at hitting the numbers right on the nose. Thanks!! Anthony Damico Kaiser Family Foundation * Stata/MP 11.2 * example stata code to replicate in R use http://www.ats.ucla.edu/stat/stata/library/apiclus1, clear svyset dnum [pw=pw], fpc(fpc) gen ell0 = 1 replace ell0 = 0 if ell != 0 svy: mean ell0 * results of the final command above: (running mean on estimation sample) Survey: Mean estimation Number of strata = 1 Number of obs = 183 Number of PSUs = 15 Population size = 9235.4 Design df = 14 -------------------------------------------------------------- | Linearized | Mean Std. Err. [95% Conf. Interval] -------------+------------------------------------------------ ell0 | .0218579 .0226225 -.0266624 .0703783 -------------------------------------------------------------- # R x64 # simple example using example code pulled from ?svymean options(digits=20) library(survey) library(foreign) # read the data file from a website, to make sure you're using the same data in both R and stata x <- read.dta( "http://www.ats.ucla.edu/stat/stata/library/apiclus1.dta" ) dclus1<-svydesign(id=~dnum, fpc=~fpc, data=x) # mean matches exactly coef(svymean(~I(ell==0), dclus1)) # SE matches exactly SE(svymean(~I(ell==0), dclus1)) # design df matches exactly degf(dclus1) # unwtd count matches exactly unwtd.count( ~ell, dclus1) # wtd count matches exactly svytotal( ~!is.na(ell), dclus1) # number of clusters match exactly dclus1 # none of the confidence interval options match exactly # the standard confint() will certainly be wrong, # since stata gives an asymmetric confidence interval confint(svymean(~I(ell==0), dclus1)) svyciprop(~I(ell==0), dclus1, method="me") # 2nd way to perform confint() # but none of the other methods match either.. svyciprop(~I(ell==0), dclus1, method="li") svyciprop(~I(ell==0), dclus1, method="lo") svyciprop(~I(ell==0), dclus1, method="as") svyciprop(~I(ell==0), dclus1, method="be") In case it's of any value, here's my R session info:> sessionInfo()R version 2.15.1 (2012-06-22) Platform: x86_64-pc-mingw32/x64 (64-bit) locale: [1] LC_COLLATE=English_United States.1252 LC_CTYPE=English_United States.1252 [3] LC_MONETARY=English_United States.1252 LC_NUMERIC=C [5] LC_TIME=English_United States.1252 attached base packages: [1] stats graphics grDevices utils datasets methods base other attached packages: [1] foreign_0.8-50 survey_3.28-2 loaded via a namespace (and not attached): [1] MASS_7.3-18 tools_2.15.1 [[alternative HTML version deleted]]
Thomas Lumley
2012-Sep-23 20:30 UTC
[R] Exactly Replicating Stata's Survey Data Confidence Intervals in R
On Sat, Sep 22, 2012 at 2:51 AM, Anthony Damico <ajdamico at gmail.com> wrote:> Survey: Mean estimation > > Number of strata = 1 Number of obs = 183 > Number of PSUs = 15 Population size = 9235.4 > Design df = 14 > > -------------------------------------------------------------- > | Linearized > | Mean Std. Err. [95% Conf. Interval] > -------------+------------------------------------------------ > ell0 | .0218579 .0226225 -.0266624 .0703783 > --------------------------------------------------------------This matches> svyciprop(~I(ell==0),dclus1,df=14,method="mean")2.5% 97.5% I(ell == 0) 0.0218579 -0.0266624 0.07038 as does this> confint(svymean(~I(ell==0),dclus1),df=14)2.5 % 97.5 % I(ell == 0)FALSE 0.92962174505 1.02666240796 I(ell == 0)TRUE -0.02666240796 0.07037825495 The df= argument is not explicitly documented in ?svyciprop, but it is in ?confint.svystat and ?svymean [I was slowed down a bit by the claim that the Stata intervals were asymmetric, but in fact they aren't] -thomas -- Thomas Lumley Professor of Biostatistics University of Auckland