Anthony Damico
2012-Sep-21 14:51 UTC
[R] Exactly Replicating Stata's Survey Data Confidence Intervals in R
Hi everyone, apologies if the answer to this is in an obvious place. I've
been searching for about a day and haven't found anything..
I'm trying to replicate Stata's confidence intervals in R with the
survey
package, and the numbers are very very close but not exact. My ultimate
goal is to replicate Berkeley's SDA website with R
(http://sda.berkeley.edu/),
which seems to match Stata precisely. Everything lines up, except for the
confidence intervals, and I'm wondering if there's a relatively
straightforward reason why the numbers are different.
To review the two major cross-package comparison documents on Dr. Lumley's
homepage of the survey package --
http://faculty.washington.edu/tlumley/survey/ucla-examples.pdf -- doesn't
contain confidence interval output.
http://faculty.washington.edu/tlumley/survey/YRBS-report-extension.pdf --
confirms that the confidence interval differences between SUDAAN and R in
Table 3 exist, but doesn't provide much detail about why.
I tried running the analysis example below in SUDAAN as well, which
calculated confidence intervals not matching either R or Stata -- I'm
confused why all three would be different.
I understand (as the report above quotes) these differences are
statistically inconsequential, but I aim to convince people to switch to R,
so hitting numbers dead-on might provide some reassurance to skeptics..
I've pasted some ultra-simple Stata code below and (more importantly) the
output it generates. Below that, I've pasted some commented R code that
shows how everything matches exactly except for the confidence intervals,
and then displays a number of my failed attempts at hitting the numbers
right on the nose.
Thanks!!
Anthony Damico
Kaiser Family Foundation
* Stata/MP 11.2
* example stata code to replicate in R
use http://www.ats.ucla.edu/stat/stata/library/apiclus1, clear
svyset dnum [pw=pw], fpc(fpc)
gen ell0 = 1
replace ell0 = 0 if ell != 0
svy: mean ell0
* results of the final command above:
(running mean on estimation sample)
Survey: Mean estimation
Number of strata = 1 Number of obs = 183
Number of PSUs = 15 Population size = 9235.4
Design df = 14
--------------------------------------------------------------
| Linearized
| Mean Std. Err. [95% Conf. Interval]
-------------+------------------------------------------------
ell0 | .0218579 .0226225 -.0266624 .0703783
--------------------------------------------------------------
# R x64
# simple example using example code pulled from ?svymean
options(digits=20)
library(survey)
library(foreign)
# read the data file from a website, to make sure you're using the same
data in both R and stata
x <- read.dta(
"http://www.ats.ucla.edu/stat/stata/library/apiclus1.dta" )
dclus1<-svydesign(id=~dnum, fpc=~fpc, data=x)
# mean matches exactly
coef(svymean(~I(ell==0), dclus1))
# SE matches exactly
SE(svymean(~I(ell==0), dclus1))
# design df matches exactly
degf(dclus1)
# unwtd count matches exactly
unwtd.count( ~ell, dclus1)
# wtd count matches exactly
svytotal( ~!is.na(ell), dclus1)
# number of clusters match exactly
dclus1
# none of the confidence interval options match exactly
# the standard confint() will certainly be wrong,
# since stata gives an asymmetric confidence interval
confint(svymean(~I(ell==0), dclus1))
svyciprop(~I(ell==0), dclus1, method="me") # 2nd way to perform
confint()
# but none of the other methods match either..
svyciprop(~I(ell==0), dclus1, method="li")
svyciprop(~I(ell==0), dclus1, method="lo")
svyciprop(~I(ell==0), dclus1, method="as")
svyciprop(~I(ell==0), dclus1, method="be")
In case it's of any value, here's my R session info:
> sessionInfo()
R version 2.15.1 (2012-06-22)
Platform: x86_64-pc-mingw32/x64 (64-bit)
locale:
[1] LC_COLLATE=English_United States.1252 LC_CTYPE=English_United
States.1252
[3] LC_MONETARY=English_United States.1252
LC_NUMERIC=C
[5] LC_TIME=English_United States.1252
attached base packages:
[1] stats graphics grDevices utils datasets methods base
other attached packages:
[1] foreign_0.8-50 survey_3.28-2
loaded via a namespace (and not attached):
[1] MASS_7.3-18 tools_2.15.1
[[alternative HTML version deleted]]
Thomas Lumley
2012-Sep-23 20:30 UTC
[R] Exactly Replicating Stata's Survey Data Confidence Intervals in R
On Sat, Sep 22, 2012 at 2:51 AM, Anthony Damico <ajdamico at gmail.com> wrote:> Survey: Mean estimation > > Number of strata = 1 Number of obs = 183 > Number of PSUs = 15 Population size = 9235.4 > Design df = 14 > > -------------------------------------------------------------- > | Linearized > | Mean Std. Err. [95% Conf. Interval] > -------------+------------------------------------------------ > ell0 | .0218579 .0226225 -.0266624 .0703783 > --------------------------------------------------------------This matches> svyciprop(~I(ell==0),dclus1,df=14,method="mean")2.5% 97.5% I(ell == 0) 0.0218579 -0.0266624 0.07038 as does this> confint(svymean(~I(ell==0),dclus1),df=14)2.5 % 97.5 % I(ell == 0)FALSE 0.92962174505 1.02666240796 I(ell == 0)TRUE -0.02666240796 0.07037825495 The df= argument is not explicitly documented in ?svyciprop, but it is in ?confint.svystat and ?svymean [I was slowed down a bit by the claim that the Stata intervals were asymmetric, but in fact they aren't] -thomas -- Thomas Lumley Professor of Biostatistics University of Auckland