Anthony Damico
2012-Aug-10 17:53 UTC
[R] Direct Method Age-Adjustment to Complex Survey Data
Hi everyone, my apologies in advance if I'm overlooking something simple in
this question. I am trying to use R's survey package to make a direct
method age-adjustment to some complex survey data. I have played with
postStratify, calibrate, rake, and simply multiplying the base weights by
the correct proportions - nothing seems to hit the published numbers on the
nose.
I am trying to replicate any of the following:
the stdize and stdweight parameters from a Stata complex survey
analysis command (like this)
svyset [w= WEIGHTVAR], psu(PSUVAR) strata(STRATAVAR) vce(linearized)
svy: mean VARNAME, stdize(agecat) stdweight(agewt)
the stdvar and stdwgt parameters from a SUDAAN proc descript call
the ESTIMATE statement in a PROC SURVEYREG (used to generate adjusted
means) in SAS version 9.2
I am attempting to match figure 1 from nchs data brief #92 exactly using
R. The data are here: http://www.cdc.gov/nchs/data/databriefs/db92_fig1.png
and the full brief is here:
http://www.cdc.gov/nchs/data/databriefs/db92.htm This is the National
Health and Nutrition Examination Survey, produced by the US CDC.
The CDC has published a bunch of syntax examples for generating
age-adjusted statistics with SUDAAN, SAS, and Stata. Here's a link with
all of that syntax:
http://www.cdc.gov/nchs/tutorials/NHANES/NHANESAnalyses/agestandardization/age_standardization_intro.htm
I've provided my detailed code below that gets to the point of the age
adjustment, and then fails. However, when I export my R data frame near
the end of the process and run a few Stata commands, I get the correct
numbers easily. Note that the code below uses the download.file command
twice to pull the (very small) data sets onto your local computer.
Any guidance about how to perform this age-adjustment would be appreciated!
Thanks!!
Anthony Damico
Kaiser Family Foundation
require(foreign)
require(survey)
############# download the NHANES 2009-2010 demographics and total
cholesterol files ##################
tf <- tempfile()
download.file(
"
ftp://ftp.cdc.gov/pub/Health_Statistics/NCHS/nhanes/2009-2010/demo_f.xpt" ,
tf ,
mode = "wb"
)
NHANES.0910.demographics.df <- read.xport( tf )
tf <- tempfile()
download.file(
"
ftp://ftp.cdc.gov/pub/Health_Statistics/NCHS/nhanes/2009-2010/TCHOL_F.xpt"
,
tf ,
mode = "wb"
)
NHANES.0910.TCHOL_F.df <- read.xport( tf )
############# merge them together, keeping only a few columns
##################
DemoKeepVars <- c( "SEQN" , "WTMEC2YR" ,
"RIDSTATR" , "SDMVPSU" ,
"SDMVSTRA" , "RIDAGEYR" , "RIAGENDR" ,
"RIDRETH1" )
TCHOLKeepVars <- c( "SEQN" , "LBXTC" )
x <-
merge(
NHANES.0910.demographics.df[ , DemoKeepVars ] ,
NHANES.0910.TCHOL_F.df[ , TCHOLKeepVars ] ,
all = T
)
# keep only individuals who took the "mobile examination"
x <- subset( x , RIDSTATR %in% 2 )
# make a few recodes..
x <-
transform(
x ,
# high cholesterol
HI_CHOL ifelse( LBXTC >= 240 , 1 , 0 ) ,
# four race/ethnicity categories
race ifelse( RIDRETH1 %in% 1:2 , 3 ,
ifelse( RIDRETH1 %in% 3 , 1 ,
ifelse( RIDRETH1 %in% 4 , 2 ,
ifelse( RIDRETH1 %in% 5 , 4 ,
NA ) ) ) ) ,
# four age categories
agecat ifelse( RIDAGEYR %in% 0:19 , 0 ,
ifelse( RIDAGEYR %in% 20:39 , 1 ,
ifelse( RIDAGEYR %in% 40:59 , 2 ,
ifelse( RIDAGEYR >= 60 , 3 ,
NA ) ) ) )
)
# 2000 census populations to adjust to
# from excel file:
http://www.cdc.gov/nchs/tutorials/nhanes/Downloads/Continuous/ageadjwt.xls
# found on page:
http://www.cdc.gov/nchs/tutorials/nhanes/nhanesanalyses/agestandardization/Task1c.htm
pop.by.age <- c( 55901 , 77670 , 72816 , 45364 )
# calculate each age group's percent of the total
share.pop.by.age <- pop.by.age / sum( pop.by.age )
x <-
transform(
x ,
agewt ifelse( agecat %in% 0 , share.pop.by.age[ 1 ] ,
ifelse( agecat %in% 1 , share.pop.by.age[ 2 ] ,
ifelse( agecat %in% 2 , share.pop.by.age[ 3 ] ,
ifelse( agecat %in% 3 , share.pop.by.age[ 4 ] , NA ) ) ) ) )
# initiate the survey object
y <-
svydesign(
id = ~SDMVPSU ,
strata = ~SDMVSTRA ,
nest = TRUE ,
weights = ~WTMEC2YR ,
data = x
)
# only perform analyses among individuals aged 20 and over
z <- subset( y , RIDAGEYR >= 20 )
# these four commands calculate the correct non-age-adjusted estimate
svymean( ~HI_CHOL , z , na.rm = T )
svyby( ~HI_CHOL , ~race , z , svymean , na.rm = T )
svyby( ~HI_CHOL , ~RIAGENDR , z , svymean , na.rm = T )
svyby( ~HI_CHOL , ~RIAGENDR+race , z , svymean , na.rm = T )
# but matching the figure exactly requires an exact age adjustment.
# create the population types vector
pop.types <-
data.frame(
agecat = 0:3 ,
Freq = c( 55901 , 77670 , 72816 , 45364 )
)
z.postStratified <- postStratify( z , ~agecat , pop.types , partial = T )
# these estimates are close, but not exactly right.
svymean( ~HI_CHOL , z.postStratified , na.rm = T )
svyby( ~HI_CHOL , ~race , z.postStratified , svymean , na.rm = T )
svyby( ~HI_CHOL , ~RIAGENDR , z.postStratified , svymean , na.rm = T )
svyby( ~HI_CHOL , ~RIAGENDR+race , z.postStratified , svymean , na.rm = T )
###########export for stata#################
# write the file to a stata file on your local drive..
write.dta( x , "c:/temp/temp.dta" )
#######STATA/MP 11.2 COMMANDS###############
use "c:\temp\temp.dta", clear
svyset [w= WTMEC2YR], psu(SDMVPSU) strata(SDMVSTRA) vce(linearized)
gen adult = 1 if agecat > 0
# these three commands produce the correct results
svy, subpop(adult): mean HI_CHOL, stdize(agecat) stdweight(agewt)
svy, subpop(adult): mean HI_CHOL, stdize(agecat) stdweight(agewt) over(
race )
svy, subpop(adult): mean HI_CHOL, stdize(agecat) stdweight(agewt) over(
race RIAGENDR )
[[alternative HTML version deleted]]
Thomas Lumley
2012-Aug-13 04:27 UTC
[R] Direct Method Age-Adjustment to Complex Survey Data
On Sat, Aug 11, 2012 at 5:53 AM, Anthony Damico <ajdamico at gmail.com> wrote:> Hi everyone, my apologies in advance if I'm overlooking something simple in > this question. I am trying to use R's survey package to make a direct > method age-adjustment to some complex survey data. I have played with > postStratify, calibrate, rake, and simply multiplying the base weights by > the correct proportions - nothing seems to hit the published numbers on the > nose.<snip>> # but matching the figure exactly requires an exact age adjustment. > > # create the population types vector > pop.types <- > data.frame( > agecat = 0:3 , > Freq = c( 55901 , 77670 , 72816 , 45364 ) > ) > > > z.postStratified <- postStratify( z , ~agecat , pop.types , partial = T )The standardization in the CDC examples is within each subpopulation. That is, they standardise each race/ethnicity group to the Census age structure, rather than standardising the whole population. That's the whole point -- they want to look at an imaginary population where age and race aren't confounded. When I do this, it almost exactly matches. The next step was to drop all the missing data and reweight just the non-missing data. That works exactly. (I also think you have the wrong recoding of RIDRETH1). demog<-read.xport("~/Downloads/demo_f.xpt") chol<-read.xport("~/Downloads/TCHOL_f.xpt") alldata<-merge(demog,chol) alldata<-subset(alldata, RIDSTATR %in% 2) alldata<-transform(alldata, HI_CHOL = ifelse(LBXTC>=240,1,0)) alldata<-transform(alldata, race=c(1,1,2,3,4)[RIDRETH1]) alldata<-transform(alldata, agecat=cut(RIDAGEYR,c(0,19,39,59, Inf))) popage<-c(55901,77670,72816,45364) racegender<-as.data.frame(svytable(~race+RIAGENDR,design)) racegenderage<-expand.grid(race=1:4,RIAGENDR=1:2,agecat=levels(alldata$agecat)) racegenderage$Freq<- as.vector(outer(racegender$Freq, popage/sum(popage))) design <- svydesign(id=~SDMVPSU, strata=~SDMVSTRA,nest=TRUE,weights=~WTMEC2YR,data=alldata) svyby(~HI_CHOL,~race+RIAGENDR,design=subset(postStratify(design,~race+RIAGENDR+agecat,racegenderage),RIDAGEYR>=20),svymean,na.rm=TRUE) somedata<-subset(alldata, !is.na(LBXTC)) design1 <- svydesign(id=~SDMVPSU, strata=~SDMVSTRA,nest=TRUE,weights=~WTMEC2YR,data=somedata) svyby(~HI_CHOL,~race+RIAGENDR,design=subset(postStratify(design1,~race+RIAGENDR+agecat,racegenderage),RIDAGEYR>=20),svymean,na.rm=TRUE) -thomas -- Thomas Lumley Professor of Biostatistics University of Auckland