Anthony Damico
2012-Aug-10 17:53 UTC
[R] Direct Method Age-Adjustment to Complex Survey Data
Hi everyone, my apologies in advance if I'm overlooking something simple in this question. I am trying to use R's survey package to make a direct method age-adjustment to some complex survey data. I have played with postStratify, calibrate, rake, and simply multiplying the base weights by the correct proportions - nothing seems to hit the published numbers on the nose. I am trying to replicate any of the following: the stdize and stdweight parameters from a Stata complex survey analysis command (like this) svyset [w= WEIGHTVAR], psu(PSUVAR) strata(STRATAVAR) vce(linearized) svy: mean VARNAME, stdize(agecat) stdweight(agewt) the stdvar and stdwgt parameters from a SUDAAN proc descript call the ESTIMATE statement in a PROC SURVEYREG (used to generate adjusted means) in SAS version 9.2 I am attempting to match figure 1 from nchs data brief #92 exactly using R. The data are here: http://www.cdc.gov/nchs/data/databriefs/db92_fig1.png and the full brief is here: http://www.cdc.gov/nchs/data/databriefs/db92.htm This is the National Health and Nutrition Examination Survey, produced by the US CDC. The CDC has published a bunch of syntax examples for generating age-adjusted statistics with SUDAAN, SAS, and Stata. Here's a link with all of that syntax: http://www.cdc.gov/nchs/tutorials/NHANES/NHANESAnalyses/agestandardization/age_standardization_intro.htm I've provided my detailed code below that gets to the point of the age adjustment, and then fails. However, when I export my R data frame near the end of the process and run a few Stata commands, I get the correct numbers easily. Note that the code below uses the download.file command twice to pull the (very small) data sets onto your local computer. Any guidance about how to perform this age-adjustment would be appreciated! Thanks!! Anthony Damico Kaiser Family Foundation require(foreign) require(survey) ############# download the NHANES 2009-2010 demographics and total cholesterol files ################## tf <- tempfile() download.file( " ftp://ftp.cdc.gov/pub/Health_Statistics/NCHS/nhanes/2009-2010/demo_f.xpt" , tf , mode = "wb" ) NHANES.0910.demographics.df <- read.xport( tf ) tf <- tempfile() download.file( " ftp://ftp.cdc.gov/pub/Health_Statistics/NCHS/nhanes/2009-2010/TCHOL_F.xpt" , tf , mode = "wb" ) NHANES.0910.TCHOL_F.df <- read.xport( tf ) ############# merge them together, keeping only a few columns ################## DemoKeepVars <- c( "SEQN" , "WTMEC2YR" , "RIDSTATR" , "SDMVPSU" , "SDMVSTRA" , "RIDAGEYR" , "RIAGENDR" , "RIDRETH1" ) TCHOLKeepVars <- c( "SEQN" , "LBXTC" ) x <- merge( NHANES.0910.demographics.df[ , DemoKeepVars ] , NHANES.0910.TCHOL_F.df[ , TCHOLKeepVars ] , all = T ) # keep only individuals who took the "mobile examination" x <- subset( x , RIDSTATR %in% 2 ) # make a few recodes.. x <- transform( x , # high cholesterol HI_CHOL ifelse( LBXTC >= 240 , 1 , 0 ) , # four race/ethnicity categories race ifelse( RIDRETH1 %in% 1:2 , 3 , ifelse( RIDRETH1 %in% 3 , 1 , ifelse( RIDRETH1 %in% 4 , 2 , ifelse( RIDRETH1 %in% 5 , 4 , NA ) ) ) ) , # four age categories agecat ifelse( RIDAGEYR %in% 0:19 , 0 , ifelse( RIDAGEYR %in% 20:39 , 1 , ifelse( RIDAGEYR %in% 40:59 , 2 , ifelse( RIDAGEYR >= 60 , 3 , NA ) ) ) ) ) # 2000 census populations to adjust to # from excel file: http://www.cdc.gov/nchs/tutorials/nhanes/Downloads/Continuous/ageadjwt.xls # found on page: http://www.cdc.gov/nchs/tutorials/nhanes/nhanesanalyses/agestandardization/Task1c.htm pop.by.age <- c( 55901 , 77670 , 72816 , 45364 ) # calculate each age group's percent of the total share.pop.by.age <- pop.by.age / sum( pop.by.age ) x <- transform( x , agewt ifelse( agecat %in% 0 , share.pop.by.age[ 1 ] , ifelse( agecat %in% 1 , share.pop.by.age[ 2 ] , ifelse( agecat %in% 2 , share.pop.by.age[ 3 ] , ifelse( agecat %in% 3 , share.pop.by.age[ 4 ] , NA ) ) ) ) ) # initiate the survey object y <- svydesign( id = ~SDMVPSU , strata = ~SDMVSTRA , nest = TRUE , weights = ~WTMEC2YR , data = x ) # only perform analyses among individuals aged 20 and over z <- subset( y , RIDAGEYR >= 20 ) # these four commands calculate the correct non-age-adjusted estimate svymean( ~HI_CHOL , z , na.rm = T ) svyby( ~HI_CHOL , ~race , z , svymean , na.rm = T ) svyby( ~HI_CHOL , ~RIAGENDR , z , svymean , na.rm = T ) svyby( ~HI_CHOL , ~RIAGENDR+race , z , svymean , na.rm = T ) # but matching the figure exactly requires an exact age adjustment. # create the population types vector pop.types <- data.frame( agecat = 0:3 , Freq = c( 55901 , 77670 , 72816 , 45364 ) ) z.postStratified <- postStratify( z , ~agecat , pop.types , partial = T ) # these estimates are close, but not exactly right. svymean( ~HI_CHOL , z.postStratified , na.rm = T ) svyby( ~HI_CHOL , ~race , z.postStratified , svymean , na.rm = T ) svyby( ~HI_CHOL , ~RIAGENDR , z.postStratified , svymean , na.rm = T ) svyby( ~HI_CHOL , ~RIAGENDR+race , z.postStratified , svymean , na.rm = T ) ###########export for stata################# # write the file to a stata file on your local drive.. write.dta( x , "c:/temp/temp.dta" ) #######STATA/MP 11.2 COMMANDS############### use "c:\temp\temp.dta", clear svyset [w= WTMEC2YR], psu(SDMVPSU) strata(SDMVSTRA) vce(linearized) gen adult = 1 if agecat > 0 # these three commands produce the correct results svy, subpop(adult): mean HI_CHOL, stdize(agecat) stdweight(agewt) svy, subpop(adult): mean HI_CHOL, stdize(agecat) stdweight(agewt) over( race ) svy, subpop(adult): mean HI_CHOL, stdize(agecat) stdweight(agewt) over( race RIAGENDR ) [[alternative HTML version deleted]]
Thomas Lumley
2012-Aug-13 04:27 UTC
[R] Direct Method Age-Adjustment to Complex Survey Data
On Sat, Aug 11, 2012 at 5:53 AM, Anthony Damico <ajdamico at gmail.com> wrote:> Hi everyone, my apologies in advance if I'm overlooking something simple in > this question. I am trying to use R's survey package to make a direct > method age-adjustment to some complex survey data. I have played with > postStratify, calibrate, rake, and simply multiplying the base weights by > the correct proportions - nothing seems to hit the published numbers on the > nose.<snip>> # but matching the figure exactly requires an exact age adjustment. > > # create the population types vector > pop.types <- > data.frame( > agecat = 0:3 , > Freq = c( 55901 , 77670 , 72816 , 45364 ) > ) > > > z.postStratified <- postStratify( z , ~agecat , pop.types , partial = T )The standardization in the CDC examples is within each subpopulation. That is, they standardise each race/ethnicity group to the Census age structure, rather than standardising the whole population. That's the whole point -- they want to look at an imaginary population where age and race aren't confounded. When I do this, it almost exactly matches. The next step was to drop all the missing data and reweight just the non-missing data. That works exactly. (I also think you have the wrong recoding of RIDRETH1). demog<-read.xport("~/Downloads/demo_f.xpt") chol<-read.xport("~/Downloads/TCHOL_f.xpt") alldata<-merge(demog,chol) alldata<-subset(alldata, RIDSTATR %in% 2) alldata<-transform(alldata, HI_CHOL = ifelse(LBXTC>=240,1,0)) alldata<-transform(alldata, race=c(1,1,2,3,4)[RIDRETH1]) alldata<-transform(alldata, agecat=cut(RIDAGEYR,c(0,19,39,59, Inf))) popage<-c(55901,77670,72816,45364) racegender<-as.data.frame(svytable(~race+RIAGENDR,design)) racegenderage<-expand.grid(race=1:4,RIAGENDR=1:2,agecat=levels(alldata$agecat)) racegenderage$Freq<- as.vector(outer(racegender$Freq, popage/sum(popage))) design <- svydesign(id=~SDMVPSU, strata=~SDMVSTRA,nest=TRUE,weights=~WTMEC2YR,data=alldata) svyby(~HI_CHOL,~race+RIAGENDR,design=subset(postStratify(design,~race+RIAGENDR+agecat,racegenderage),RIDAGEYR>=20),svymean,na.rm=TRUE) somedata<-subset(alldata, !is.na(LBXTC)) design1 <- svydesign(id=~SDMVPSU, strata=~SDMVSTRA,nest=TRUE,weights=~WTMEC2YR,data=somedata) svyby(~HI_CHOL,~race+RIAGENDR,design=subset(postStratify(design1,~race+RIAGENDR+agecat,racegenderage),RIDAGEYR>=20),svymean,na.rm=TRUE) -thomas -- Thomas Lumley Professor of Biostatistics University of Auckland