Jan van der Laan
2010-Aug-25 11:50 UTC
[R] Surprising behaviour survey-package with missing values
Dear list, I got some surprising results when using the svytotal routine from the survey package with data containing missing values. Some example code demonstrating the behaviour is included below. I have a stratified sampling design where I want to estimate the total income. In some strata some of the incomes are missing. I want to ignore these missing incomes. I would have expected that svytotal(~income, design=mydesign, na.rm=TRUE) would do the trick. However, when calculating the estimates 'by hand' the estimates were different from those obtained from svytotal. The estimated mean incomes do agree with each other. It seems that using the na.rm option with svytotal is the same as replacing the missing values with zero's, which is not what I would have expected, especially since this behaviour seems to differ from that of svymean. Is there a reason for this behaviour? I can of course remove the missing values myself before creating the survey object. However, with many different variables with different missing values, this is not very practical. Is there an easy way to get the behaviour I want? Thanks for your help. With regards, Jan van der Laan === EXAMPLE == library(survey) library(plyr) # generate some data data <- data.frame( id = 1:20, stratum = rep(c("a", "b"), each=10), income = rnorm(20, 100), n = rep(c(100, 200), each=10) ) data$income[5] <- NA # calculate mean and total income for every stratum using survey package des <- svydesign(ids=~id, strata=~stratum, data=data, fpc=~n) svyby(~income, by=~stratum, FUN=svytotal, design=des, na.rm=TRUE) mn <- svyby(~income, by=~stratum, FUN=svymean, design=des, na.rm=TRUE) mn n <- svyby(~n, by=~stratum, FUN=svymean, design=des) # total does not equal mean times number of persons in stratum mn[2] * n[2] # calculate mean and total income 'by hand'. This does not give the same total # as svytotal, but it does give the same mean ddply(data, .(stratum), function(d) { data.frame( mean = mean(d$income, na.rm=TRUE), n = mean(d$n), total = mean(d$income, na.rm=TRUE) * mean(d$n) ) }) # when we set income to 0 for missing cases and repeat the previous estimation # we get the same answer as svytotal (but not svymean) data2 <- data data2$income[is.na(data$income )] <- 0 ddply(data2, .(stratum), function(d) { data.frame( mean = mean(d$income, na.rm=TRUE), n = mean(d$n), total = mean(d$income, na.rm=TRUE) * mean(d$n) ) })
Jan van der LAan
2010-Aug-26 08:05 UTC
[R] Surprising behaviour survey-package with missing values
[I already sent this message to the list almost a day ago using an other email address, but that message does not seem to get through. In case it finally does get through, I appologize for the the double posting] Dear list, I got some surprising results when using the svytotal routine from the survey package with data containing missing values. Some example code demonstrating the behaviour is included below. I have a stratified sampling design where I want to estimate the total income. In some strata some of the incomes are missing. I want to ignore these missing incomes. I would have expected that svytotal(~income, design=mydesign, na.rm=TRUE) would do the trick. However, when calculating the estimates 'by hand' the estimates were different from those obtained from svytotal. The estimated mean incomes do agree with each other. It seems that using the na.rm option with svytotal is the same as replacing the missing values with zero's, which is not what I would have expected, especially since this behaviour seems to differ from that of svymean. Is there a reason for this behaviour? I can of course remove the missing values myself before creating the survey object. However, with many different variables with different missing values, this is not very practical. Is there an easy way to get the behaviour I want? Thanks for your help. With regards, Jan van der Laan === EXAMPLE == library(survey) library(plyr) # generate some data data <- data.frame( id = 1:20, stratum = rep(c("a", "b"), each=10), income = rnorm(20, 100), n = rep(c(100, 200), each=10) ) data$income[5] <- NA # calculate mean and total income for every stratum using survey package des <- svydesign(ids=~id, strata=~stratum, data=data, fpc=~n) svyby(~income, by=~stratum, FUN=svytotal, design=des, na.rm=TRUE) mn <- svyby(~income, by=~stratum, FUN=svymean, design=des, na.rm=TRUE) mn n <- svyby(~n, by=~stratum, FUN=svymean, design=des) # total does not equal mean times number of persons in stratum mn[2] * n[2] # calculate mean and total income 'by hand'. This does not give the same total # as svytotal, but it does give the same mean ddply(data, .(stratum), function(d) { data.frame( mean = mean(d$income, na.rm=TRUE), n = mean(d$n), total = mean(d$income, na.rm=TRUE) * mean(d$n) ) }) # when we set income to 0 for missing cases and repeat the previous estimation # we get the same answer as svytotal (but not svymean) data2 <- data data2$income[is.na(data$income )] <- 0 ddply(data2, .(stratum), function(d) { data.frame( mean = mean(d$income, na.rm=TRUE), n = mean(d$n), total = mean(d$income, na.rm=TRUE) * mean(d$n) ) })