thr3ads.net - R help - [R] Surprising behaviour survey-package with missing values [Aug 2010]

If this information is useful, please help other people find it:
Share via:

Jan van der Laan

2010-Aug-25 11:50 UTC

[R] Surprising behaviour survey-package with missing values

Dear list,

I got some surprising results when using the svytotal routine from the
survey package with data containing missing values.

Some example code demonstrating the behaviour is included below.

I have a stratified sampling design where I want to estimate the total
income. In some strata some of the incomes are missing. I want to
ignore these missing incomes. I would have expected that
svytotal(~income, design=mydesign, na.rm=TRUE) would do the trick.
However, when calculating the estimates 'by hand' the estimates were
different from those obtained from svytotal. The estimated mean
incomes do agree with each other. It seems that using the na.rm option
with svytotal is the same as replacing the missing values with zero's,
which is not what I would have expected, especially since this
behaviour seems to differ from that of svymean. Is there a reason for
this behaviour?

I can of course remove the missing values myself before creating the
survey object. However, with many different variables with different
missing values, this is not very practical. Is there an easy way to
get the behaviour I want?

Thanks for your help.

With regards,

Jan van der Laan


=== EXAMPLE ==
library(survey)
library(plyr)

# generate some data
data <- data.frame(
  id = 1:20,
  stratum = rep(c("a", "b"), each=10),
  income = rnorm(20, 100),
  n = rep(c(100, 200), each=10)
  )
data$income[5] <- NA

# calculate mean and total income for every stratum using survey package
des <- svydesign(ids=~id, strata=~stratum, data=data, fpc=~n)
svyby(~income, by=~stratum, FUN=svytotal, design=des, na.rm=TRUE)
mn  <- svyby(~income, by=~stratum, FUN=svymean, design=des, na.rm=TRUE)
mn
n   <- svyby(~n, by=~stratum, FUN=svymean, design=des)

# total does not equal mean times number of persons in stratum
mn[2] * n[2]

# calculate mean and total income 'by hand'. This does not give the same
total
# as svytotal, but it does give the same mean
ddply(data, .(stratum), function(d) {
    data.frame(
      mean = mean(d$income, na.rm=TRUE),
      n = mean(d$n),
      total = mean(d$income, na.rm=TRUE) * mean(d$n)
    )
  })

# when we set income to 0 for missing cases and repeat the previous estimation
# we get the same answer as svytotal (but not svymean)
data2 <- data
data2$income[is.na(data$income )] <- 0
ddply(data2, .(stratum), function(d) {
    data.frame(
      mean = mean(d$income, na.rm=TRUE),
      n = mean(d$n),
      total = mean(d$income, na.rm=TRUE) * mean(d$n)
    )
  })

Jan van der LAan

2010-Aug-26 08:05 UTC

head link

[R] Surprising behaviour survey-package with missing values

[I already sent this message to the list almost a day ago using an  
other email address, but that message does not seem to get through. In  
case it finally does get through, I appologize for the the double  
posting]

Dear list,

I got some surprising results when using the svytotal routine from the  
survey package with data containing missing values.

Some example code demonstrating the behaviour is included below.

I have a stratified sampling design where I want to estimate the total  
income. In some strata some of the incomes are missing. I want to  
ignore these missing incomes. I would have expected that  
svytotal(~income, design=mydesign, na.rm=TRUE) would do the trick.  
However, when calculating the estimates 'by hand' the estimates were  
different from those obtained from svytotal. The estimated mean  
incomes do agree with each other. It seems that using the na.rm option  
with svytotal is the same as replacing the missing values with zero's,  
which is not what I would have expected, especially since this  
behaviour seems to differ from that of svymean. Is there a reason for  
this behaviour?

I can of course remove the missing values myself before creating the  
survey object. However, with many different variables with different  
missing values, this is not very practical. Is there an easy way to  
get the behaviour I want?

Thanks for your help.

With regards,

Jan van der Laan


=== EXAMPLE ==
library(survey)
library(plyr)

# generate some data
data <- data.frame(
  id = 1:20,
  stratum = rep(c("a", "b"), each=10),
  income = rnorm(20, 100),
  n = rep(c(100, 200), each=10)
  )
data$income[5] <- NA

# calculate mean and total income for every stratum using survey package
des <- svydesign(ids=~id, strata=~stratum, data=data, fpc=~n)
svyby(~income, by=~stratum, FUN=svytotal, design=des, na.rm=TRUE)
mn  <- svyby(~income, by=~stratum, FUN=svymean, design=des, na.rm=TRUE)
mn
n   <- svyby(~n, by=~stratum, FUN=svymean, design=des)

# total does not equal mean times number of persons in stratum
mn[2] * n[2]

# calculate mean and total income 'by hand'. This does not give the same
total
# as svytotal, but it does give the same mean
ddply(data, .(stratum), function(d) {
    data.frame(
      mean = mean(d$income, na.rm=TRUE),
      n = mean(d$n),
      total = mean(d$income, na.rm=TRUE) * mean(d$n)
    )
  })

# when we set income to 0 for missing cases and repeat the previous estimation
# we get the same answer as svytotal (but not svymean)
data2 <- data
data2$income[is.na(data$income )] <- 0
ddply(data2, .(stratum), function(d) {
    data.frame(
      mean = mean(d$income, na.rm=TRUE),
      n = mean(d$n),
      total = mean(d$income, na.rm=TRUE) * mean(d$n)
    )
  })

Apparently Analagous Threads

Search for more seemingly similar threads

R help - Aug 2010 - Surprising behaviour survey-package with missing values

[R] Surprising behaviour survey-package with missing values

[R] Surprising behaviour survey-package with missing values

Apparently Analagous Threads