Dear List:

(reposting due to word wrap problems, my apologies)

I am working to understand some differences between the results of the svymean() function in the survey package and the results from code I have written myself. The results from svymean() also agree with results I get from SAS proc surveymeans, so this suggests I am misunderstanding something. I am never comfortable with an "I did what the software does" mentality, so I am working to fully replicate the methods to make sure I know what is going on; then I can feel better about the software.

Below I provide three things for this conversation, should anyone engage. First, I provide an example of R code where my code replicates svymean() exactly. Second, I provide an example where the same code is used but the results of my code and svymean() do not agree. Last, at the very bottom is minimal LaTeX code that you can compile to see my derivation of the variance estimator using the Taylor series expansion. My code reflects this derivation.

### Below is an example where my code and svymean agree.
### My code implements the Taylor series reflected in the LaTeX below.
### These data are balanced in that there are the same number of individuals per cluster.

### Create correlated data
set.seed(100)
x <- rnorm(10)
y <- rep(10,10)
z <- rep(x,y)
score <- rnorm(100) + z
mm <- data.frame(group = gl(10,10), score)

### Use svymean to get results
gg <- svydesign(ids = ~group, data = mm)
svymean(~score, gg, deff='replace', na.rm=TRUE)

### My code which replicates the formulae in the LaTeX code below
## Cluster totals
xx <- with(mm, tapply(score, group, sum, na.rm=TRUE))
## Mean of the cluster totals
ss <- mean(xx, na.rm=TRUE)
## Get the variance of the total
k <- length(xx)
totvar <- sum((xx-ss)^2) * k/(k-1)
## The SE is then
N <- nrow(mm)
sqrt(totvar/N^2)

If you run this you can see the results are the same between my code and svymean(). One additional thing to point out is this:

## sqrt of variance of the total is:
> svytotal(~score, gg)
        total     SE
score -1.7923 21.139

## sqrt of variance of total from my code
> sqrt(totvar)
[1] 21.13849

That is, the sqrt of the variance of the total from svytotal and my code are the same.

### Now, the code below presents only the output as this is from
### a data file I am working with.
### I can send the data to anyone interested.
### It is the same code as above, just applied to that data file.

> gg <- svydesign(ids = ~CSRSSchoolCode, data = qq)
Warning message:
In svydesign.default(ids = ~CSRSSchoolCode, data = qq) :
  No weights or probabilities supplied, assuming equal probability

> svymean(~OSPI25632, gg, deff='replace', na.rm=TRUE)
             mean      SE   DEff
OSPI25632 0.84250 0.01072 1.0384

> ### My code which replicates the formulae in the LaTeX code below
> ## Cluster totals
> xx <- with(qq, tapply(OSPI25632, CSRSSchoolCode, sum, na.rm=TRUE))
> ## Mean of the cluster totals
> ss <- mean(xx, na.rm=TRUE)
> ## Get the variance of the total
> k <- length(xx)
> totvar <- sum((xx-ss)^2) * k/(k-1)
> ## The SE is then
> N <- nrow(qq)
> sqrt(totvar/N^2)
[1] 0.02158459

Now, we can see the SEs are different. But look at the following:

> svytotal(~OSPI25632, gg)
          total     SE
OSPI25632  1011 25.901

> sqrt(totvar)
[1] 25.90151

The sqrt of the variance of the total is the same as what my code produces. But, because we get a different SE, the only difference, I think, is in the denominator, which is the N^2.
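Since I can't post the qq file to the list, here is a small simulated sketch for anyone who wants something runnable. The cluster sizes below are made up (they are not my actual school sizes); the only point is that they are unequal, which seems to be enough to produce the same kind of disagreement:

## Simulated sketch: made-up, unequal cluster sizes
library(survey)
set.seed(200)
sizes <- c(5, 8, 12, 15, 20, 20, 25, 30, 30, 35)
grp   <- factor(rep(seq_along(sizes), sizes))
yy    <- rnorm(sum(sizes)) + rep(rnorm(length(sizes)), sizes)
dd    <- data.frame(group = grp, y = yy)

des <- svydesign(ids = ~group, data = dd)
svymean(~y, des)

## The same cluster-total calculation as above
ct     <- with(dd, tapply(y, group, sum))
k      <- length(ct)
totvar <- sum((ct - mean(ct))^2) * k/(k - 1)
sqrt(totvar)           # should match the SE from svytotal(~y, des)
sqrt(totvar)/nrow(dd)  # generally will not match the svymean SE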
Now, in my actual data the N size is:

> nrow(qq)
[1] 1200

But we can do a little algebra and see what N size is being used by svymean to get the SE:

# the numerator is the sqrt of the variance of the total from svytotal and
# the denominator is the SE returned by svymean
> 25.901/.01072
[1] 2416.138

So, if we take 2416.138 as the N size, I can replicate the svymean standard error:

> 25.901/2416.138
[1] 0.01072

But this is not the N size. The total N is 1200 and there are 519 clusters in the observed data. So, as far as I can tell, svymean() is using a different N size than what I think the formula from the Taylor series expansion would dictate. This is what I cannot seem to reconcile. I am certain there is either another difference somewhere or a good reason why this is occurring; I just can't seem to get my head around the reason and would appreciate anyone with insight into this problem. sessionInfo() and the LaTeX are below.

Many thanks,
Harold

Below is the minimal LaTeX code for additional transparency.

\documentclass[12pt]{article}
\usepackage{bm,geometry}
\begin{document}

First define the ratio estimator of the mean as:
\begin{equation}
f(Y) = \frac{Y}{N}
\end{equation}
\noindent where $Y$ is the total of the variable and $N$ is the population size. A first-order Taylor series expansion of the ratio estimator $f(Y)$ can be used to derive the design-consistent variance estimator as:
\begin{equation}
var(f(Y)) = \left[\frac{\partial f(Y)}{\partial Y}\right]^2 var(Y)
\end{equation}
\noindent where
\begin{equation}
\left[\frac{\partial f(Y)}{\partial Y}\right] = \frac{1}{N}
\end{equation}
\begin{equation}
var(Y) = \frac{k}{k-1} \sum_{j=1}^k(\hat{Y}_j-\hat{Y}_{..})^2
\end{equation}
\begin{equation}
\hat{Y}_j = \sum_{i=1}^{n_j}\hat{Y}_{j(i)}
\end{equation}
\begin{equation}
\hat{Y}_{..} = k^{-1} \sum_{j=1}^k \hat{Y}_j
\end{equation}
\noindent where $j$ indexes cluster $(1, 2, \ldots, k)$, $j(i)$ indexes the $i$th member of cluster $j$, and $n_j$ is the total number of members in cluster $j$. The standard error is then taken as:
\begin{equation}
se = \sqrt{var(f(Y))} = \sqrt{\frac{var(Y)}{N^2}}
\end{equation}
\end{document}

> sessionInfo()
R version 2.7.1 (2008-06-23)
i386-pc-mingw32

locale:
LC_COLLATE=English_United States.1252;LC_CTYPE=English_United States.1252;LC_MONETARY=English_United States.1252;LC_NUMERIC=C;LC_TIME=English_United States.1252

attached base packages:
[1] stats graphics grDevices utils datasets methods base

other attached packages:
[1] survey_3.8

loaded via a namespace (and not attached):
[1] tools_2.7.1
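P.S. For completeness, the derivation above collapses into one small R function. This is only a restatement of the cluster-total code earlier in the post, so it reproduces my SE, not the one svymean() reports for the qq data:

## Equations (4)-(7) above, written as a single function
se_from_cluster_totals <- function(y, cluster) {
  Yj   <- tapply(y, cluster, sum, na.rm = TRUE)  # \hat{Y}_j, equation (5)
  k    <- length(Yj)
  Ybar <- mean(Yj)                               # \hat{Y}_{..}, equation (6)
  varY <- k/(k - 1) * sum((Yj - Ybar)^2)         # var(Y), equation (4)
  sqrt(varY)/length(y)                           # se = sqrt(var(Y))/N, equation (7)
}
## se_from_cluster_totals(mm$score, mm$group) agrees with svymean() in the
## balanced example, while for the qq data it reproduces the 0.02158 above
## rather than the 0.01072 from svymean().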
Harold,

in design-based estimation, thinking in terms of "what is my (effective) sample size" rarely works out.

First of all, unless you have a fixed sample size design, your sample size itself is a random variable. You can hope for fixed sample sizes with some excruciatingly controlled clinical studies, but with most other surveys you are at the mercy of non-response, unknown cluster sizes, interviewer availability, and all sorts of field problems. So in ratio estimation (and estimation of the mean is ratio estimation: mean[Y] = total[Y]/total[1]), your standard error should account for the randomness in the sample size, so your Taylor series linearization formula needs the variance of the denominator as well as the correlation between the cluster totals of the Y's and the 1's.

Second, you probably have different cluster/PSU sizes. That is exactly what contributes to the variability of total[1], and at any rate that variability invalidates the simple formulae for balanced PSU sizes that your code is using.

Third, at least theoretically, there might be finite population corrections, although you don't seem to specify any in your svydesign definition. And frankly, I've never seen a survey where weights were not needed.

If you want to take a look at some references, Korn & Graubard 1999 (http://www.citeulike.org/user/ctacmo/article/553280) might be a good starting point; they have a pretty thorough discussion of issues with variance estimation in cluster samples. A more technical reading is Thompson 1997 (http://www.citeulike.org/user/ctacmo/article/1036973).

On 8/15/08, Doran, Harold <HDoran at air.org> wrote:
> Dear List:
>
> I am working to understand some differences between the results of the
> svymean() function in the survey package and from code I have written
> myself. The results from svymean() also agree with results I get from
> SAS proc surveymeans, so, this suggests I am misunderstanding something.
>

--
Stas Kolenikov, also found at http://stas.kolenikov.name
Small print: I use this email account for mailing lists only.
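To make the linearization point concrete: under the conditions in the examples above (equal weights of 1, no finite population correction), the linearized value for the mean is (y_i - ybar)/N, and its cluster totals go into the same k/(k-1) variance formula. A minimal sketch:

## Sketch of the Taylor linearization for mean(y) = total(y)/total(1),
## assuming equal weights of 1 and no fpc. Working with the residuals
## (y - ybar) implicitly carries the variance of the denominator and the
## covariance between the cluster totals of the y's and the 1's.
library(survey)
ratio_se <- function(y, cluster) {
  N  <- length(y)
  ej <- tapply((y - mean(y))/N, cluster, sum)  # cluster totals of linearized values
  k  <- length(ej)
  sqrt(k/(k - 1) * sum((ej - mean(ej))^2))
}

## Check against the balanced example from the original post
set.seed(100)
x <- rnorm(10)
score <- rnorm(100) + rep(x, rep(10, 10))
mm <- data.frame(group = gl(10, 10), score)
svymean(~score, svydesign(ids = ~group, data = mm))  # survey package SE
ratio_se(mm$score, mm$group)                         # should agree, and should
                                                     # keep agreeing when the
                                                     # cluster sizes are unequal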