Dear List:

(reposting due to word wrap problems, my apologies)

I am working to understand some differences between the results of the svymean() function in the survey package and the results from code I have written myself. The results from svymean() also agree with results I get from SAS proc surveymeans, so this suggests I am misunderstanding something. I am never comfortable with an "I did what the software does" mentality, so I am working to fully replicate the methods to make sure I know what is going on; then I can feel better about the software.

Below I provide three things for this conversation, should anyone engage. First, I provide an example of R code where my code replicates svymean() exactly. Second, I provide an example where the same code is used but the results of my code and svymean() do not agree. Last, at the very bottom is minimal LaTeX code that you can compile to see my derivation of the variance estimator using the Taylor series expansion. My code reflects this derivation.

### Below is an example where my code and svymean agree.
### My code implements the Taylor series reflected in the LaTeX below.
### These data are balanced in that there are the same number of individuals per cluster.

### Create correlated data
set.seed(100)
x <- rnorm(10)
y <- rep(10,10)
z <- rep(x,y)
score <- rnorm(100) + z
mm <- data.frame(group = gl(10,10), score)

### Use svymean to get results
gg <- svydesign(ids = ~group, data = mm)
svymean(~score, gg, deff='replace', na.rm=TRUE)

### My code which replicates the formulae in the LaTeX code below
## Cluster totals
xx <- with(mm, tapply(score, group, sum, na.rm=TRUE))
## Mean of the cluster totals
ss <- mean(xx, na.rm=TRUE)
## Get the variance of the total
k <- length(xx)
totvar <- sum((xx-ss)^2) * k/(k-1)
## The SE is then
N <- nrow(mm)
sqrt(totvar/N^2)

If you run this you can see the results are the same between my code and svymean(). One additional thing to point out is this:

## sqrt of variance of the total is:
> svytotal(~score, gg)
        total     SE
score -1.7923 21.139

## sqrt of variance of total from my code
> sqrt(totvar)
[1] 21.13849

That is, the sqrt of the variance of the total from svytotal and my code are the same.

### Now, the code below presents only the output as this is from
### a data file I am working with.
### I can send the data to anyone interested.
### It is the same code as above, just applied to that data file.

> gg <- svydesign(ids = ~CSRSSchoolCode, data = qq)
Warning message:
In svydesign.default(ids = ~CSRSSchoolCode, data = qq) :
  No weights or probabilities supplied, assuming equal probability

> svymean(~OSPI25632, gg, deff='replace', na.rm=TRUE)
             mean      SE   DEff
OSPI25632 0.84250 0.01072 1.0384

> ### My code which replicates the formulae in the LaTeX code below
> ## Cluster totals
> xx <- with(qq, tapply(OSPI25632, CSRSSchoolCode, sum, na.rm=TRUE))
> ## Mean of the cluster totals
> ss <- mean(xx, na.rm=TRUE)
> ## Get the variance of the total
> k <- length(xx)
> totvar <- sum((xx-ss)^2) * k/(k-1)
> ## The SE is then
> N <- nrow(qq)
> sqrt(totvar/N^2)
[1] 0.02158459

Now, we can see the SEs are different. But look at the following:

> svytotal(~OSPI25632, gg)
          total     SE
OSPI25632  1011 25.901

> sqrt(totvar)
[1] 25.90151

The sqrt of the variance of the total is the same as what my code produces. But, because we get a different SE, the only difference, I think, is in the denominator, which is the N^2.
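Since I can't post the qq file to the list, here is a small simulated sketch for anyone who wants something runnable. The cluster sizes below are made up (they are not my actual school sizes); the only point is that they are unequal, which seems to be enough to produce the same kind of disagreement:

## Simulated sketch: made-up, unequal cluster sizes
library(survey)
set.seed(200)
sizes <- c(5, 8, 12, 15, 20, 20, 25, 30, 30, 35)
grp   <- factor(rep(seq_along(sizes), sizes))
yy    <- rnorm(sum(sizes)) + rep(rnorm(length(sizes)), sizes)
dd    <- data.frame(group = grp, y = yy)

des <- svydesign(ids = ~group, data = dd)
svymean(~y, des)

## The same cluster-total calculation as above
ct     <- with(dd, tapply(y, group, sum))
k      <- length(ct)
totvar <- sum((ct - mean(ct))^2) * k/(k - 1)
sqrt(totvar)           # should match the SE from svytotal(~y, des)
sqrt(totvar)/nrow(dd)  # generally will not match the svymean SE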
Now, in my actual data the N size is:

> nrow(qq)
[1] 1200

But we can do a little algebra and see what N size is being used by svymean to get the SE:

# the numerator is the sqrt of the variance of the total from svytotal and
# the denominator is the SE returned by svymean
> 25.901/.01072
[1] 2416.138

So, if we take 2416.138 as the N size, I can replicate the svymean standard error:

> 25.901/2416.138
[1] 0.01072

But this is not the N size. The total N is 1200 and there are 519 clusters in the observed data. So, as far as I can tell, svymean() is using a different N size than what I think the formula from the Taylor series expansion would dictate. This is what I cannot seem to reconcile. I am certain there is either another difference somewhere or a good reason why this is occurring; I just can't seem to get my head around the reason and would appreciate anyone with insight into this problem. sessionInfo() and the LaTeX are below.

Many thanks,
Harold

Below is the minimal LaTeX code for additional transparency.

\documentclass[12pt]{article}
\usepackage{bm,geometry}
\begin{document}

First define the ratio estimator of the mean as:
\begin{equation}
f(Y) = \frac{Y}{N}
\end{equation}
\noindent where $Y$ is the total of the variable and $N$ is the population size. A first-order Taylor series expansion of the ratio estimator $f(Y)$ can be used to derive the design-consistent variance estimator as:
\begin{equation}
var(f(Y)) = \left[\frac{\partial f(Y)}{\partial Y}\right]^2 var(Y)
\end{equation}
\noindent where
\begin{equation}
\left[\frac{\partial f(Y)}{\partial Y}\right] = \frac{1}{N}
\end{equation}
\begin{equation}
var(Y) = \frac{k}{k-1} \sum_{j=1}^k(\hat{Y}_j-\hat{Y}_{..})^2
\end{equation}
\begin{equation}
\hat{Y}_j = \sum_{i=1}^{n_j}\hat{Y}_{j(i)}
\end{equation}
\begin{equation}
\hat{Y}_{..} = k^{-1} \sum_{j=1}^k \hat{Y}_j
\end{equation}
\noindent where $j$ indexes cluster $(1, 2, \ldots, k)$, $j(i)$ indexes the $i$th member of cluster $j$, and $n_j$ is the total number of members in cluster $j$. The standard error is then taken as:
\begin{equation}
se = \sqrt{var(f(Y))} = \sqrt{\frac{var(Y)}{N^2}}
\end{equation}
\end{document}

> sessionInfo()
R version 2.7.1 (2008-06-23)
i386-pc-mingw32

locale:
LC_COLLATE=English_United States.1252;LC_CTYPE=English_United States.1252;LC_MONETARY=English_United States.1252;LC_NUMERIC=C;LC_TIME=English_United States.1252

attached base packages:
[1] stats graphics grDevices utils datasets methods base

other attached packages:
[1] survey_3.8

loaded via a namespace (and not attached):
[1] tools_2.7.1
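P.S. For completeness, the derivation above collapses into one small R function. This is only a restatement of the cluster-total code earlier in the post, so it reproduces my SE, not the one svymean() reports for the qq data:

## Equations (4)-(7) above, written as a single function
se_from_cluster_totals <- function(y, cluster) {
  Yj   <- tapply(y, cluster, sum, na.rm = TRUE)  # \hat{Y}_j, equation (5)
  k    <- length(Yj)
  Ybar <- mean(Yj)                               # \hat{Y}_{..}, equation (6)
  varY <- k/(k - 1) * sum((Yj - Ybar)^2)         # var(Y), equation (4)
  sqrt(varY)/length(y)                           # se = sqrt(var(Y))/N, equation (7)
}
## se_from_cluster_totals(mm$score, mm$group) agrees with svymean() in the
## balanced example, while for the qq data it reproduces the 0.02158 above
## rather than the 0.01072 from svymean().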
Harold,

in design-based estimation, thinking in terms of "what is my (effective) sample size" rarely works out.

First of all, unless you have a fixed sample size design, your sample size itself is a random variable. You can hope for fixed sample sizes with some excruciatingly controlled clinical studies, but with most other surveys you are at the mercy of non-response, unknown cluster sizes, interviewer availability, and all sorts of field problems. So in ratio estimation (and estimation of the mean is ratio estimation: mean[Y] = total[Y]/total[1]), your standard error should account for the randomness in the sample size, so your Taylor series linearization formula needs the variance of the denominator as well as the correlation between the cluster totals of the Y's and the 1's.

Second, you probably have different cluster/PSU sizes. That is exactly what contributes to the variability of total[1], and at any rate that variability invalidates the simple formulae for balanced PSU sizes that your code is using.

Third, at least theoretically, there might be finite population corrections, although you don't seem to specify any in your svydesign definition. And frankly, I've never seen a survey where weights were not needed.

If you want to take a look at some references, Korn & Graubard 1999 (http://www.citeulike.org/user/ctacmo/article/553280) might be a good starting point; they have a pretty thorough discussion of issues with variance estimation in cluster samples. A more technical reading is Thompson 1997 (http://www.citeulike.org/user/ctacmo/article/1036973).

On 8/15/08, Doran, Harold <HDoran at air.org> wrote:
> Dear List:
>
> I am working to understand some differences between the results of the
> svymean() function in the survey package and from code I have written
> myself. The results from svymean() also agree with results I get from
> SAS proc surveymeans, so, this suggests I am misunderstanding something.
>

--
Stas Kolenikov, also found at http://stas.kolenikov.name
Small print: I use this email account for mailing lists only.
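To make the linearization point concrete: under the conditions in the examples above (equal weights of 1, no finite population correction), the linearized value for the mean is (y_i - ybar)/N, and its cluster totals go into the same k/(k-1) variance formula. A minimal sketch:

## Sketch of the Taylor linearization for mean(y) = total(y)/total(1),
## assuming equal weights of 1 and no fpc. Working with the residuals
## (y - ybar) implicitly carries the variance of the denominator and the
## covariance between the cluster totals of the y's and the 1's.
library(survey)
ratio_se <- function(y, cluster) {
  N  <- length(y)
  ej <- tapply((y - mean(y))/N, cluster, sum)  # cluster totals of linearized values
  k  <- length(ej)
  sqrt(k/(k - 1) * sum((ej - mean(ej))^2))
}

## Check against the balanced example from the original post
set.seed(100)
x <- rnorm(10)
score <- rnorm(100) + rep(x, rep(10, 10))
mm <- data.frame(group = gl(10, 10), score)
svymean(~score, svydesign(ids = ~group, data = mm))  # survey package SE
ratio_se(mm$score, mm$group)                         # should agree, and should
                                                     # keep agreeing when the
                                                     # cluster sizes are unequal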