A.F.Fenton at lse.ac.uk
2012-Apr-13 17:44 UTC
[R] problem with svyby and NAs (survey package)
Hello

I'm trying to get the proportion "true" for dichotomous variable for various subgroups in a survey.

This works fine, but obviously doesn't give proportions directly:

svytable(~SurvYear+problem.vandal, seh.dsn, round=TRUE)
        problem.vandal
SurvYear FALSE  TRUE
    1995  8906   786
    1997 17164  2494
    1998 17890  1921
    1999 18322  1669
    2001 17623  2122
...

Note some years are missing - they are part of the dataset, but all responses are NA (the question wasn't asked).

However, this gives an error, and I'd like to understand why - it works for variables without missing years:

svyby(~problem.vandal, ~SurvYear, seh.dsn, svymean, na.rm=TRUE)
Error in tapply(1:NROW(x), list(factor(strata)), function(index) { :
  arguments must have same length

The error only occurs when na.rm=TRUE and there are no observations in one year.

Thanks
alex
A.F.Fenton at lse.ac.uk
2012-Apr-13 19:17 UTC
[R] problem with svyby and NAs (survey package)
> I'm trying to get the proportion "true" for dichotomous variable for various
> subgroups in a survey.

Sorry, I'm new to the list, and just saw the advice about minimally reproducible code. Here goes:

library(survey)
foo <- data.frame(id = 1:25, weight = runif(25), year = rep(2002:2006, 5), problem = rnorm(25) > 0)
foo.dsn = svydesign(id=~id, weight=~weight, data=foo)
svyby(~problem, ~year, foo.dsn, svymean, na.rm=TRUE) # Fine

# One year is missing
foo[foo$year == 2004, "problem"] = NA
foo.dsn = svydesign(id=~id, weight=~weight, data=foo)
svyby(~problem, ~year, foo.dsn, svymean, na.rm=TRUE) # Error

thanks
alex
On Sat, Apr 14, 2012 at 5:44 AM, <A.F.Fenton at lse.ac.uk> wrote:
> Hello
>
> I'm trying to get the proportion "true" for dichotomous variable for
> various subgroups in a survey.
>
> This works fine, but obviously doesn't give proportions directly:
> svytable(~SurvYear+problem.vandal, seh.dsn, round=TRUE)
>         problem.vandal
> SurvYear FALSE  TRUE
>     1995  8906   786
>     1997 17164  2494
>     1998 17890  1921
>     1999 18322  1669
>     2001 17623  2122
> ...
>
> Note some years are missing - they are part of the dataset, but all
> responses are NA (the question wasn't asked).
>
> However, this gives an error, and I'd like to understand why - it works
> for variables without missing years:
>
> svyby(~problem.vandal, ~SurvYear, seh.dsn, svymean, na.rm=TRUE)
> Error in tapply(1:NROW(x), list(factor(strata)), function(index) { :
>   arguments must have same length
>
> The error only occurs when na.rm=TRUE and there are no observations in
> one year.

The error occurs because you are asking for the mean of a vector of all NAs. svyby() just calls svymean() on each subset of the data. In your reproducible example,

svymean(~problem, subset(foo.dsn, year==2004), na.rm=TRUE)

will give the same error, and a work-around is to use subset(foo.dsn, year!=2004) in the call to svyby().

Now, svymean() is entitled to be a bit upset: you asked for the mean of all the non-missing values, but you didn't give it any non-missing values. What should it do? It obviously can't return a sensible proportion, because it was given no data. It could just return NaN as the answer, as mean() does, but that wouldn't help you here, since svyby() is expecting a vector of two proportions and a covariance matrix for them.

Obviously it would be possible to rewrite svymean() to handle empty data, and I'll do that, but that doesn't solve the general problem of what happens when svyby() asks for something impossible.
It would also be possible for svyby() to trap errors and treat them as empty results, but that would have the disadvantage of making debugging a lot harder.

   -thomas

-- 
Thomas Lumley
Professor of Biostatistics
University of Auckland
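
For reference, the work-around suggested in the reply can be sketched as below. This is an illustration only, not code from either poster: the data frame mirrors the reproducible example but uses a fixed pattern for `problem` so that every year has both TRUE and FALSE responses, and the names `ok_years` and `res` are made up for this sketch. It drops the years with no non-missing responses before calling svyby(), and also shows the base-R mean() behaviour the reply refers to.

```r
library(survey)

# Base R: the mean of "all non-missing values" when there are none is NaN
mean(c(NA, NA), na.rm = TRUE)

# Data shaped like the reproducible example (fixed 'problem' pattern, an
# assumption made here so each year has both TRUE and FALSE responses)
set.seed(1)
foo <- data.frame(id = 1:25,
                  weight = runif(25),
                  year = rep(2002:2006, 5),
                  problem = rep(c(TRUE, FALSE, TRUE, TRUE, FALSE), each = 5))
foo[foo$year == 2004, "problem"] <- NA   # question not asked in 2004

foo.dsn <- svydesign(id = ~id, weights = ~weight, data = foo)

# Years that have at least one non-missing response
ok_years <- unique(foo$year[!is.na(foo$problem)])

# Subset the design object (not the data frame) and then call svyby()
res <- svyby(~problem, ~year, subset(foo.dsn, year %in% ok_years),
             svymean, na.rm = TRUE)
res
```

The subsetting is done on the design object so that the design information stays attached to the remaining observations; the excluded year simply does not appear in the result.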