Maciej.Hoffman-Wecker@evotecoai.com
2001-Jun-07  14:41 UTC
[R] once more: methods on missing data
Thanks for the replies, but I was not precise enough.
The problem is not evaluating statistics on data with NA values.
The problem is evaluating statistics on data with length = 0.
To make the problem clearer, this is what I tried:
This works fine:
     tapply(as.numeric(c(NA,2)), as.factor(c("a","b")), summary)
But I need SDev as well, so I copied summary.default to my.summary and
changed only the lines
        qq <- signif(c(qq[1:3], mean(object), qq[4:5]), digits)
        names(qq) <- c("Min.", "1st Qu.", "Median", "Mean", "3rd Qu.", "Max.")
to
        qq <- signif(c(qq[1:3], mean(object), qq[4:5], sd(object),
                       mad(object)), digits)
        names(qq) <- c("Min.", "1st Qu.", "Median", "Mean", "3rd Qu.",
                       "Max.", "SDev", "MAD")
and
     tapply(as.numeric(c(NA,2)), as.factor(c("a","b")), my.summary)
results in
     Error in var(x, na.rm = na.rm) : `x' is empty
I think this is a frequent problem. It results from the following.
The result of the call
     x <- as.numeric(c(NA,NA,NA)); STATISTIC(x[!is.na(x)])
depends on the STATISTIC.
     STATISTIC           RESULT
     min                 Inf and warning message
     max                 -Inf and warning message
     mean                NaN and no warning message
     quantile            named vector containing NAs and no warning message
     sd                  evaluation aborted with an error message
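
The table can be rechecked on any installation with a short script. This is only a sketch: the helper name `probe` is hypothetical, and since the outcome for each statistic depends on the R version in use, no expected results are listed.

```r
# Empty numeric vector, obtained by dropping every NA.
x <- as.numeric(c(NA, NA, NA))
empty <- x[!is.na(x)]

# Hypothetical helper: apply one statistic to the empty vector and record
# whether it returned a value, raised a warning, or stopped with an error.
probe <- function(f) {
  tryCatch(
    list(status = "value", result = f(empty)),
    warning = function(w) list(status = "warning", result = conditionMessage(w)),
    error   = function(e) list(status = "error",   result = conditionMessage(e))
  )
}

stats <- list(min = min, max = max, mean = mean, quantile = quantile, sd = sd)
outcome <- lapply(stats, probe)
str(outcome)
```

Note that the warning handler ends the call, so for a statistic that warns the sketch records the warning text rather than the returned value.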
The abort is more difficult to handle.
What I did was change the var function. I replaced
     .Internal(cov(x, y, na.method))
to
     z <- try(.Internal(cov(x, y, na.method)))
     if (inherits(z, "try-error")) return(as.numeric(NA))
     else return(z)
This works fine, but a solution within cov.c would be better, I think.
I try not to change standard source code myself, as I don't know whether
this has any consequences.
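
One way to avoid touching var() at all would be a small wrapper used inside my.summary in place of sd(). The following is a minimal sketch under that assumption; the name `safe.sd`, the warning text, and the example data are hypothetical and not part of R.

```r
# Hypothetical wrapper: return NA with a warning when fewer than two
# non-missing values remain, otherwise the ordinary standard deviation.
safe.sd <- function(x) {
  x <- x[!is.na(x)]
  if (length(x) < 2) {
    warning("fewer than two non-missing values; returning NA")
    return(NA_real_)
  }
  sd(x)
}

# Group "a" holds only an NA; group "b" holds three real values.
values <- as.numeric(c(NA, 2, 4, 6))
groups <- as.factor(c("a", "b", "b", "b"))
res <- suppressWarnings(tapply(values, groups, safe.sd))
print(res)
```

The standard functions stay unchanged, so nothing else that calls var() is affected.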
Should not the statistics generally return NA and a warning message?
I hope this is not too marginal a problem.
Maciej
-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-
r-help mailing list -- Read http://www.ci.tuwien.ac.at/~hornik/R/R-FAQ.html
Send "info", "help", or "[un]subscribe"
(in the "body", not the subject !)  To: r-help-request at
stat.math.ethz.ch
_._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._
On Thu, 7 Jun 2001 Maciej.Hoffman-Wecker at evotecoai.com wrote in part:
>
> The result of the call
>
>      x <- as.numeric(c(NA,NA,NA)); STATISTIC(x[!is.na(x)])
>
> depends on the STATISTIC.
>
>      STATISTIC           RESULT
>      min                 Inf and warning message
>      max                 -Inf and warning message
>      mean                NaN and no warning message
>      quantile            named vector containing NAs and no warning message
>      sd                  abortion of the evaluation with an error message
>
<snip>
> Should not the statistics generally return NA and a warning message?
>

Ideally, they shouldn't. NA is missing data -- that is, we don't know the
value of the statistic because some data were not measured. That's why, for
example, NA & FALSE is FALSE, not NA, because the value of the expression is
known, no matter what the first operand is.

The results for min() and max() have the rationale that e.g.
max(a,max(b)) should return the same as max(a,b) even when b is empty.
There are even some examples where this is genuinely helpful.

If the others were to return a value I think NaN (undefined numerical
result) would be better than NA (missing data), as is the case with mean().
This would argue for changing the return value of quantile() as well.

However, I think it's reasonable for a function to refuse to calculate the
variance of no data. We do have try() to handle errors if needed.

     -thomas

Thomas Lumley                   Asst. Professor, Biostatistics
tlumley at u.washington.edu     University of Washington, Seattle

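
The three points in the reply above (NA & FALSE, nested max() over an empty vector, and try() for error handling) can be checked with a short sketch. The variable names are illustrative only, and the error is simulated with stop() rather than produced by any particular version of var().

```r
# NA & FALSE is FALSE: the result is known whatever the missing value is.
known <- NA & FALSE

# max() of an empty vector is -Inf (with a warning), so nesting max()
# calls gives the same answer as one flat call.
a <- c(3, 1, 2)
b <- numeric(0)
nested <- suppressWarnings(max(a, max(b)))
flat   <- max(a, b)

# try() lets the caller decide what an error should turn into,
# here NaN rather than NA.
z <- try(stop("`x' is empty"), silent = TRUE)
fallback <- if (inherits(z, "try-error")) NaN else z

print(list(known = known, same = identical(nested, flat), fallback = fallback))
```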
> The result of the call
>
>      x <- as.numeric(c(NA,NA,NA)); STATISTIC(x[!is.na(x)])
>
> depends on the STATISTIC.
>
>      STATISTIC           RESULT
>      min                 Inf and warning message
>      max                 -Inf and warning message
>      mean                NaN and no warning message
>      quantile            named vector containing NAs and no warning message
>      sd                  abortion of the evaluation with an error message

OK, now I see your problem. R 1.2.3 gives the following for your example:

     > sd(x[!is.na(x)])
     Error in var(x, na.rm = na.rm) : `x' is empty

     platform sparc-sun-solaris2.7

but this seems to be fixed in the current development version (the
forthcoming R 1.3.0)

     > sd(x[!is.na(x)])
     [1] NA

I'm not sure, what difference in the code is responsible for this but I
hope this helps. My respective systems are given below.

Achim

---------------------------
Achim Zeileis
Institut für Statistik
Technische Universität Wien

     platform sparc-sun-solaris2.7
     arch     sparc
     os       solaris2.7
     system   sparc, solaris2.7
     status
     major    1
     minor    2.3
     year     2001
     month    04
     day      26
     language R

     platform i686-pc-linux-gnu
     arch     i686
     os       linux-gnu
     system   i686, linux-gnu
     status   Under development (unstable)
     major    1
     minor    3.0
     year     2001
     month    03
     day      20
     language R