Martin Maechler
2019-May-08 09:46 UTC
[Rd] [R] approx with NAs --> new argument 'na.rm=TRUE' ?!
>>>>> Robert Almgren
>>>>>     on Fri, 3 May 2019 15:45:44 -0400 writes:

[ __ to R-help __ -- here diverted to R-devel on purpose]

> There is something I do not think is right in the approx() function in
> base R, with method="constant" and in the presence of NA values.  I
> have 3.6.0, but the behavior seems to be the same in earlier versions.

(of course; the behavior has been unchanged, and "as documented", forever)

> My suggested fix is to add an "na.rm" argument to approx(), as in
> mean().  If this argument is FALSE, then NA values should be propagated
> into the output rather than being removed.

> Details:  The documentation says

>   "f: for method = "constant" a number between 0 and 1 inclusive,
>   indicating a compromise between left- and right-continuous step
>   functions.  If y0 and y1 are the values to the left and right of the
>   point then the value is y0 if f == 0, y1 if f == 1, and
>   y0*(1-f)+y1*f for intermediate values.  In this way the result is
>   right-continuous for f == 0 and left-continuous for f == 1, even for
>   non-finite y values."

> This suggests to me that if the left value y0 is NA, and if f=0 (the
> default), then the interpolated value should be NA, regardless of the
> right value y1 (see bug 15655, fixed in 2014).

> The documentation further says, under "Details":

>   "The inputs can contain missing values which are deleted."

> The question is what the appropriate behavior is when one of the input
> y values is NA.  Currently, approx() seems to interpret NA values as
> faulty data points, which should be deleted and the previous values
> carried forward (example below).

Well, "appropriate behavior" does not depend on just what you want in a
given case: approx() had been designed (for S, ca. 40 years ago) and well
documented to do what it does now, so there's really no bug ...
... but read on.

> But in many applications, especially with "constant" interpolation, an
> NA value is intended to mean that we really do not know the value in
> the next interval, or explicitly that there is no value.  Therefore the
> NA should not be removed, but should be propagated forward into the
> output within the corresponding interval.

> The situation is similar with functions like mean().  The presence of
> an NA value may mean either (a) we want to compute the mean without
> that value (na.rm=TRUE), or (b) we really are missing important
> information, we cannot determine the mean, and we should return NA
> (na.rm=FALSE).

> Therefore, I propose that approx() also be given an na.rm argument,
> indicating whether we wish to delete NA values, or treat them as actual
> values on the corresponding interval.  That option makes even more
> sense for approx() than for mean(), since the NA values apply only on
> small regions of the data range.

> --Robert Almgren

I agree mostly with your thoughts above: in some cases/applications, it
would be useful and even "appropriate" to be able to use 'na.rm=FALSE'
and have "NA carried forward".

What we should *not* do, I think, is change the default behavior, even
though 'na.rm=FALSE' *is* the default in many other R functions,
including your example mean().

Usually, we would have asked you here to file a *wishlist* report (as
opposed to a *bug* report) on R's bugzilla ( https://bugs.r-project.org/ )
... but I have had some time and interest, so I've now spent a bit of time
(> 2 hrs) digging and trying, notably also in the underlying C code, and
it seems your "wishlist item" should be fulfillable without too much more
effort.  My change also works correspondingly for the default
method = "linear".
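For concreteness, here is a minimal user-level sketch of the effect I
have in mind for method = "constant" (with f = 0) -- an illustration
only, *not* the committed change, which lives in the C code; the helper
name approxNA() is made up, and it assumes x is sorted and NA-free:

    approxNA <- function(x, y, xout) {
        ok  <- !is.na(y)
        res <- approx(x[ok], y[ok], xout, method = "constant", rule = 2)
        ## re-insert NA on each interval whose left endpoint has missing y:
        i <- findInterval(xout, x)          # 0 <=> xout left of x[1]
        res$y[i >= 1 & is.na(y[pmax(i, 1)])] <- NA
        res
    }
    approxNA(1:5, c(1, NA, 3, NA, 5), (1:4) + 0.5)$y   ## 1 NA 3 NA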
Actually, if we think a bit longer about this (and broaden our horizon),
we note that approx() / approxfun() really are about degree 0
("constant") and degree 1 ("linear") interpolation splines, and we should
eventually also think about what 'na.rm=FALSE' should/would mean for the
degree 3 interpolation splines provided by spline() and splinefun().

Also, if you look at the 'rule' argument, it defines how *extrapolation*
should happen: the default rule=1 gives NA, whereas rule=2 uses
"constant extrapolation", even for method="linear".

Now, I'd argue the following --- assume no NA's in x[] and (x,y) ordered
such that x[] is non-decreasing.  Assume one y is missing, say y[k], and
we don't want to just drop the pair (x[k], y[k]).  This is then in some
sense equivalent to having _two_ separate sets of interpolation points:

 1) (x[i], y[i]),  i = 1..(k-1)
 2) (x[i], y[i]),  i = (k+1)..n

and really one could argue that one should use regular interpolation in
both groups, i.e., both x-intervals.  Then, for x \in [x_{k-1}, x_{k+1}],
i.e. between (hence "outside") both x-intervals, one should use
extrapolation, where 'rule' should then play a role.  But should we
extrapolate from the left interval, from the right interval, or take the
mean of the two?  For the default rule=1 that would not matter: we'd give
NA in any case, and so my current code (rule=1) would do the right thing.
And rule = c(1,2) or c(2,1) would also result in NA (if you take the mean
of left and right) --- but what for rule = 2 ?  (A small sketch at the
end of this message makes the question concrete.)

After some more experiments (*), I plan to commit my current version of
this to R-devel, so you and others can look at it and suggest (complete!)
patches if desired.  This would only be in a *released* R version x.y.0
in ca. April 2020.

------------------------------------------------------------------------
*) E.g., what should happen with na.rm=FALSE and NA's in x[] ?
   Currently (in my version):

> x2 <- c(1,NA,3:5)
> approx(x2, x2, na.rm=FALSE) ## -->
Error in if (!ordered && is.unsorted(x, na.rm = na.rm)) { :
  missing value where TRUE/FALSE needed
> approx(x2, x2, method="constant", na.rm=FALSE)
Error in if (!ordered && is.unsorted(x, na.rm = na.rm)) { :
  missing value where TRUE/FALSE needed
>

I think an error is "fine" here, but one could also argue that approx()
"should" be more helpful and remove (x,y) pairs with missing values in
x[] in all cases.
------------------------------------------------------------------------

Martin Maechler
ETH Zurich & R Core team

> Example:
> : R --vanilla
> R version 3.6.0 (2019-04-26) -- "Planting of a Tree"
> Copyright (C) 2019 The R Foundation for Statistical Computing
> Platform: x86_64-apple-darwin15.6.0 (64-bit)
> ...
>> t1 <- 1:5
>> x1 <- c( 1, as.numeric(NA), 3, as.numeric(NA), 5 )
>> print(data.frame(t1,x1))
>   t1 x1
> 1  1  1
> 2  2 NA    <-- we do not know the value between t=2 and t=3
> 3  3  3
> 4  4 NA    <-- we do not know the value between t=4 and t=5
> 5  5  5
>> X <- approx( t1, x1, (1:4) + 0.5, method='constant', rule=c(1,2) )
>> print(data.frame(X))
>     x y
> 1 1.5 1
> 2 2.5 1    <---- I believe that these two values should be NA
> 3 3.5 3
> 4 4.5 3    <---- I believe that these two values should be NA

> --
> Quantitative Brokers     http://www.quantitativebrokers.com
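PS: to make the "two groups" question above concrete, a small sketch
(using the documented 'rule' semantics; the data here are arbitrary):

    x <- 1:5 ;  y <- c(1, 2, NA, 4, 5)      # y[3] is missing
    ## extrapolate into the gap at x = 3 from each side;
    ## rule = 2 means constant extrapolation beyond each group:
    yL <- approx(x[1:2], y[1:2], xout = 3, rule = 2)$y  # from the left : 2
    yR <- approx(x[4:5], y[4:5], xout = 3, rule = 2)$y  # from the right: 4
    c(left = yL, right = yR, mean = (yL + yR)/2)        # 2, 4, 3
    ## with rule = 1, both sides give NA -- and hence so does their mean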
Robert Almgren
2019-May-08 16:36 UTC
[Rd] [R] approx with NAs --> new argument 'na.rm=TRUE' ?!
On May 8, 2019, at 05:46 EDT, Martin Maechler wrote:

> [ __ to R-help __ -- here diverted to R-devel on purpose]

Thank you very much.  I posted it to R-help because I was not sure I
would be able to post to R-devel (will try it here).

> What we should *not* do, I think, is change the default behavior, even
> though 'na.rm=FALSE' *is* the default in many other R functions,
> including your example mean().

I completely agree that the default behavior should not change.

> ... it seems your "wishlist item" should be fulfillable without too
> much more effort.

Thank you so much for investing that time and effort.

> ... we should eventually also think about what 'na.rm=FALSE'
> should/would mean for the degree 3 interpolation splines provided by
> spline() and splinefun().

I think that is somewhat less clear.  The advantage of constant and
linear interpolation is that the interpolation is local: if one value is
NA, the interpolation function can be NA on only the small interval
affected by that value.  For splines (at least if one requires a
continuous second derivative), the model is not local -- a small
numerical check of this appears at the end of this message.  I think
spline interpolation is a case where it might make sense to simply
delete the points with NA values.

> Assume one y missing, say y[k], and we don't want to just drop the
> (x[k],y[k]).  This then is in some sense equivalent to having _two_
> separate sets of interpolation points:
>  1) (x[i], y[i]),  i = 1..(k-1)
>  2) (x[i], y[i]),  i = (k+1)..n

I am not sure that interpreting missing values as "internal endpoints"
is necessarily the best way to look at it.  For example, one might want
to use different values of the "rule" parameter for internal endpoints
than for the "real" endpoints.  And as you point out, a gap in the
middle is off the right end of the left set and off the left end of the
right set, so which value of "rule" should be used?  Also, spline
interpolation often does special things at the endpoints, like imposing
a zero second derivative; that might not be what you want at an internal
gap.

> *) E.g., what should happen with na.rm=FALSE and NA's in x[] ?

NA's in x[] is a really tricky question, and I don't think there is a
good answer.  An NA in y[] with x[] not NA says: "The y value at this
particular location is missing or undefined, so let's mark as undefined
only the interpolated values that depend explicitly on the missing
value."  An NA in x[], especially if the corresponding y[] is not NA,
says: "There is an additional data point, but we have no idea where it
is -- so any of our interpolated values, anywhere, might be wrong."  I
think the only appropriate behavior is either (a) to remove points for
which x[] is NA (if na.rm=TRUE), or (b) to set all interpolated values
to NA if any x[] is NA (if na.rm=FALSE).  One could also throw an error,
which is sort of the same as (b) but can get annoying for large data
sets.

Handling missing values in general is a hard problem, with more aspects
than can be built into an interpolation function.  With na.rm=TRUE, the
mean() function assumes that missing values have the same mean as the
rest of the data set; the sum() function assumes that missing values are
zero (so they do not contribute to the sum).  The current behavior of
the interpolation functions -- the proposed default na.rm=TRUE -- in
effect assumes that the missing values are whatever is consistent with
the neighboring values.  With na.rm=FALSE, the functions would keep the
points but set only the local interpolated values to NA.
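As a small numerical check of the locality point above (an illustrative
sketch only; the data and the tolerance are arbitrary):

    ## drop one interior data point and see where the interpolant changes:
    x <- 1:6 ;  y <- sin(x) ;  xout <- seq(1.25, 5.75, by = 0.5)
    keep <- -3                              # remove the point (x[3], y[3])
    d.lin <- approx(x, y, xout)$y - approx(x[keep], y[keep], xout)$y
    d.spl <- spline(x, y, xout = xout)$y -
             spline(x[keep], y[keep], xout = xout)$y
    which(abs(d.lin) > 1e-12)  # 3 4 5 6 : only xout between x = 2 and x = 4
    which(abs(d.spl) > 1e-12)  # 1 .. 10 : the cubic spline changes everywhere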
--Robert Almgren

--
Quantitative Brokers     http://www.quantitativebrokers.com
Martin Maechler
2019-May-10 16:24 UTC
[Rd] [R] approx with NAs --> new argument 'na.rm=TRUE' ?!
I have now committed a version "fulfilling" your wish, partly at least,
to R-devel.  In the new approx(*, na.rm=FALSE) cases, how NA's are
treated does depend on the 4 different extrapolation rules
{1, 2, 1:2, 2:1}.  The main reason is that I kept the low-level C code
doing +- what it did before, which automatically used 'rule' to
determine these results.  The help file contains nice examples; here are
some of its results --- comments are very welcome:

### Treatment of 'NA's -- kept if na.rm=FALSE :

> xn <- 1:4
> yn <- c(1,NA,3:4)
> xout <- (1:9)/2
> ## Default behavior (na.rm = TRUE): NA's omitted; extrapolation gives NA:
> data.frame(approx(xn, yn, xout))
    x   y
1 0.5  NA
2 1.0 1.0
3 1.5 1.5
4 2.0 2.0
5 2.5 2.5
6 3.0 3.0
7 3.5 3.5
8 4.0 4.0
9 4.5  NA
> data.frame(approx(xn, yn, xout, rule = 2))  # -> *constant* extrapolation
    x   y
1 0.5 1.0
2 1.0 1.0
3 1.5 1.5
4 2.0 2.0
5 2.5 2.5
6 3.0 3.0
7 3.5 3.5
8 4.0 4.0
9 4.5 4.0
> ## New (2019-2020)  na.rm = FALSE: NA's are "kept"
> data.frame(approx(xn, yn, xout, na.rm=FALSE, rule = 2))
    x   y
1 0.5 1.0
2 1.0 1.0
3 1.5  NA
4 2.0  NA
5 2.5  NA
6 3.0 3.0
7 3.5 3.5
8 4.0 4.0
9 4.5 4.0
> data.frame(approx(xn, yn, xout, na.rm=FALSE, rule = 2, method="constant"))
    x  y
1 0.5  1
2 1.0  1
3 1.5  1
4 2.0 NA
5 2.5 NA
6 3.0  3
7 3.5  3
8 4.0  4
9 4.5  4

> ## NA's in x[] are not allowed:
> stopifnot(inherits( try( approx(yn, yn, na.rm=FALSE) ), "try-error"))
Error in approx(yn, yn, na.rm = FALSE) :
  approx(x,y, .., na.rm=FALSE): NA values in x are not allowed

> ## Give a nice overview of all possibilities  rule * method * na.rm :
> ## -----------------------------------------  ====   ======   =====
> ## extrapolations "N" := NA;  "C" := Constant :
> rules <- list(N=1, C=2, NC=1:2, CN=2:1)
> methods <- c("constant", "linear")
> ry <- sapply(rules, function(R) {
+     sapply(methods, function(M)
+         sapply(setNames(, c(TRUE,FALSE)), function(na.)
+             approx(xn, yn, xout=xout, method=M, rule=R, na.rm=na.)$y),
+         simplify="array")
+ }, simplify="array")
> names(dimnames(ry)) <- c("x = ", "na.rm", "method", "rule")
> dimnames(ry)[[1]] <- dn1 <- format(xout)
> ftable(aperm(ry, 4:1))  # --> (4 * 2 * 2) x length(xout)  =  16 x 9 matrix
                    x =  0.5 1.0 1.5 2.0 2.5 3.0 3.5 4.0 4.5
rule method   na.rm
N    constant TRUE        NA 1.0 1.0 1.0 1.0 3.0 3.0 4.0  NA
              FALSE       NA 1.0 1.0  NA  NA 3.0 3.0 4.0  NA
     linear   TRUE        NA 1.0 1.5 2.0 2.5 3.0 3.5 4.0  NA
              FALSE       NA 1.0  NA  NA  NA 3.0 3.5 4.0  NA
C    constant TRUE       1.0 1.0 1.0 1.0 1.0 3.0 3.0 4.0 4.0
              FALSE      1.0 1.0 1.0  NA  NA 3.0 3.0 4.0 4.0
     linear   TRUE       1.0 1.0 1.5 2.0 2.5 3.0 3.5 4.0 4.0
              FALSE      1.0 1.0  NA  NA  NA 3.0 3.5 4.0 4.0
NC   constant TRUE        NA 1.0 1.0 1.0 1.0 3.0 3.0 4.0 4.0
              FALSE       NA 1.0 1.0  NA  NA 3.0 3.0 4.0 4.0
     linear   TRUE        NA 1.0 1.5 2.0 2.5 3.0 3.5 4.0 4.0
              FALSE       NA 1.0  NA  NA  NA 3.0 3.5 4.0 4.0
CN   constant TRUE       1.0 1.0 1.0 1.0 1.0 3.0 3.0 4.0  NA
              FALSE      1.0 1.0 1.0  NA  NA 3.0 3.0 4.0  NA
     linear   TRUE       1.0 1.0 1.5 2.0 2.5 3.0 3.5 4.0  NA
              FALSE      1.0 1.0  NA  NA  NA 3.0 3.5 4.0  NA
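The same NA handling carries over to the function returned by
approxfun() -- shown here assuming an R-devel build that contains the
change:

> f <- approxfun(xn, yn, rule = 2, na.rm = FALSE)
> f(c(1.5, 2, 2.5))  ## NA NA NA : the gap around the missing y[2] is kept
> f(c(0.5, 4.5))     ## 1  4     : constant extrapolation, as with approx()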
Robert Almgren
2019-May-10 22:07 UTC
[Rd] [R] approx with NAs --> new argument 'na.rm=TRUE' ?!
On May 10, 2019, at 12:24 EDT, Martin Maechler wrote:

> I have now committed a version "fulfilling" your wish, partly at
> least, to R-devel.

That is great!  Thank you so much.  Let me pull it and try it out.

--Robert Almgren

--
Quantitative Brokers     http://www.quantitativebrokers.com