There is something I do not think is right in the approx() function in base R, with method="constant" and in the presence of NA values. I have 3.6.0, but the behavior seems to be the same in earlier versions.

My suggested fix is to add an "na.rm" argument to approx(), as in mean(). If this argument is FALSE, then NA values should be propagated into the output rather than being removed.

Details:

The documentation says

"f: for method = "constant" a number between 0 and 1 inclusive, indicating a compromise between left- and right-continuous step functions. If y0 and y1 are the values to the left and right of the point then the value is y0 if f == 0, y1 if f == 1, and y0*(1-f)+y1*f for intermediate values. In this way the result is right-continuous for f == 0 and left-continuous for f == 1, even for non-finite y values."

This suggests to me that if the left value y0 is NA, and if f=0 (the default), then the interpolated value should be NA. (Regardless of the right value y1, see bug 15655 fixed in 2014.)

The documentation further says, below under "Details", that

"The inputs can contain missing values which are deleted."

The question is what is the appropriate behavior if one of the input values y is NA. Currently, approx() seems to interpret NA values as faulty data points, which should be deleted and the previous values carried forward (example below).

But in many applications, especially with "constant" interpolation, an NA value is intended to mean that we really do not know the value in the next interval, or explicitly that there is no value. Therefore the NA should not be removed, but should be propagated forward into the output within the corresponding interval.

The situation is similar with functions like mean(). The presence of an NA value may mean either (a) we want to compute the mean without that value (na.rm=TRUE), or (b) we really are missing important information, we cannot determine the mean, and we should return NA (na.rm=FALSE).
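The documented role of f quoted above can be checked on data without any NA values. This is a minimal sketch; the vectors xs and ys are made up for illustration and are not part of the original report.

```r
# Illustrative, NA-free data (hypothetical values)
xs <- c(1, 2, 3)
ys <- c(10, 20, 30)

# Step-function value at x = 1.5, between the knots (1, 10) and (2, 20)
y_f0  <- approx(xs, ys, xout = 1.5, method = "constant", f = 0)$y    # left value y0
y_f1  <- approx(xs, ys, xout = 1.5, method = "constant", f = 1)$y    # right value y1
y_f05 <- approx(xs, ys, xout = 1.5, method = "constant", f = 0.5)$y  # y0*(1-f) + y1*f

print(c(f0 = y_f0, f1 = y_f1, f05 = y_f05))  # 10, 20, 15
```

With f = 0 the value to the left is used, which is why an NA at the left knot would be expected to give NA on the whole interval if NA values were kept.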
Therefore, I propose that approx() also be given an na.rm argument, indicating whether we wish to delete NA values, or treat them as actual values on the corresponding interval. That option makes even more sense for approx() than for mean(), since the NA values apply only on small regions of the data range.

--Robert Almgren

Example:

: R --vanilla
R version 3.6.0 (2019-04-26) -- "Planting of a Tree"
Copyright (C) 2019 The R Foundation for Statistical Computing
Platform: x86_64-apple-darwin15.6.0 (64-bit)
...
> t1 <- 1:5
> x1 <- c( 1, as.numeric(NA), 3, as.numeric(NA), 5 )
> print(data.frame(t1,x1))
  t1 x1
1  1  1
2  2 NA   <-- we do not know the value between t=2 and t=3
3  3  3
4  4 NA   <-- we do not know the value between t=4 and t=5
5  5  5
> X <- approx( t1, x1, (1:4) + 0.5, method='constant', rule=c(1,2) )
> print(data.frame(X))
    x y
1 1.5 1
2 2.5 1   <---- I believe that these two values should be NA
3 3.5 3
4 4.5 3   <---- I believe that these two values should be NA

--
Quantitative Brokers http://www.quantitativebrokers.com
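The behaviour requested above can be emulated for the right-continuous case (f = 0) with findInterval(). This is only a sketch: step_keep_na is a hypothetical helper name, it assumes x is sorted and free of NA, and it does not honour the rule argument (it returns NA to the left of the first knot and carries the last y value to the right of the last knot).

```r
# Hypothetical helper: right-continuous step interpolation (the f = 0 case)
# that keeps NA values in y instead of deleting them.
# Assumes x is sorted in increasing order and contains no NA.
step_keep_na <- function(x, y, xout) {
  idx <- findInterval(xout, x)    # index of the knot at or to the left of xout
  idx[idx == 0L] <- NA_integer_   # xout left of all knots: no left neighbour
  y[idx]                          # an NA in y is carried into the output
}

t1 <- 1:5
x1 <- c(1, NA, 3, NA, 5)
xout <- (1:4) + 0.5

# Documented behaviour of approx(): the NA points are deleted first
current <- approx(t1, x1, xout, method = "constant", rule = c(1, 2))$y
# Emulated behaviour: NA applies on the intervals starting at t = 2 and t = 4
wanted <- step_keep_na(t1, x1, xout)

print(data.frame(x = xout, current = current, wanted = wanted))
```

The current column reproduces the output shown in the example (1, 1, 3, 3), while the wanted column has NA at x = 2.5 and x = 4.5.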
Dear Robert,

this is really not asking for help about R but rather wishing for new features of a (very long) existing R function. Hence this is a topic for the 'R-devel' mailing list (https://stat.ethz.ch/mailman/listinfo/R-devel ) rather than 'R-help'; see also https://www.r-project.org/mail.html on what the different lists are aimed at.

==> I will do a long reply to this post but divert it to R-devel (and will CC you at least in the first reply).

--> Further follow up to this: Please on 'R-devel'
Martin Maechler
2019-May-08 09:46 UTC
[Rd] [R] approx with NAs --> new argument 'na.rm=TRUE' ?!
>>>>> Robert Almgren
>>>>>     on Fri, 3 May 2019 15:45:44 -0400 writes:

[ __ to R-help __ -- here diverted to R-devel on purpose]

> There is something I do not think is right in the approx()
> function in base R, with method="constant" and in the
> presence of NA values. I have 3.6.0, but the behavior
> seems to be the same in earlier versions.

(of course; the behavior has been unchanged, and "as documented" forever)

> My suggested fix is to add an "na.rm" argument to approx(), as in mean(). If this argument is FALSE, then NA values should be propagated into the output rather than being removed.

> Details:

> The documentation says

> "f: for method = "constant" a number between 0 and 1 inclusive, indicating a compromise between left- and right-continuous step functions. If y0 and y1 are the values to the left and right of the point then the value is y0 if f == 0, y1 if f == 1, and y0*(1-f)+y1*f for intermediate values. In this way the result is right-continuous for f == 0 and left-continuous for f == 1, even for non-finite y values."

> This suggests to me that if the left value y0 is NA, and if f=0 (the default), then the interpolated value should be NA. (Regardless of the right value y1, see bug 15655 fixed in 2014.)

> The documentation further says, below under "Details", that

> "The inputs can contain missing values which are deleted."

> The question is what is the appropriate behavior if one of the input values y is NA. Currently, approx() seems to interpret NA values as faulty data points, which should be deleted and the previous values carried forward (example below).

Well, "appropriate behavior" does not depend on just what you want in a given case: approx() had been designed (for S, ca 40 years ago) and well documented to do what it does now, so there's really no bug ... ...
but read on

> But in many applications, especially with "constant" interpolation, an NA value is intended to mean that we really do not know the value in the next interval, or explicitly that there is no value. Therefore the NA should not be removed, but should be propagated forward into the output within the corresponding interval.

> The situation is similar with functions like mean(). The presence of an NA value may mean either (a) we want to compute the mean without that value (na.rm=TRUE), or (b) we really are missing important information, we cannot determine the mean, and we should return NA (na.rm=FALSE).

> Therefore, I propose that approx() also be given an na.rm argument, indicating whether we wish to delete NA values, or treat them as actual values on the corresponding interval. That option makes even more sense for approx() than for mean(), since the NA values apply only on small regions of the data range.

> --Robert Almgren

I agree mostly with your thoughts above: In some cases/applications, it would be useful and even "appropriate" to be able to use 'na.rm=FALSE' and have "NA carried forward".

What we should *not* do, I think, is to change the default behavior, even though 'na.rm=FALSE' *is* the default in many other R functions, including your example mean().

Usually, we would have asked you here to now file a *wishlist* report (as opposed to a *bug* report) on R's bugzilla ( https://bugs.r-project.org/ ) ... but I have had some time and interest, so I've now spent a bit of time (> 2 hrs) digging and trying, notably also in the underlying C code, and it seems your "wishlist item" should be fulfillable without too much more effort. My change also works correspondingly for the default method = "linear".
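For method = "linear", one plausible reading of "NA carried forward" is that every interval with a missing endpoint becomes NA. The sketch below illustrates that reading in plain R. It is an assumption about the intended semantics, not the C-level change described above; lin_keep_na is a hypothetical name, and the code assumes x is sorted, free of NA, and without ties.

```r
# Hypothetical helper: linear interpolation that keeps NA values in y.
lin_keep_na <- function(x, y, xout) {
  n <- length(x)
  i <- findInterval(xout, x, rightmost.closed = TRUE)  # left knot of the interval
  i[i == 0L | i == n] <- NA_integer_                   # outside [x[1], x[n]]: NA, as with rule = 1
  w <- (xout - x[i]) / (x[i + 1L] - x[i])              # relative position inside the interval
  y[i] * (1 - w) + y[i + 1L] * w                       # an NA at either endpoint gives NA
}

x <- 1:5
y <- c(1, NA, 3, 4, 5)
res <- lin_keep_na(x, y, c(1.5, 2.5, 3.5, 4.5))
print(res)  # NA NA 3.5 4.5
```

Both intervals that touch the missing point at x = 2 come out as NA, whereas the intervals between known values are interpolated as usual.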
Actually, if we think a bit longer about this (and broaden our horizon), we note that approx / approxfun really are about degree 0 ("constant") and degree 1 interpolation splines, and we should eventually also think about what 'na.rm=FALSE' should/would mean for the degree 3 interpolation splines provided by spline() and splinefun().

Also, if you look at the 'rule' argument, it defines how *extrapolation* should happen: the default, rule=1, gives NA, whereas rule=2 uses "constant extrapolation" even for method="linear".

Now, I'd argue the following --- assume no NA's in x[] and (x,y) ordered such that x[] is non-decreasing:

Assume one y missing, say y[k], and we don't want to just drop the (x[k],y[k]). This then is in some sense equivalent to having _two_ separate sets of interpolation points:

  1) (x[i], y[i]), i = 1..(k-1)
  2) (x[i], y[i]), i = (k+1)..n

and really one could argue that one should use regular interpolation in both groups, i.e., both x-intervals. Then, for x \in [x_{k-1}, x_{k+1}], i.e. between (hence "outside") both x-intervals, one should use extrapolation, where then, 'rule' should play a role. But should we use extrapolation from the left or from the right interval or from the mean of the two?

For the default 'rule=1' that would not matter: we'd give NA in any case, and so my current code (rule=1) would do the right thing. And rule = c(1,2) or c(2,1) would also result in NA (if you take the mean of left and right), but what for rule = 2 ?

After some more experiments (*), I plan to commit my current version of this to R-devel, so you and others can look at it and suggest (complete!) patches if desired. This would only be in *released* R version x.y.0 in ca April 2020..

------------------------------------------------------------------------
*) E.g., what should happen with na.rm=FALSE and NA's in x[] ?
Currently (in my version):

> x2 <- c(1,NA,3:5)
> approx(x2, x2, na.rm=FALSE) ## -->
Error in if (!ordered && is.unsorted(x, na.rm = na.rm)) { :
  missing value where TRUE/FALSE needed
> approx(x2, x2, method="constant", na.rm=FALSE)
Error in if (!ordered && is.unsorted(x, na.rm = na.rm)) { :
  missing value where TRUE/FALSE needed
>

I think an error is "fine" here, but one could also think approx() "should" be more helpful here, and remove (x,y) pairs with missing values in x[] in all cases.

------------------------------------------------------------------------

Martin Maechler
ETH Zurich & R Core team
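The "two separate sets of interpolation points" argument in this message can be sketched in plain R, without relying on any na.rm argument. approx_split is a hypothetical helper, not the committed code: it assumes exactly one missing y[k] with at least two points on each side, x sorted and free of NA, and rule = 1 on both pieces, so that the gap between x[k-1] and x[k+1] comes out as NA.

```r
# Hypothetical helper: interpolate separately to the left and right of the
# single missing y[k]; with rule = 1 each piece gives NA outside its own range.
approx_split <- function(x, y, xout, method = "linear") {
  k <- which(is.na(y))
  n <- length(x)
  stopifnot(length(k) == 1L, k > 2L, k < n - 1L)  # two or more points on each side
  left  <- approx(x[1:(k - 1)], y[1:(k - 1)], xout, method = method, rule = 1)$y
  right <- approx(x[(k + 1):n], y[(k + 1):n], xout, method = method, rule = 1)$y
  ifelse(is.na(left), right, left)                # gap between the pieces stays NA
}

x <- 1:7
y <- c(1, 2, 3, NA, 5, 6, 7)
xout <- c(2.5, 3.5, 4.5, 5.5)

dropped <- approx(x, y, xout)$y       # NA point deleted: 2.5 3.5 4.5 5.5
gap_na  <- approx_split(x, y, xout)   # gap kept:         2.5  NA  NA 5.5

print(data.frame(x = xout, dropped = dropped, gap_na = gap_na))
```

What the gap should contain for rule = 2 is exactly the open question raised above; this sketch does not attempt to answer it.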