Georgi Boshnakov
2023-Jun-01 22:00 UTC
[Rd] bug in na.contiguous? Doesn't give the first tied stretch if it is at the start
Hi.

The description of na.contiguous says:

"Find the longest consecutive stretch of non-missing values in a time series object. (In the event of a tie, the first such stretch.)"

But this seems not to be the case if one of the tied longest stretches is at the start of the sequence/series. In the following example, there are three stretches of length 3, so I expect the result to be [1 2 3]. But:

> x <- c(1:3, NA, NA, 6:8, NA, 10:12)
> x
[1] 1 2 3 NA NA 6 7 8 NA 10 11 12
> na.contiguous(x)
[1] 6 7 8
## expected: [1] 1 2 3

(I have stripped attributes from the output for clarity.)

Below is the beginning of stats:::na.contiguous.default. The source of the issue appears to be the line containing the assignment to 'seg' (marked with exclamation marks). The calculation leading to it does

cumsum(!good)

where !good in this case is

[1] FALSE FALSE FALSE TRUE TRUE FALSE FALSE FALSE TRUE FALSE FALSE FALSE

And its cumsum is:

[1] 0 0 0 1 2 2 2 2 3 3 3 3

Then the assignment to 'seg' below picks the first longest stretch of constant cumsum (subtracting 1 to turn its position in 'ln' into the corresponding cumsum value). The cumsum stays constant at indices corresponding to FALSE, but the length of each constant stretch is one more than the number of FALSEs, because it also counts the TRUE that starts it, ... except for the stretch at the start of the series, which is not preceded by TRUE! So it is missed.

One way to patch this could be the two commented assignments I have added to the code below: prepend a 0 to 'tt', and then drop the first element of 'keep' to allow correct indexing later.

Georgi Boshnakov

> stats:::na.contiguous.default
function (object, ...)
{
    tm <- time(object)
    xfreq <- frequency(object)
    if (is.matrix(object))
        good <- apply(!is.na(object), 1L, all)
    else good <- !is.na(object)
    if (!sum(good))
        stop("all times contain an NA")
    tt <- cumsum(!good)
    ## tt <- c(0, tt)
    ln <- sapply(0:max(tt), function(i) sum(tt == i))
    seg <- (seq_along(ln)[ln == max(ln)])[1L] - 1 ## !!!
    keep <- (tt == seg)
    ## keep <- keep[-1]
    st <- min(which(keep))
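The intermediate values described above can be reproduced outside the function. The following is only a sketch for a plain vector: it copies the relevant assignments from the listing and skips the time-series and matrix handling.

```r
# Reproduce the intermediate values for the example vector.
x <- c(1:3, NA, NA, 6:8, NA, 10:12)
good <- !is.na(x)

tt <- cumsum(!good)
tt
# [1] 0 0 0 1 2 2 2 2 3 3 3 3

# Number of positions at which the cumsum equals 0, 1, 2, 3.
ln <- sapply(0:max(tt), function(i) sum(tt == i))
ln
# [1] 3 1 4 4

# The initial stretch is counted as 3, the two later stretches as 4
# (3 non-missing values plus the NA that starts each of them),
# so the initial stretch loses the tie.
seg <- (seq_along(ln)[ln == max(ln)])[1L] - 1
seg
# [1] 2

keep <- (tt == seg)
which(keep)
# [1] 5 6 7 8
```

Position 5 is the NA that opens the selected stretch, which is why the counts for the later stretches are one larger than the number of non-missing values they contain.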
Martin Maechler
2023-Jun-02 08:38 UTC
[Rd] bug in na.contiguous? Doesn't give the first tied stretch if it is at the start
>>>>> Georgi Boshnakov
>>>>>     on Thu, 1 Jun 2023 22:00:39 +0000 writes:

    > [...]

Thanks a lot, Georgi, for raising this.

I think you are right:

1) this is a bug (in the R code base since the beginning; na.contiguous was added to R in 1999)
2) your proposal is a good solution

I've started to prepare a commit to fix it (but will not rush it, so more comments are welcome!).

Martin
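For readers who want to try the proposed patch, here is a minimal sketch for plain vectors only. The helper name `first_longest_run` is hypothetical and not part of stats, and the last step (`which(keep & good)`) is this sketch's own way of finishing the calculation, since the thread shows only the beginning of the function. It is not the committed fix.

```r
# Hypothetical helper: indices of the first longest stretch of non-missing
# values, using the two assignments proposed in the thread.
first_longest_run <- function(x) {
  good <- !is.na(x)
  if (!sum(good)) stop("all times contain an NA")

  # Proposed change 1: prepend a 0 so that the initial stretch is also
  # counted as "number of non-missing values + 1", like all later stretches.
  tt <- c(0, cumsum(!good))
  ln <- sapply(0:max(tt), function(i) sum(tt == i))
  seg <- (seq_along(ln)[ln == max(ln)])[1L] - 1

  # Proposed change 2: drop the first element so that 'keep' lines up
  # with the positions of x again.
  keep <- (tt == seg)[-1]

  # Assumption of this sketch: drop the NA that opens a non-initial stretch.
  which(keep & good)
}

x <- c(1:3, NA, NA, 6:8, NA, 10:12)
x[first_longest_run(x)]
# [1] 1 2 3
```

A stretch that is strictly longer than the others is still selected wherever it occurs, for example `first_longest_run(c(NA, 1, 2, NA, 4, 5, 6))` gives positions 5, 6 and 7.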