Serguei Sokol
2017-May-29 13:13 UTC
[Rd] stats::line() does not produce correct Tukey line when n mod 6 is 2 or 3
Here is an attached patch. Best, Serguei. Le 29/05/2017 ? 12:21, Serguei Sokol a ?crit :> The problem or actual R implementation relies on an assumption > that median(x[i] | x[i] <= quantile(x, 1/3)) == quantile(x, 1/6) > which reveals not to be true despite very trustful appearance. > > If we continue with the example of x=y=1:9 > then quantile(x, 1/6)=2.5 (here quantile() is taken in C-code sens, not R's one) > while median(y[i] | x[i] <= quantile(x, 1/3))=2 > On the other sample's side we've got 7.5 and 8 for x and y respectively. > Hence the slope (8-2)/(7.5-2.5)=1.2 > > To get a correct version of this, one should calculate x robust points in the same way as the y's, > i.e. xb=median(x[i] | x[i] <= quantile(x, 1/3)) and xt=median(x[i] | x[i] >= quantile(x, 2/3)) > > Best, > Serguei. > > Le 29/05/2017 ? 10:02, peter dalgaard a ?crit : >> A usually trustworthy R correspondent posted a pure R implementation on SO at some point in his lost youth: >> >> https://stackoverflow.com/questions/3224731/john-tukey-median-median-or-resistant-line-statistical-test-for-r-and-line >> >> This one does indeed generate the line of identity for the (1:9, 1:9) case, so I do suspect that we have a genuine scr*wup in line(). >> >> Notice, incidentally, that >> >>> line(1:9+rnorm(9,,1e-1),1:9+rnorm(9,,1e-1)) >> Call: >> line(1:9 + rnorm(9, , 0.1), 1:9 + rnorm(9, , 0.1)) >> >> Coefficients: >> [1] -0.9407 1.1948 >> >> I.e., it is not likely an issue with exact integers or perfect fit. >> >> -pd >> >> >> >>> On 29 May 2017, at 07:21 , GlenB <glnbrntt at gmail.com> wrote: >>> >>>> Tukey divides the points into three groups, not the x and y values >>> separately. >>> >>>> I'll try to get hold of the book for a direct quote, might take a couple >>> of days. >>> >>> Ah well, I can't get it for a week. But the fact that it's often called >>> Tukey's three group line (try a search on *tukey three group line* and >>> you'll get plenty of hits) is pretty much a giveaway. >>> >>> >>> On Mon, May 29, 2017 at 2:19 PM, GlenB <glnbrntt at gmail.com> wrote: >>> >>>> Tukey divides the points into three groups, not the x and y values >>>> separately. >>>> >>>> I'll try to get hold of the book for a direct quote, might take a couple >>>> of days. >>>> >>>> >>>> >>>> On Mon, May 29, 2017 at 8:40 AM, Duncan Murdoch <murdoch.duncan at gmail.com> >>>> wrote: >>>> >>>>> On 27/05/2017 9:28 PM, GlenB wrote: >>>>> >>>>>> Bug: stats::line() does not produce correct Tukey line when n mod 6 is 2 >>>>>> or >>>>>> 3 >>>>>> >>>>>> Example: line(1:9,1:9) should have intercept 0 and slope 1 but it gives >>>>>> intercept -1 and slope 1.2 >>>>>> >>>>>> Trying line(1:i,1:i) across a range of i makes it clear there's a cycle >>>>>> of >>>>>> length 6, with four of every six correct. >>>>>> >>>>>> Bug has been present across many versions. >>>>>> >>>>>> The machine I just tried it on just now has R3.2.3: >>>>>> >>>>> If you look at the source (in src/library/stats/src/line.c), the >>>>> explanation is clear: the x value is chosen as the 1/6 quantile (according >>>>> to a particular definition of quantile), and the y value is chosen as the >>>>> median of the y values where x is less than or equal to the 1/3 quantile. >>>>> Those are different definitions (though I think they would be >>>>> asymptotically equivalent under pretty weak assumptions), so it's not >>>>> surprising the x value doesn't correspond perfectly to the y value, and the >>>>> line ends up "wrong". >>>>> >>>>> So is it a bug? Well, that depends on Tukey's definition. I don't have >>>>> a copy of his book handy so I can't really say. Maybe the R function is >>>>> doing exactly what Tukey said it should, and that's not a bug. Or maybe R >>>>> is wrong. >>>>> >>>>> Duncan Murdoch >>>>> >>>>> >>> [[alternative HTML version deleted]] >>> >>> ______________________________________________ >>> R-devel at r-project.org mailing list >>> https://stat.ethz.ch/mailman/listinfo/r-devel > >-- Serguei Sokol Ingenieur de recherche INRA Metabolisme Integre et Dynamique des Systemes Metaboliques (MetaSys) LISBP, INSA/INRA UMR 792, INSA/CNRS UMR 5504 135 Avenue de Rangueil 31077 Toulouse Cedex 04 tel: +33 5 6155 9276 fax: +33 5 6704 8825 email: sokol at insa-toulouse.fr http://metasys.insa-toulouse.fr http://www.lisbp.fr -------------- next part -------------- A non-text attachment was scrubbed... Name: line.c.patch Type: text/x-patch Size: 2651 bytes Desc: not available URL: <https://stat.ethz.ch/pipermail/r-devel/attachments/20170529/1f1f2e6c/attachment.bin>
GlenB
2017-May-29 13:27 UTC
[Rd] stats::line() does not produce correct Tukey line when n mod 6 is 2 or 3
Incidentally (though I don't expect anyone will want to pursue it) Johnstone & Velleman give standard errors for the estimates in Johnstone, Iain M., and Paul F. Velleman. ?The Resistant Line and Related Regression Methods.? Journal of the American Statistical Association, vol. 80, no. 392, 1985, pp. 1041?1054. On Mon, May 29, 2017 at 11:13 PM, Serguei Sokol <sokol at insa-toulouse.fr> wrote:> Here is an attached patch. > > Best, > Serguei. > > > Le 29/05/2017 ? 12:21, Serguei Sokol a ?crit : > >> The problem or actual R implementation relies on an assumption >> that median(x[i] | x[i] <= quantile(x, 1/3)) == quantile(x, 1/6) >> which reveals not to be true despite very trustful appearance. >> >> If we continue with the example of x=y=1:9 >> then quantile(x, 1/6)=2.5 (here quantile() is taken in C-code sens, not >> R's one) >> while median(y[i] | x[i] <= quantile(x, 1/3))=2 >> On the other sample's side we've got 7.5 and 8 for x and y respectively. >> Hence the slope (8-2)/(7.5-2.5)=1.2 >> >> To get a correct version of this, one should calculate x robust points in >> the same way as the y's, >> i.e. xb=median(x[i] | x[i] <= quantile(x, 1/3)) and xt=median(x[i] | x[i] >> >= quantile(x, 2/3)) >> >> Best, >> Serguei. >> >> Le 29/05/2017 ? 10:02, peter dalgaard a ?crit : >> >>> A usually trustworthy R correspondent posted a pure R implementation on >>> SO at some point in his lost youth: >>> >>> https://stackoverflow.com/questions/3224731/john-tukey-media >>> n-median-or-resistant-line-statistical-test-for-r-and-line >>> >>> This one does indeed generate the line of identity for the (1:9, 1:9) >>> case, so I do suspect that we have a genuine scr*wup in line(). >>> >>> Notice, incidentally, that >>> >>> line(1:9+rnorm(9,,1e-1),1:9+rnorm(9,,1e-1)) >>>> >>> Call: >>> line(1:9 + rnorm(9, , 0.1), 1:9 + rnorm(9, , 0.1)) >>> >>> Coefficients: >>> [1] -0.9407 1.1948 >>> >>> I.e., it is not likely an issue with exact integers or perfect fit. >>> >>> -pd >>> >>> >>> >>> On 29 May 2017, at 07:21 , GlenB <glnbrntt at gmail.com> wrote: >>>> >>>> Tukey divides the points into three groups, not the x and y values >>>>> >>>> separately. >>>> >>>> I'll try to get hold of the book for a direct quote, might take a couple >>>>> >>>> of days. >>>> >>>> Ah well, I can't get it for a week. But the fact that it's often called >>>> Tukey's three group line (try a search on *tukey three group line* and >>>> you'll get plenty of hits) is pretty much a giveaway. >>>> >>>> >>>> On Mon, May 29, 2017 at 2:19 PM, GlenB <glnbrntt at gmail.com> wrote: >>>> >>>> Tukey divides the points into three groups, not the x and y values >>>>> separately. >>>>> >>>>> I'll try to get hold of the book for a direct quote, might take a >>>>> couple >>>>> of days. >>>>> >>>>> >>>>> >>>>> On Mon, May 29, 2017 at 8:40 AM, Duncan Murdoch < >>>>> murdoch.duncan at gmail.com> >>>>> wrote: >>>>> >>>>> On 27/05/2017 9:28 PM, GlenB wrote: >>>>>> >>>>>> Bug: stats::line() does not produce correct Tukey line when n mod 6 >>>>>>> is 2 >>>>>>> or >>>>>>> 3 >>>>>>> >>>>>>> Example: line(1:9,1:9) should have intercept 0 and slope 1 but it >>>>>>> gives >>>>>>> intercept -1 and slope 1.2 >>>>>>> >>>>>>> Trying line(1:i,1:i) across a range of i makes it clear there's a >>>>>>> cycle >>>>>>> of >>>>>>> length 6, with four of every six correct. >>>>>>> >>>>>>> Bug has been present across many versions. >>>>>>> >>>>>>> The machine I just tried it on just now has R3.2.3: >>>>>>> >>>>>>> If you look at the source (in src/library/stats/src/line.c), the >>>>>> explanation is clear: the x value is chosen as the 1/6 quantile >>>>>> (according >>>>>> to a particular definition of quantile), and the y value is chosen as >>>>>> the >>>>>> median of the y values where x is less than or equal to the 1/3 >>>>>> quantile. >>>>>> Those are different definitions (though I think they would be >>>>>> asymptotically equivalent under pretty weak assumptions), so it's not >>>>>> surprising the x value doesn't correspond perfectly to the y value, >>>>>> and the >>>>>> line ends up "wrong". >>>>>> >>>>>> So is it a bug? Well, that depends on Tukey's definition. I don't >>>>>> have >>>>>> a copy of his book handy so I can't really say. Maybe the R function >>>>>> is >>>>>> doing exactly what Tukey said it should, and that's not a bug. Or >>>>>> maybe R >>>>>> is wrong. >>>>>> >>>>>> Duncan Murdoch >>>>>> >>>>>> >>>>>> [[alternative HTML version deleted]] >>>> >>>> ______________________________________________ >>>> R-devel at r-project.org mailing list >>>> https://stat.ethz.ch/mailman/listinfo/r-devel >>>> >>> >> >> > -- > Serguei Sokol > Ingenieur de recherche INRA > Metabolisme Integre et Dynamique des Systemes Metaboliques (MetaSys) > > LISBP, INSA/INRA UMR 792, INSA/CNRS UMR 5504 > 135 Avenue de Rangueil > 31077 Toulouse Cedex 04 > > tel: +33 5 6155 9276 > fax: +33 5 6704 8825 > email: sokol at insa-toulouse.fr > http://metasys.insa-toulouse.fr > http://www.lisbp.fr > > > ______________________________________________ > R-devel at r-project.org mailing list > https://stat.ethz.ch/mailman/listinfo/r-devel >[[alternative HTML version deleted]]
Serguei Sokol
2017-May-29 13:28 UTC
[Rd] stats::line() does not produce correct Tukey line when n mod 6 is 2 or 3
Sorry, I have seen it too late that we had different tab width in the original file and my editor. Here is the patch with all white spaces instead of mixing tabs and white spaces. Serguei. Le 29/05/2017 ? 15:13, Serguei Sokol a ?crit :> Here is an attached patch. > > Best, > Serguei. > > Le 29/05/2017 ? 12:21, Serguei Sokol a ?crit : >> The problem or actual R implementation relies on an assumption >> that median(x[i] | x[i] <= quantile(x, 1/3)) == quantile(x, 1/6) >> which reveals not to be true despite very trustful appearance. >> >> If we continue with the example of x=y=1:9 >> then quantile(x, 1/6)=2.5 (here quantile() is taken in C-code sens, not R's one) >> while median(y[i] | x[i] <= quantile(x, 1/3))=2 >> On the other sample's side we've got 7.5 and 8 for x and y respectively. >> Hence the slope (8-2)/(7.5-2.5)=1.2 >> >> To get a correct version of this, one should calculate x robust points in the same way as the y's, >> i.e. xb=median(x[i] | x[i] <= quantile(x, 1/3)) and xt=median(x[i] | x[i] >= quantile(x, 2/3)) >> >> Best, >> Serguei. >> >> Le 29/05/2017 ? 10:02, peter dalgaard a ?crit : >>> A usually trustworthy R correspondent posted a pure R implementation on SO at some point in his lost youth: >>> >>> https://stackoverflow.com/questions/3224731/john-tukey-median-median-or-resistant-line-statistical-test-for-r-and-line >>> >>> This one does indeed generate the line of identity for the (1:9, 1:9) case, so I do suspect that we have a genuine scr*wup in line(). >>> >>> Notice, incidentally, that >>> >>>> line(1:9+rnorm(9,,1e-1),1:9+rnorm(9,,1e-1)) >>> Call: >>> line(1:9 + rnorm(9, , 0.1), 1:9 + rnorm(9, , 0.1)) >>> >>> Coefficients: >>> [1] -0.9407 1.1948 >>> >>> I.e., it is not likely an issue with exact integers or perfect fit. >>> >>> -pd >>> >>> >>> >>>> On 29 May 2017, at 07:21 , GlenB <glnbrntt at gmail.com> wrote: >>>> >>>>> Tukey divides the points into three groups, not the x and y values >>>> separately. >>>> >>>>> I'll try to get hold of the book for a direct quote, might take a couple >>>> of days. >>>> >>>> Ah well, I can't get it for a week. But the fact that it's often called >>>> Tukey's three group line (try a search on *tukey three group line* and >>>> you'll get plenty of hits) is pretty much a giveaway. >>>> >>>> >>>> On Mon, May 29, 2017 at 2:19 PM, GlenB <glnbrntt at gmail.com> wrote: >>>> >>>>> Tukey divides the points into three groups, not the x and y values >>>>> separately. >>>>> >>>>> I'll try to get hold of the book for a direct quote, might take a couple >>>>> of days. >>>>> >>>>> >>>>> >>>>> On Mon, May 29, 2017 at 8:40 AM, Duncan Murdoch <murdoch.duncan at gmail.com> >>>>> wrote: >>>>> >>>>>> On 27/05/2017 9:28 PM, GlenB wrote: >>>>>> >>>>>>> Bug: stats::line() does not produce correct Tukey line when n mod 6 is 2 >>>>>>> or >>>>>>> 3 >>>>>>> >>>>>>> Example: line(1:9,1:9) should have intercept 0 and slope 1 but it gives >>>>>>> intercept -1 and slope 1.2 >>>>>>> >>>>>>> Trying line(1:i,1:i) across a range of i makes it clear there's a cycle >>>>>>> of >>>>>>> length 6, with four of every six correct. >>>>>>> >>>>>>> Bug has been present across many versions. >>>>>>> >>>>>>> The machine I just tried it on just now has R3.2.3: >>>>>>> >>>>>> If you look at the source (in src/library/stats/src/line.c), the >>>>>> explanation is clear: the x value is chosen as the 1/6 quantile (according >>>>>> to a particular definition of quantile), and the y value is chosen as the >>>>>> median of the y values where x is less than or equal to the 1/3 quantile. >>>>>> Those are different definitions (though I think they would be >>>>>> asymptotically equivalent under pretty weak assumptions), so it's not >>>>>> surprising the x value doesn't correspond perfectly to the y value, and the >>>>>> line ends up "wrong". >>>>>> >>>>>> So is it a bug? Well, that depends on Tukey's definition. I don't have >>>>>> a copy of his book handy so I can't really say. Maybe the R function is >>>>>> doing exactly what Tukey said it should, and that's not a bug. Or maybe R >>>>>> is wrong. >>>>>> >>>>>> Duncan Murdoch >>>>>> >>>>>> >>>> [[alternative HTML version deleted]] >>>> >>>> ______________________________________________ >>>> R-devel at r-project.org mailing list >>>> https://stat.ethz.ch/mailman/listinfo/r-devel >> >> > > > > ______________________________________________ > R-devel at r-project.org mailing list > https://stat.ethz.ch/mailman/listinfo/r-devel-- Serguei Sokol Ingenieur de recherche INRA Metabolisme Integre et Dynamique des Systemes Metaboliques (MetaSys) LISBP, INSA/INRA UMR 792, INSA/CNRS UMR 5504 135 Avenue de Rangueil 31077 Toulouse Cedex 04 tel: +33 5 6155 9276 fax: +33 5 6704 8825 email: sokol at insa-toulouse.fr http://metasys.insa-toulouse.fr http://www.lisbp.fr -------------- next part -------------- A non-text attachment was scrubbed... Name: line.c.patch Type: text/x-patch Size: 3680 bytes Desc: not available URL: <https://stat.ethz.ch/pipermail/r-devel/attachments/20170529/5c29b34e/attachment.bin>
Martin Maechler
2017-May-30 07:33 UTC
[Rd] stats::line() does not produce correct Tukey line when n mod 6 is 2 or 3
>>>>> Serguei Sokol <sokol at insa-toulouse.fr> >>>>> on Mon, 29 May 2017 15:28:12 +0200 writes:> Sorry, I have seen it too late that we had different tab > width in the original file and my editor. Here is the > patch with all white spaces instead of mixing tabs and > white spaces. thank you - it still gives quite a few unnecessary (i.e. "white space") diffs with the source code, but that's not the problem : The patch leads to correct results for the simple (1:k, 1:k) data sets (for all k). However, even after the patch, The example from the SO post differs from the result of Richie Cotton's function...(even though that function had a silly bug in step 1, the bug has not been "kicking" for the example): Here's a fixed-up version of the pure R function and the example and some comments : ## From Stackoverflow ## http://stackoverflow.com/questions/3224731/john-tukey-median-median-or-resistant-line-statistical-test-for-r-and-line ## median_median_line by Richie Cotton (July 12 2010, last edited at 13:49) ## ## Shorter variable names, fixed bug in step 1, added 'plot.' option: Martin Maechler, May 2017 MMline <- function(x, y, data, plot. = FALSE) { if(!missing(data)) { x <- eval(substitute(x), data) y <- eval(substitute(y), data) } stopifnot((n <- length(x)) == length(y), n >= 2) ## Step 1 n.3 <- n %/% 3L groups <- rep(1:3, times = switch((n %% 3) + 1L, n.3, c(n.3, n.3 + 1L, n.3), c(n.3 + 1L, n.3, n.3 + 1L) )) groups <- lapply(list(groups), as.factor) # (gain a bit in tapply()) ## Step 2 ## sort *both* x & y "along x": x <- sort.int(x, index.return = TRUE) y <- y[x$ix] x <- x$x if(plot.) plot(x,y) ## Step 3 m_x <- tapply(x, groups, median) m_y <- tapply(y, groups, median) if(plot.) { points(m_x, m_y, cex=2, pch=3, col="red") segments(m_x[1],m_y[1], m_x[3],m_y[3], col="red") } ## Step 4 R <- if(n == 2) 2L else 3L slope <- (m_y[[R]] - m_y[[1]]) / (m_x[[R]] - m_x[[1]]) intercept <- m_y[[1]] - slope * m_x[[1]] ## Step 5 mid_y <- intercept + slope * m_x[[2]] intercept <- intercept + (m_y[[2]] - mid_y) / 3 if(plot.) abline(a = intercept, b = slope, col=adjustcolor("midnight blue", .5), lwd=2) c(intercept = intercept, slope = slope) } ## To test it, here's the second example from that page: dfr <- data.frame( time = c( .16, .24, .25, .30, .30, .32, .36, .36, .50, .50, .57, .61, .61, .68, .72, .72, .83, .88, .89), distance = c(12.1, 29.8, 32.7, 42.8, 44.2, 55.8, 63.5, 65.1,124.6,129.7, 150.2,182.2,189.4,220.4,250.4, 261.0,334.5,375.5,399.1)) MMline(time, distance, dfr, plot.=TRUE) ## intercept slope ## -113.6 520.0 par(new=TRUE)# should plot identically! MMline(time, distance, dfr[sample.int(nrow(dfr)), ], plot.=TRUE) ## Note the slightly odd way of specifying the groups. The instructions are ## quite picky about how you define group sizes, so the more obvious method ## of cut(x, quantile(x, seq.int(0, 1, 1/3))) doesn't work. ## edited Jul 12 '10 at 13:49 / answered Jul 12 '10 at 13:36 ## Richie Cotton ## And someone remarked that R's line() did not give the same: with(dfr, line(time, distance)) ## ... ## Coefficients: ## [1] -108.8 505.2 abline(-108.8, 505.2, col="blue") ##=> this one is wrong ## MM: (cfs <- t(sapply(setNames(,2:50), function(k) {x <- 1:k; MMline(x,x)}))) ## intercept slope ## 2 0 1 # (special case fixed) ## 3 0 1 ## 4 0 1 ## 5 0 1 ## ............ ## 49 0 1 ## 50 0 1 ## "Harder": (sort ing!) cf2 <- t(sapply(setNames(,2:50), function(k) {x <- sample.int(k); MMline(x, -10*x)})) stopifnot(abs(cf2[,"intercept"] - 0) < 1e-12, abs(cf2[, "slope"] - -10) < 1e-12)
Apparently Analagous Threads
- stats::line() does not produce correct Tukey line when n mod 6 is 2 or 3
- stats::line() does not produce correct Tukey line when n mod 6 is 2 or 3
- stats::line() does not produce correct Tukey line when n mod 6 is 2 or 3
- stats::line() does not produce correct Tukey line when n mod 6 is 2 or 3
- stats::line() does not produce correct Tukey line when n mod 6 is 2 or 3