peter dalgaard
2017-May-29 08:02 UTC
[Rd] stats::line() does not produce correct Tukey line when n mod 6 is 2 or 3
A usually trustworthy R correspondent posted a pure R implementation on SO at some point in his lost youth: https://stackoverflow.com/questions/3224731/john-tukey-median-median-or-resistant-line-statistical-test-for-r-and-line This one does indeed generate the line of identity for the (1:9, 1:9) case, so I do suspect that we have a genuine scr*wup in line(). Notice, incidentally, that> line(1:9+rnorm(9,,1e-1),1:9+rnorm(9,,1e-1))Call: line(1:9 + rnorm(9, , 0.1), 1:9 + rnorm(9, , 0.1)) Coefficients: [1] -0.9407 1.1948 I.e., it is not likely an issue with exact integers or perfect fit. -pd> On 29 May 2017, at 07:21 , GlenB <glnbrntt at gmail.com> wrote: > >> Tukey divides the points into three groups, not the x and y values > separately. > >> I'll try to get hold of the book for a direct quote, might take a couple > of days. > > Ah well, I can't get it for a week. But the fact that it's often called > Tukey's three group line (try a search on *tukey three group line* and > you'll get plenty of hits) is pretty much a giveaway. > > > On Mon, May 29, 2017 at 2:19 PM, GlenB <glnbrntt at gmail.com> wrote: > >> Tukey divides the points into three groups, not the x and y values >> separately. >> >> I'll try to get hold of the book for a direct quote, might take a couple >> of days. >> >> >> >> On Mon, May 29, 2017 at 8:40 AM, Duncan Murdoch <murdoch.duncan at gmail.com> >> wrote: >> >>> On 27/05/2017 9:28 PM, GlenB wrote: >>> >>>> Bug: stats::line() does not produce correct Tukey line when n mod 6 is 2 >>>> or >>>> 3 >>>> >>>> Example: line(1:9,1:9) should have intercept 0 and slope 1 but it gives >>>> intercept -1 and slope 1.2 >>>> >>>> Trying line(1:i,1:i) across a range of i makes it clear there's a cycle >>>> of >>>> length 6, with four of every six correct. >>>> >>>> Bug has been present across many versions. >>>> >>>> The machine I just tried it on just now has R3.2.3: >>>> >>> >>> If you look at the source (in src/library/stats/src/line.c), the >>> explanation is clear: the x value is chosen as the 1/6 quantile (according >>> to a particular definition of quantile), and the y value is chosen as the >>> median of the y values where x is less than or equal to the 1/3 quantile. >>> Those are different definitions (though I think they would be >>> asymptotically equivalent under pretty weak assumptions), so it's not >>> surprising the x value doesn't correspond perfectly to the y value, and the >>> line ends up "wrong". >>> >>> So is it a bug? Well, that depends on Tukey's definition. I don't have >>> a copy of his book handy so I can't really say. Maybe the R function is >>> doing exactly what Tukey said it should, and that's not a bug. Or maybe R >>> is wrong. >>> >>> Duncan Murdoch >>> >>> >> > > [[alternative HTML version deleted]] > > ______________________________________________ > R-devel at r-project.org mailing list > https://stat.ethz.ch/mailman/listinfo/r-devel-- Peter Dalgaard, Professor, Center for Statistics, Copenhagen Business School Solbjerg Plads 3, 2000 Frederiksberg, Denmark Phone: (+45)38153501 Office: A 4.23 Email: pd.mes at cbs.dk Priv: PDalgd at gmail.com
Serguei Sokol
2017-May-29 10:21 UTC
[Rd] stats::line() does not produce correct Tukey line when n mod 6 is 2 or 3
The problem or actual R implementation relies on an assumption that median(x[i] | x[i] <= quantile(x, 1/3)) == quantile(x, 1/6) which reveals not to be true despite very trustful appearance. If we continue with the example of x=y=1:9 then quantile(x, 1/6)=2.5 (here quantile() is taken in C-code sens, not R's one) while median(y[i] | x[i] <= quantile(x, 1/3))=2 On the other sample's side we've got 7.5 and 8 for x and y respectively. Hence the slope (8-2)/(7.5-2.5)=1.2 To get a correct version of this, one should calculate x robust points in the same way as the y's, i.e. xb=median(x[i] | x[i] <= quantile(x, 1/3)) and xt=median(x[i] | x[i] >= quantile(x, 2/3)) Best, Serguei. Le 29/05/2017 ? 10:02, peter dalgaard a ?crit :> A usually trustworthy R correspondent posted a pure R implementation on SO at some point in his lost youth: > > https://stackoverflow.com/questions/3224731/john-tukey-median-median-or-resistant-line-statistical-test-for-r-and-line > > This one does indeed generate the line of identity for the (1:9, 1:9) case, so I do suspect that we have a genuine scr*wup in line(). > > Notice, incidentally, that > >> line(1:9+rnorm(9,,1e-1),1:9+rnorm(9,,1e-1)) > Call: > line(1:9 + rnorm(9, , 0.1), 1:9 + rnorm(9, , 0.1)) > > Coefficients: > [1] -0.9407 1.1948 > > I.e., it is not likely an issue with exact integers or perfect fit. > > -pd > > > >> On 29 May 2017, at 07:21 , GlenB <glnbrntt at gmail.com> wrote: >> >>> Tukey divides the points into three groups, not the x and y values >> separately. >> >>> I'll try to get hold of the book for a direct quote, might take a couple >> of days. >> >> Ah well, I can't get it for a week. But the fact that it's often called >> Tukey's three group line (try a search on *tukey three group line* and >> you'll get plenty of hits) is pretty much a giveaway. >> >> >> On Mon, May 29, 2017 at 2:19 PM, GlenB <glnbrntt at gmail.com> wrote: >> >>> Tukey divides the points into three groups, not the x and y values >>> separately. >>> >>> I'll try to get hold of the book for a direct quote, might take a couple >>> of days. >>> >>> >>> >>> On Mon, May 29, 2017 at 8:40 AM, Duncan Murdoch <murdoch.duncan at gmail.com> >>> wrote: >>> >>>> On 27/05/2017 9:28 PM, GlenB wrote: >>>> >>>>> Bug: stats::line() does not produce correct Tukey line when n mod 6 is 2 >>>>> or >>>>> 3 >>>>> >>>>> Example: line(1:9,1:9) should have intercept 0 and slope 1 but it gives >>>>> intercept -1 and slope 1.2 >>>>> >>>>> Trying line(1:i,1:i) across a range of i makes it clear there's a cycle >>>>> of >>>>> length 6, with four of every six correct. >>>>> >>>>> Bug has been present across many versions. >>>>> >>>>> The machine I just tried it on just now has R3.2.3: >>>>> >>>> If you look at the source (in src/library/stats/src/line.c), the >>>> explanation is clear: the x value is chosen as the 1/6 quantile (according >>>> to a particular definition of quantile), and the y value is chosen as the >>>> median of the y values where x is less than or equal to the 1/3 quantile. >>>> Those are different definitions (though I think they would be >>>> asymptotically equivalent under pretty weak assumptions), so it's not >>>> surprising the x value doesn't correspond perfectly to the y value, and the >>>> line ends up "wrong". >>>> >>>> So is it a bug? Well, that depends on Tukey's definition. I don't have >>>> a copy of his book handy so I can't really say. Maybe the R function is >>>> doing exactly what Tukey said it should, and that's not a bug. Or maybe R >>>> is wrong. >>>> >>>> Duncan Murdoch >>>> >>>> >> [[alternative HTML version deleted]] >> >> ______________________________________________ >> R-devel at r-project.org mailing list >> https://stat.ethz.ch/mailman/listinfo/r-devel-- Serguei Sokol Ingenieur de recherche INRA Metabolisme Integre et Dynamique des Systemes Metaboliques (MetaSys) LISBP, INSA/INRA UMR 792, INSA/CNRS UMR 5504 135 Avenue de Rangueil 31077 Toulouse Cedex 04 tel: +33 5 6155 9276 fax: +33 5 6704 8825 email: sokol at insa-toulouse.fr http://metasys.insa-toulouse.fr http://www.lisbp.fr
Serguei Sokol
2017-May-29 13:13 UTC
[Rd] stats::line() does not produce correct Tukey line when n mod 6 is 2 or 3
Here is an attached patch. Best, Serguei. Le 29/05/2017 ? 12:21, Serguei Sokol a ?crit :> The problem or actual R implementation relies on an assumption > that median(x[i] | x[i] <= quantile(x, 1/3)) == quantile(x, 1/6) > which reveals not to be true despite very trustful appearance. > > If we continue with the example of x=y=1:9 > then quantile(x, 1/6)=2.5 (here quantile() is taken in C-code sens, not R's one) > while median(y[i] | x[i] <= quantile(x, 1/3))=2 > On the other sample's side we've got 7.5 and 8 for x and y respectively. > Hence the slope (8-2)/(7.5-2.5)=1.2 > > To get a correct version of this, one should calculate x robust points in the same way as the y's, > i.e. xb=median(x[i] | x[i] <= quantile(x, 1/3)) and xt=median(x[i] | x[i] >= quantile(x, 2/3)) > > Best, > Serguei. > > Le 29/05/2017 ? 10:02, peter dalgaard a ?crit : >> A usually trustworthy R correspondent posted a pure R implementation on SO at some point in his lost youth: >> >> https://stackoverflow.com/questions/3224731/john-tukey-median-median-or-resistant-line-statistical-test-for-r-and-line >> >> This one does indeed generate the line of identity for the (1:9, 1:9) case, so I do suspect that we have a genuine scr*wup in line(). >> >> Notice, incidentally, that >> >>> line(1:9+rnorm(9,,1e-1),1:9+rnorm(9,,1e-1)) >> Call: >> line(1:9 + rnorm(9, , 0.1), 1:9 + rnorm(9, , 0.1)) >> >> Coefficients: >> [1] -0.9407 1.1948 >> >> I.e., it is not likely an issue with exact integers or perfect fit. >> >> -pd >> >> >> >>> On 29 May 2017, at 07:21 , GlenB <glnbrntt at gmail.com> wrote: >>> >>>> Tukey divides the points into three groups, not the x and y values >>> separately. >>> >>>> I'll try to get hold of the book for a direct quote, might take a couple >>> of days. >>> >>> Ah well, I can't get it for a week. But the fact that it's often called >>> Tukey's three group line (try a search on *tukey three group line* and >>> you'll get plenty of hits) is pretty much a giveaway. >>> >>> >>> On Mon, May 29, 2017 at 2:19 PM, GlenB <glnbrntt at gmail.com> wrote: >>> >>>> Tukey divides the points into three groups, not the x and y values >>>> separately. >>>> >>>> I'll try to get hold of the book for a direct quote, might take a couple >>>> of days. >>>> >>>> >>>> >>>> On Mon, May 29, 2017 at 8:40 AM, Duncan Murdoch <murdoch.duncan at gmail.com> >>>> wrote: >>>> >>>>> On 27/05/2017 9:28 PM, GlenB wrote: >>>>> >>>>>> Bug: stats::line() does not produce correct Tukey line when n mod 6 is 2 >>>>>> or >>>>>> 3 >>>>>> >>>>>> Example: line(1:9,1:9) should have intercept 0 and slope 1 but it gives >>>>>> intercept -1 and slope 1.2 >>>>>> >>>>>> Trying line(1:i,1:i) across a range of i makes it clear there's a cycle >>>>>> of >>>>>> length 6, with four of every six correct. >>>>>> >>>>>> Bug has been present across many versions. >>>>>> >>>>>> The machine I just tried it on just now has R3.2.3: >>>>>> >>>>> If you look at the source (in src/library/stats/src/line.c), the >>>>> explanation is clear: the x value is chosen as the 1/6 quantile (according >>>>> to a particular definition of quantile), and the y value is chosen as the >>>>> median of the y values where x is less than or equal to the 1/3 quantile. >>>>> Those are different definitions (though I think they would be >>>>> asymptotically equivalent under pretty weak assumptions), so it's not >>>>> surprising the x value doesn't correspond perfectly to the y value, and the >>>>> line ends up "wrong". >>>>> >>>>> So is it a bug? Well, that depends on Tukey's definition. I don't have >>>>> a copy of his book handy so I can't really say. Maybe the R function is >>>>> doing exactly what Tukey said it should, and that's not a bug. Or maybe R >>>>> is wrong. >>>>> >>>>> Duncan Murdoch >>>>> >>>>> >>> [[alternative HTML version deleted]] >>> >>> ______________________________________________ >>> R-devel at r-project.org mailing list >>> https://stat.ethz.ch/mailman/listinfo/r-devel > >-- Serguei Sokol Ingenieur de recherche INRA Metabolisme Integre et Dynamique des Systemes Metaboliques (MetaSys) LISBP, INSA/INRA UMR 792, INSA/CNRS UMR 5504 135 Avenue de Rangueil 31077 Toulouse Cedex 04 tel: +33 5 6155 9276 fax: +33 5 6704 8825 email: sokol at insa-toulouse.fr http://metasys.insa-toulouse.fr http://www.lisbp.fr -------------- next part -------------- A non-text attachment was scrubbed... Name: line.c.patch Type: text/x-patch Size: 2651 bytes Desc: not available URL: <https://stat.ethz.ch/pipermail/r-devel/attachments/20170529/1f1f2e6c/attachment.bin>
Possibly Parallel Threads
- stats::line() does not produce correct Tukey line when n mod 6 is 2 or 3
- stats::line() does not produce correct Tukey line when n mod 6 is 2 or 3
- [New Patch] Fix disk corruption when writing
- stats::line() does not produce correct Tukey line when n mod 6 is 2 or 3
- stats::line() does not produce correct Tukey line when n mod 6 is 2 or 3