GlenB
2017-May-29 04:19 UTC
[Rd] stats::line() does not produce correct Tukey line when n mod 6 is 2 or 3
Tukey divides the points into three groups, not the x and y values separately.

I'll try to get hold of the book for a direct quote, might take a couple of days.

On Mon, May 29, 2017 at 8:40 AM, Duncan Murdoch <murdoch.duncan at gmail.com> wrote:

> On 27/05/2017 9:28 PM, GlenB wrote:
>
>> Bug: stats::line() does not produce correct Tukey line when n mod 6 is 2
>> or 3
>>
>> Example: line(1:9,1:9) should have intercept 0 and slope 1 but it gives
>> intercept -1 and slope 1.2
>>
>> Trying line(1:i,1:i) across a range of i makes it clear there's a cycle of
>> length 6, with four of every six correct.
>>
>> Bug has been present across many versions.
>>
>> The machine I just tried it on just now has R3.2.3:
>
> If you look at the source (in src/library/stats/src/line.c), the
> explanation is clear: the x value is chosen as the 1/6 quantile (according
> to a particular definition of quantile), and the y value is chosen as the
> median of the y values where x is less than or equal to the 1/3 quantile.
> Those are different definitions (though I think they would be
> asymptotically equivalent under pretty weak assumptions), so it's not
> surprising the x value doesn't correspond perfectly to the y value, and the
> line ends up "wrong".
>
> So is it a bug? Well, that depends on Tukey's definition. I don't have a
> copy of his book handy so I can't really say. Maybe the R function is
> doing exactly what Tukey said it should, and that's not a bug. Or maybe R
> is wrong.
>
> Duncan Murdoch
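The distinction drawn in this message, grouping the points first and then taking the medians of x and y within the same group, can be sketched in R. This is a minimal sketch only: it is not the code in line.c and not necessarily Tukey's exact rule. The function name three_group_line is hypothetical, and two choices are assumptions made for illustration: the outer groups are cut at the 1/3 and 2/3 sample quantiles of x, and the intercept is taken as the median residual.

```r
# Hypothetical helper, not part of the stats package.
three_group_line <- function(x, y) {
  stopifnot(length(x) == length(y), length(x) >= 2)

  # Assumption: outer groups are cut at the 1/3 and 2/3 sample quantiles of x.
  q <- quantile(x, c(1, 2) / 3, names = FALSE)
  left  <- x <= q[1]
  right <- x >= q[2]

  # The medians of x and y are taken over the same points in each group,
  # so the summary x and summary y always refer to the same subset.
  xL <- median(x[left]);  yL <- median(y[left])
  xR <- median(x[right]); yR <- median(y[right])

  slope <- (yR - yL) / (xR - xL)
  # Assumption: intercept is the median of the residuals y - slope * x.
  intercept <- median(y - slope * x)

  c(intercept = intercept, slope = slope)
}

three_group_line(1:9, 1:9)
```

Because both medians in each outer group come from the same points, data lying exactly on y = x give slope 1 and intercept 0 for any n of at least 2 under these assumptions.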
GlenB
2017-May-29 05:21 UTC
[Rd] stats::line() does not produce correct Tukey line when n mod 6 is 2 or 3
> Tukey divides the points into three groups, not the x and y values
> separately.
>
> I'll try to get hold of the book for a direct quote, might take a couple
> of days.

Ah well, I can't get it for a week. But the fact that it's often called Tukey's three group line (try a search on *tukey three group line* and you'll get plenty of hits) is pretty much a giveaway.
peter dalgaard
2017-May-29 08:02 UTC
[Rd] stats::line() does not produce correct Tukey line when n mod 6 is 2 or 3
A usually trustworthy R correspondent posted a pure R implementation on SO at some point in his lost youth:

https://stackoverflow.com/questions/3224731/john-tukey-median-median-or-resistant-line-statistical-test-for-r-and-line

This one does indeed generate the line of identity for the (1:9, 1:9) case, so I do suspect that we have a genuine scr*wup in line().

Notice, incidentally, that

> line(1:9+rnorm(9,,1e-1),1:9+rnorm(9,,1e-1))

Call:
line(1:9 + rnorm(9, , 0.1), 1:9 + rnorm(9, , 0.1))

Coefficients:
[1]  -0.9407   1.1948

I.e., it is not likely an issue with exact integers or perfect fit.

-pd

-- 
Peter Dalgaard, Professor,
Center for Statistics, Copenhagen Business School
Solbjerg Plads 3, 2000 Frederiksberg, Denmark
Phone: (+45)38153501
Office: A 4.23
Email: pd.mes at cbs.dk  Priv: PDalgd at gmail.com
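The checks described in this thread can be repeated with a short scan. The following is a sketch that assumes only stats::line() and coef(). What it prints depends on the R version in use, so no expected output is shown. The thread reports deviations when n mod 6 is 2 or 3 on R 3.2.3.

```r
# Scan line(1:i, 1:i) over four full cycles of i mod 6.
check <- t(sapply(6:29, function(i) {
  cf <- coef(line(1:i, 1:i))
  c(n = i, mod6 = i %% 6, intercept = cf[[1]], slope = cf[[2]])
}))

# Rows where the fit is not the identity line (intercept 0, slope 1).
off <- abs(check[, "intercept"]) > 1e-8 | abs(check[, "slope"] - 1) > 1e-8
print(check[off, , drop = FALSE])

# Jittered version of the n = 9 case, as in the message above;
# the seed is arbitrary and only makes the run repeatable.
set.seed(1)
print(coef(line(1:9 + rnorm(9, sd = 0.1), 1:9 + rnorm(9, sd = 0.1))))
```

If the printed matrix is empty, the installed version returns the identity line for every n in the scanned range. Otherwise the mod6 column shows which residues are affected.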