Martin Maechler
2017-May-30 16:51 UTC
[Rd] stats::line() does not produce correct Tukey line when n mod 6 is 2 or 3
>>>>> Serguei Sokol <sokol at insa-toulouse.fr>
>>>>>     on Tue, 30 May 2017 16:01:17 +0200 writes:

    > On 30/05/2017 at 09:33, Martin Maechler wrote:
    ...
    >> However, even after the patch, the example from the SO
    >> post differs from the result of Richie Cotton's
    >> function...

    > The explanation is quite simple. In the SO function, the first
    > 1/3 quantile of the example used counts 6 points (of 19 in
    > total), while line()'s definition of quantile leads to 8
    > points. The same numbers (6 and 8) are on the other end of
    > the sample.

So the numbers of observations in the three thirds are

    {8, 3, 8}

in line() [also after your patch, right?], whereas in MMline() they are, as they should be,

    {6, 7, 6}

But the {8, 3, 8} split is not at all what "the literature", including Tukey himself, says "should" be done. (Other literature on the topic suggests that the optimal sizes of the three groups depend on the distribution of x.)

OTOH, MMline() does exactly what "the literature", and also the reference on the ?line help page, says.

    > In the x sample, there are a few repeated values; this
    > is certainly the reason for the different quantiles.

    > I am not sure that one quantile definition is better or
    > more correct than the other.

    > So I would leave line()'s definition as is.

You mean _after_ applying your patch, I assume.

I currently tend to disagree. If we change line(), we should rather fix more. Note the 'Subject' you've chosen for this thread, "... does not produce the correct Tukey line", so I think we should get better.

Apart from Richie's / my MMline() function, I've also noticed that ACSWR::resistant_line() exists.

However, "the literature" (see the references below), notably the two with Hoaglin, strongly recommends smarter iterations, and -- lo and behold! -- when this topic last came up (for me) in Dec. 2014, I spent about two days of work (or more?) getting the FORTRAN code from the 1981 book (abbreviated as the "ABC of EDA") from a somewhat useful OCR scan into compilable Fortran code. I then f2c'ed it, wrote an R interface function, found problems (i.e., bugs, including infinite loops), and fixed most of them AFAICS, but somehow did not finish making the result available.

Yes, and I have too many other things on my desk... this will have to wait!

References:

Tukey, J. W. (1977). _Exploratory Data Analysis_. Reading, Massachusetts: Addison-Wesley.

Velleman, P. F. and Hoaglin, D. C. (1981). _Applications, Basics and Computing of Exploratory Data Analysis_. Duxbury Press.

Emerson, J. D. and Hoaglin, D. C. (1983). Resistant Lines for y versus x. Chapter 5 of _Understanding Robust and Exploratory Data Analysis_, eds. David C. Hoaglin, Frederick Mosteller and John W. Tukey. Wiley.

Johnstone, I. M. and Velleman, P. F. (1985). The Resistant Line and Related Regression Methods. _Journal of the American Statistical Association_ *80*, 1041-1054. <URL: https://dx.doi.org/10.1080/01621459.1985.10478222>

    > Best, Serguei.

Martin Maechler, ETH Zurich (and R core team)
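The {6, 7, 6} split discussed above can be illustrated with a small R sketch. It assumes distinct x values and a symmetric allocation of leftover points (one leftover goes to the middle group, two leftovers go one to each outer group), which agrees with the {6, 7, 6} figure quoted for n = 19. The helper name `tukey_group_sizes` is hypothetical and is not part of stats::line() or MMline().

```r
# Hypothetical helper, not part of stats::line() or MMline().
# Sizes of the left, middle and right groups for n distinct, ordered
# x values under the symmetric allocation rule assumed in the lead-in.
tukey_group_sizes <- function(n) {
  k <- n %/% 3              # base size of each third
  r <- n %% 3               # leftover points: 0, 1 or 2
  outer <- k + (r == 2)     # two leftovers: one extra point per outer group
  middle <- n - 2 * outer   # whatever remains goes to the middle group
  c(left = outer, middle = middle, right = outer)
}

sizes <- tukey_group_sizes(19)
print(sizes)
```

For n = 19 this gives 6, 7, 6. Ties in x are deliberately ignored here; handling them is exactly where the implementations in this thread differ.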
GlenB
2017-May-31 04:13 UTC
[Rd] stats::line() does not produce correct Tukey line when n mod 6 is 2 or 3
Martin Maechler says, in reply to Serguei Sokol:

> Note the 'Subject' you've chosen for this thread,
> "... does not produce the correct Tukey line",

The choice of title was mine, not Serguei's; I posted the original message where the error was pointed out.

I agree with Martin's assessment that the correct split (both by Tukey's lights and by general practice) for 19 points would be 6, 7, 6, and I also agree that it's better to "fix more" in this instance, where possible (e.g. Johnstone & Velleman's standard errors would be a nice thing to add if feasible). But if any blame is attached to the choice of title, it really should be aimed at me.

Glen
Serguei Sokol
2017-May-31 13:06 UTC
[Rd] stats::line() does not produce correct Tukey line when n mod 6 is 2 or 3
On 30/05/2017 at 18:51, Martin Maechler wrote:

>>>>>> Serguei Sokol <sokol at insa-toulouse.fr>
>>>>>>     on Tue, 30 May 2017 16:01:17 +0200 writes:
>
>     > On 30/05/2017 at 09:33, Martin Maechler wrote:
>     ...
>     >> However, even after the patch, the example from the SO
>     >> post differs from the result of Richie Cotton's
>     >> function...
>     > The explanation is quite simple. In the SO function, the first
>     > 1/3 quantile of the example used counts 6 points (of 19 in
>     > total), while line()'s definition of quantile leads to 8
>     > points. The same numbers (6 and 8) are on the other end of
>     > the sample.
>
> So the numbers of observations in the three thirds are
>     {8, 3, 8}
> in line() [also after your patch, right?], whereas in MMline()
> they are, as they should be,
>     {6, 7, 6}
>
> But the {8, 3, 8} split is not at all what "the literature",
> including Tukey himself, says "should" be done.
> (Other literature on the topic suggests that the optimal sizes
> of the three groups depend on the distribution of x.)
>
> OTOH, MMline() does exactly what "the literature", and also the
> reference on the ?line help page, says.

Well, what I have seen so far in the "literature" was mention of 1/3 quantiles (but yes, I could have overlooked something, as I did not spend too much time on it). So the distribution of the sample into three groups boils down to which quantile definition to use. It turns out that line()'s version (you are right, _after_ the patch, but my patch left this definition untouched) is consistent with R's own. If you do in R

    sum(dfr$time <= quantile(dfr$time, 1./3.))

you get 8, not 6 (and the same on the 2/3 end). To my mind, consistency with the rest of R, namely with the quantile definition, is a good enough argument to leave line()'s definition as is.

Serguei.
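The effect of ties on the quantile-based grouping can be reproduced with a small sketch. The vector `x` below is made up for illustration (it is not the `dfr$time` data from the thread): 19 points with ties at the 7th/8th and 12th/13th order statistics. The grouping rule, comparing each point with the default 1/3 and 2/3 quantiles, is an assumption based on the description in this thread, not a transcription of line()'s C code.

```r
# Made-up data (NOT the dfr$time values from the thread): 19 points with
# ties at the 7th/8th and 12th/13th order statistics.
x <- c(1:7, 7, 8:11, 11, 12:17)

# Lower and upper tertiles with R's default quantile definition (type = 7).
q <- quantile(x, c(1, 2) / 3)

# Assumed grouping rule: points at or below the lower tertile go left,
# points at or above the upper tertile go right, the rest go in the middle.
sizes <- c(left   = sum(x <= q[[1]]),
           middle = sum(x > q[[1]] & x < q[[2]]),
           right  = sum(x >= q[[2]]))
print(sizes)
```

Because each tertile coincides with a tied value, both tied points fall into the outer group, which is how a nominal third of 19 points can grow beyond 6.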
Joris Meys
2017-May-31 13:40 UTC
[Rd] stats::line() does not produce correct Tukey line when n mod 6 is 2 or 3
OTOH,

    > sapply(1:9, function(i){
    +   sum(dfr$time <= quantile(dfr$time, 1./3., type = i))
    + })
    [1] 8 8 6 6 6 6 8 6 6

Only the default (type = 7) and the first two types give the result line() gives now. I think there are plenty of reasons why any of the other 6 types might be better suited to Tukey's method.

So to my mind, changing the definition of line() to give sensible output that is in accordance with the theory does not imply any inconsistency with the quantile definition in R. At least not with 6 of the 9 different ones ;-)

Cheers
Joris

--
Joris Meys
Statistical consultant
Ghent University, Faculty of Bioscience Engineering
Department of Mathematical Modelling, Statistics and Bio-Informatics
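The comparison across quantile types can be tried without the original data. The sketch below uses a made-up vector of 19 points (not `dfr$time`) in which the 7th and 8th order statistics are tied, and counts the points at or below the 1/3 quantile for each of the nine `type` values accepted by quantile().

```r
# Made-up data (NOT the dfr$time values from the thread): 19 points with
# a tie at the 7th/8th order statistics.
x <- c(1:7, 7, 8:11, 11, 12:17)

# Number of points at or below the 1/3 quantile, for each quantile type.
counts <- sapply(1:9, function(type) {
  sum(x <= quantile(x, 1 / 3, type = type))
})
names(counts) <- paste0("type", 1:9)
print(counts)
```

With these values, the types that return the 7th order statistic pick up both tied points, while the types that return the 6th order statistic or interpolate between the 6th and 7th stop at 6, which is the same mechanism as in the output quoted above.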