Martin Maechler
2017-May-30 07:33 UTC
[Rd] stats::line() does not produce correct Tukey line when n mod 6 is 2 or 3
>>>>> Serguei Sokol <sokol at insa-toulouse.fr> >>>>> on Mon, 29 May 2017 15:28:12 +0200 writes:> Sorry, I have seen it too late that we had different tab > width in the original file and my editor. Here is the > patch with all white spaces instead of mixing tabs and > white spaces. thank you - it still gives quite a few unnecessary (i.e. "white space") diffs with the source code, but that's not the problem : The patch leads to correct results for the simple (1:k, 1:k) data sets (for all k). However, even after the patch, The example from the SO post differs from the result of Richie Cotton's function...(even though that function had a silly bug in step 1, the bug has not been "kicking" for the example): Here's a fixed-up version of the pure R function and the example and some comments : ## From Stackoverflow ## http://stackoverflow.com/questions/3224731/john-tukey-median-median-or-resistant-line-statistical-test-for-r-and-line ## median_median_line by Richie Cotton (July 12 2010, last edited at 13:49) ## ## Shorter variable names, fixed bug in step 1, added 'plot.' option: Martin Maechler, May 2017 MMline <- function(x, y, data, plot. = FALSE) { if(!missing(data)) { x <- eval(substitute(x), data) y <- eval(substitute(y), data) } stopifnot((n <- length(x)) == length(y), n >= 2) ## Step 1 n.3 <- n %/% 3L groups <- rep(1:3, times = switch((n %% 3) + 1L, n.3, c(n.3, n.3 + 1L, n.3), c(n.3 + 1L, n.3, n.3 + 1L) )) groups <- lapply(list(groups), as.factor) # (gain a bit in tapply()) ## Step 2 ## sort *both* x & y "along x": x <- sort.int(x, index.return = TRUE) y <- y[x$ix] x <- x$x if(plot.) plot(x,y) ## Step 3 m_x <- tapply(x, groups, median) m_y <- tapply(y, groups, median) if(plot.) { points(m_x, m_y, cex=2, pch=3, col="red") segments(m_x[1],m_y[1], m_x[3],m_y[3], col="red") } ## Step 4 R <- if(n == 2) 2L else 3L slope <- (m_y[[R]] - m_y[[1]]) / (m_x[[R]] - m_x[[1]]) intercept <- m_y[[1]] - slope * m_x[[1]] ## Step 5 mid_y <- intercept + slope * m_x[[2]] intercept <- intercept + (m_y[[2]] - mid_y) / 3 if(plot.) abline(a = intercept, b = slope, col=adjustcolor("midnight blue", .5), lwd=2) c(intercept = intercept, slope = slope) } ## To test it, here's the second example from that page: dfr <- data.frame( time = c( .16, .24, .25, .30, .30, .32, .36, .36, .50, .50, .57, .61, .61, .68, .72, .72, .83, .88, .89), distance = c(12.1, 29.8, 32.7, 42.8, 44.2, 55.8, 63.5, 65.1,124.6,129.7, 150.2,182.2,189.4,220.4,250.4, 261.0,334.5,375.5,399.1)) MMline(time, distance, dfr, plot.=TRUE) ## intercept slope ## -113.6 520.0 par(new=TRUE)# should plot identically! MMline(time, distance, dfr[sample.int(nrow(dfr)), ], plot.=TRUE) ## Note the slightly odd way of specifying the groups. The instructions are ## quite picky about how you define group sizes, so the more obvious method ## of cut(x, quantile(x, seq.int(0, 1, 1/3))) doesn't work. ## edited Jul 12 '10 at 13:49 / answered Jul 12 '10 at 13:36 ## Richie Cotton ## And someone remarked that R's line() did not give the same: with(dfr, line(time, distance)) ## ... ## Coefficients: ## [1] -108.8 505.2 abline(-108.8, 505.2, col="blue") ##=> this one is wrong ## MM: (cfs <- t(sapply(setNames(,2:50), function(k) {x <- 1:k; MMline(x,x)}))) ## intercept slope ## 2 0 1 # (special case fixed) ## 3 0 1 ## 4 0 1 ## 5 0 1 ## ............ ## 49 0 1 ## 50 0 1 ## "Harder": (sort ing!) cf2 <- t(sapply(setNames(,2:50), function(k) {x <- sample.int(k); MMline(x, -10*x)})) stopifnot(abs(cf2[,"intercept"] - 0) < 1e-12, abs(cf2[, "slope"] - -10) < 1e-12)
Serguei Sokol
2017-May-30 14:01 UTC
[Rd] stats::line() does not produce correct Tukey line when n mod 6 is 2 or 3
Le 30/05/2017 ? 09:33, Martin Maechler a ?crit : ...> However, even after the patch, > The example from the SO post differs from the result of Richie > Cotton's function...The explanation is quite simple. In SO function, the first 1/3 quantile of used example counts 6 points (of 19 in total), while line()'s definition of quantile leads to 8 points. The same numbers (6 and 8) are on the other end of sample. In x sample, there are few repeated values, this is certainly be the reason of different quantiles.. I am not sure that one quantile definition is better or more correct than the other. So I would leave line()'s definition as is. Best, Sergue?.
Martin Maechler
2017-May-30 16:51 UTC
[Rd] stats::line() does not produce correct Tukey line when n mod 6 is 2 or 3
>>>>> Serguei Sokol <sokol at insa-toulouse.fr> >>>>> on Tue, 30 May 2017 16:01:17 +0200 writes:> Le 30/05/2017 ? 09:33, Martin Maechler a ?crit : ... >> However, even after the patch, The example from the SO >> post differs from the result of Richie Cotton's >> function... > The explanation is quite simple. In SO function, the first > 1/3 quantile of used example counts 6 points (of 19 in > total), while line()'s definition of quantile leads to 8 > points. The same numbers (6 and 8) are on the other end of > sample. so the number of obs. for the three thirds for line() are {8, 3, 8} in line() [also, after your patch, right?] whereas in MMline() they are as they should be, namely {6, 7, 6} But the {8, 3, 8} split is not at all what all "the literature", including Tukey himself says that "should" be done. (Other literature on the topic suggests that the optimal sizes of the split in three groups depends on the distribution of x ..) OTOH, MMline() does exactly what "the literature" and also the reference on the ?line help pages says. > In x sample, there are few repeated values, this > is certainly be the reason of different quantiles.. > I am not sure that one quantile definition is better or > more correct than the other. > So I would leave line()'s definition as is. you mean _after_ applying your patch, I assume. I currently tend do disagree. If we change line() we should rather fix more .. Note the 'Subject' you've chosen for this thread, "... does not produce the correct Tukey line", so I think we should get better. Apart from Richie / my MMline() function, I've also noticed that ACSWR :: resistant_line() exists. However "the literature" (see references below), notably the two with Hoaglin, strongly recommends smarter iterations, and -- lo and behold! -- when this topic came up last (for me) in Dec. 2014, I did spend about 2 days work (or more?) to get the FORTRAN code from the 1981 - book (which is abbreviated the "ABC of EDA") from a somewhat useful OCR scan into compilable Fortran code and then f2c'ed, wrote an R interface function found problems i.e., bugs, including infinite loops, fixed most AFAICS, but somehow did not finish making the result available. Yes, and I have too many other things on my desk... this will have to wait! References: Tukey, J. W. (1977). _Exploratory Data Analysis_, Reading Massachusetts: Addison-Wesley. Velleman, P. F. and Hoaglin, D. C. (1981) _Applications, Basics and Computing of Exploratory Data Analysis_ Duxbury Press. Emerson, J. D. and Hoaglin, D. C. (1983) Resistant Lines for y versus x. Chapter 5 of _Understanding Robust and Exploratory Data Analysis_, eds. David C. Hoaglin, Frederick Mosteller and John W. Tukey. Wiley. Iain M. Johnstone and Paul F. Velleman (1985) The Resistant Line and Related Regression Methods. _Journal of the American Statistical Association_ *80*, 1041-1054. <URL: https://dx.doi.org/10.1080/01621459.1985.10478222> > Best, Sergue?. Martin Maechler, ETH Zurich (and R core team)