Serguei Sokol
2017-May-31 16:46 UTC
[Rd] stats::line() does not produce correct Tukey line when n mod 6 is 2 or 3
Le 31/05/2017 ? 17:30, Serguei Sokol a ?crit :> > More thorough reading revealed that I have overlooked this phrase in the > line's doc: "left and right /thirds/ of the data" (emphasis is mine).Oops. I have read the first ref returned by google and it happened to be tibco's doc, not the R's one. The layout is very similar hence my mistake. The latter does not mention "thirds" but ... Anyway, here is a new line's patch which still gives a result slightly different form MMline(). The slope is the same but not the intercept. What are the exact terms for intercept calculation that should be implemented? Serguei. -------------- next part -------------- --- line.c.orig 2016-03-17 00:03:03.000000000 +0100 +++ line.c 2017-05-31 18:24:55.330030280 +0200 @@ -17,7 +17,7 @@ * https://www.R-project.org/Licenses/ */ -#include <R_ext/Utils.h> /* R_rsort() */ +#include <R_ext/Utils.h> /* R_rsort() */ #include <math.h> #include <Rinternals.h> @@ -25,8 +25,8 @@ /* Speed up by `inlining' these (as macros) [since R version 1.2] : */ #if 1 -#define il(n,x) (int)floor((n - 1) * x) -#define iu(n,x) (int) ceil((n - 1) * x) +#define il(n,x) (int)floor(((n) - 1) * (x)) +#define iu(n,x) (int) ceil(((n) - 1) * (x)) #else static int il(int n, double x) @@ -41,71 +41,68 @@ #endif static void line(double *x, double *y, /* input (x[i],y[i])s */ - double *z, double *w, /* work and output: resid. & fitted */ - /* all the above of length */ int n, - double coef[2]) + double *z, double *w, /* work and output: resid. & fitted */ + int *indx, /* to get thirds of points independently of repeated values */ + /* all the above of length */ int n, + double coef[2]) { int i, j, k; double xb, x1, x2, xt, yt, yb, tmp1, tmp2; double slope, yint; for(i = 0 ; i < n ; i++) { - z[i] = x[i]; - w[i] = y[i]; + z[i] = x[i]; + w[i] = y[i]; + indx[i] = i; } - R_rsort(z, n);/* z = ordered abscissae */ + rsort_with_index(z, indx, n);/* z = ordered abscissae */ - tmp1 = z[il(n, 1./6.)]; - tmp2 = z[iu(n, 1./6.)]; xb = 0.5*(tmp1+tmp2); - - tmp1 = z[il(n, 2./6.)]; - tmp2 = z[iu(n, 2./6.)]; x1 = 0.5*(tmp1+tmp2); - - tmp1 = z[il(n, 4./6.)]; - tmp2 = z[iu(n, 4./6.)]; x2 = 0.5*(tmp1+tmp2); - - tmp1 = z[il(n, 5./6.)]; - tmp2 = z[iu(n, 5./6.)]; xt = 0.5*(tmp1+tmp2); + k=(n+1)/3; + tmp1 = z[il(k, 0.5)]; + tmp2 = z[iu(k, 0.5)]; + /* xb := Median(first third of x) */ + xb = 0.5*(tmp1+tmp2); + + tmp1 = z[n-k+il(k, 0.5)]; + tmp2 = z[n-k+iu(k, 0.5)]; + /* xt := Median(last third of x) */ + xt = 0.5*(tmp1+tmp2); slope = 0.; for(j = 1 ; j <= 1 ; j++) { - /* yb := Median(y[i]; x[i] <= quantile(x, 1/3) */ - k = 0; - for(i = 0 ; i < n ; i++) - if(x[i] <= x1) - z[k++] = w[i]; - R_rsort(z, k); - yb = 0.5 * (z[il(k, 0.5)] + z[iu(k, 0.5)]); - - /* yt := Median(y[i]; x[i] >= quantile(x, 2/3) */ - k = 0; - for(i = 0 ; i < n ; i++) - if(x[i] >= x2) - z[k++] = w[i]; - R_rsort(z,k); - yt = 0.5 * (z[il(k, 0.5)] + z[iu(k, 0.5)]); - - slope += (yt - yb)/(xt - xb); - for(i = 0 ; i < n ; i++) { - z[i] = y[i] - slope*x[i]; - /* never used: w[i] = z[i]; */ - } - R_rsort(z,n); - yint = 0.5 * (z[il(n, 0.5)] + z[iu(n, 0.5)]); + /* yb := Median(y[i]; x[i] in first third of x) */ + for(i = 0 ; i < k ; i++) + z[i] = w[indx[i]]; + R_rsort(z, k); + yb = 0.5 * (z[il(k, 0.5)] + z[iu(k, 0.5)]); + + /* yt := Median(y[i]; x[i] in last third of x */ + for(i = 0 ; i < k ; i++) + z[i] = w[indx[n-k+i]]; + R_rsort(z,k); + yt = 0.5 * (z[il(k, 0.5)] + z[iu(k, 0.5)]); + + slope += (yt - yb)/(xt - xb); + for(i = 0 ; i < n ; i++) { + z[i] = y[i] - slope*x[i]; + /* never used: w[i] = z[i]; */ + } + R_rsort(z,n); + yint = 0.5 * (z[il(n, 0.5)] + z[iu(n, 0.5)]); } for( i = 0 ; i < n ; i++ ) { - w[i] = yint + slope*x[i]; - z[i] = y[i] - w[i]; + w[i] = yint + slope*x[i]; + z[i] = y[i] - w[i]; } coef[0] = yint; coef[1] = slope; } -void tukeyline0(double *x, double *y, double *z, double *w, int *n, - double *coef) +void tukeyline0(double *x, double *y, double *z, double *w, int *indx, int *n, + double *coef) { - line(x, y, z, w, *n, coef); + line(x, y, z, w, indx, *n, coef); } SEXP tukeyline(SEXP x, SEXP y, SEXP call) @@ -127,7 +124,8 @@ SET_VECTOR_ELT(ans, 2, res); SEXP fit = allocVector(REALSXP, n); SET_VECTOR_ELT(ans, 3, fit); - line(REAL(x), REAL(y), REAL(res), REAL(fit), n, REAL(coef)); + SEXP indx = allocVector(INTSXP, n); + line(REAL(x), REAL(y), REAL(res), REAL(fit), INTEGER(indx), n, REAL(coef)); UNPROTECT(1); return ans; }
Martin Maechler
2017-May-31 20:00 UTC
[Rd] stats::line() does not produce correct Tukey line when n mod 6 is 2 or 3
>>>>> Serguei Sokol <sokol at insa-toulouse.fr> >>>>> on Wed, 31 May 2017 18:46:34 +0200 writes:> Le 31/05/2017 ? 17:30, Serguei Sokol a ?crit : >> >> More thorough reading revealed that I have overlooked this phrase in the >> line's doc: "left and right /thirds/ of the data" (emphasis is mine). > Oops. I have read the first ref returned by google and it happened to be > tibco's doc, not the R's one. The layout is very similar hence my mistake. > The latter does not mention "thirds" but ... > Anyway, here is a new line's patch which still gives a result slightly different > form MMline(). The slope is the same but not the intercept. > What are the exact terms for intercept calculation that should be implemented? > Serguei. Sorry Serguei, I have new version of line.c since yesterday, and will not be disturbed anymore. Note that I *did* give the litterature, and it seems most discussants don't have paper books in physical libraries anymore; In this case, interestingly, you need one of those I think - almost everything I found online did not have the exact details. Peter Dalgaard definitely was right that Tukey did not use quantiles at all, and notably did *not* define the three groups via {i; x_i <= x_L} and {i; x_i >= X_R} which (as I think you noticed) may make the groups quite unbalanced in case of duplicated x's. But then, for now I had decided to fix the bug (namely computing the x-medians wrongly as you diagnosed correctly(!) -- but your first 2 patches only fixed partly) *and* go at least one step in the direction of Tukey's original, namely by allowing iteration via a new 'iter' argument. I have also updated the help page to document what line() has been computing all these years {apart from the bug which typically shows for non-equidistant x[]}. We could also consider to eventually add a new 'method = <string>' argument to line() one version of which would continue to compute the current solution, another would compute the one corresponding to Velleman & Hoaglin (1981)'s FORTRAN implementation (which had to be corrected for some infinite-loop cases!)... not in the close future though Given all this discussions here, I think I should commit what I currently have ASAP. Martin > x[DELETED ATTACHMENT line.c.patch2, plain text]
Serguei Sokol
2017-Jun-01 09:40 UTC
[Rd] stats::line() does not produce correct Tukey line when n mod 6 is 2 or 3
Le 31/05/2017 ? 22:00, Martin Maechler a ?crit :>>>>>> Serguei Sokol <sokol at insa-toulouse.fr> >>>>>> on Wed, 31 May 2017 18:46:34 +0200 writes: > > Le 31/05/2017 ? 17:30, Serguei Sokol a ?crit : > >> > >> More thorough reading revealed that I have overlooked this phrase in the > >> line's doc: "left and right /thirds/ of the data" (emphasis is mine). > > Oops. I have read the first ref returned by google and it happened to be > > tibco's doc, not the R's one. The layout is very similar hence my mistake. > > The latter does not mention "thirds" but ... > > Anyway, here is a new line's patch which still gives a result slightly different > > form MMline(). The slope is the same but not the intercept. > > What are the exact terms for intercept calculation that should be implemented? > > > Serguei. > > Sorry Serguei, I have new version of line.c since yesterday, > and will not be disturbed anymore. > > Note that I *did* give the litterature, and it seems most > discussants don't have paper books in physical libraries anymore; > In this case, interestingly, you need one of those I think - > almost everything I found online did not have the exact details.Fortunately, you keep old good habits regarding paper books ;)> Peter Dalgaard definitely was right that Tukey did not use > quantiles at all, and notably did *not* define the three groups > via {i; x_i <= x_L} and {i; x_i >= X_R} which (as I think > you noticed) may make the groups quite unbalanced in case of duplicated x's. > > But then, for now I had decided to fix the bug (namely computing > the x-medians wrongly as you diagnosed correctly(!) -- but your > first 2 patches only fixed partly) *and* go at least one step in > the direction of Tukey's original, namely by allowing iteration via a new 'iter' argument.Hm, I did not use iterations. A newly introduced indx is used to keep index permutation when x is sorted.> I have also updated the help page to document what line() has > been computing all these years {apart from the bug which > typically shows for non-equidistant x[]}.You mean "non equally sized"? (bis ;) )> We could also consider to eventually add a new 'method = <string>' > argument to line() one version of which would continue to > compute the current solution,If the current solution is considered as plainly wrong, why to continue to implement it? Unless "by current version" you mean your implementation equivalent to my patch2 which fixes group sizes.> another would compute the one > corresponding to Velleman & Hoaglin (1981)'s FORTRAN > implementation (which had to be corrected for some infinite-loop > cases!)... not in the close future thoughWhat would be the interest of this fortran version? Faster? More accurate?> Given all this discussions here, I think I should commit what I > currently have ASAP.+1. Serguei.
Seemingly Similar Threads
- stats::line() does not produce correct Tukey line when n mod 6 is 2 or 3
- stats::line() does not produce correct Tukey line when n mod 6 is 2 or 3
- stats::line() does not produce correct Tukey line when n mod 6 is 2 or 3
- stats::line() does not produce correct Tukey line when n mod 6 is 2 or 3
- stats::line() does not produce correct Tukey line when n mod 6 is 2 or 3