thr3ads.net - R devel - [Rd] stats::line() does not produce correct Tukey line when n mod 6 is 2 or 3 [May 2017]

If this information is useful, please help other people find it:
Share via:

Joris Meys

2017-May-31 14:39 UTC

[Rd] stats::line() does not produce correct Tukey line when n mod 6 is 2 or 3

Seriously, if a method gives a wrong result, it's wrong. line() does NOT
implement the algorithm of Tukey, even not after the patch. We're not
discussing Excel here, are we?

The method of Tukey is rather clear, and it is NOT using the default
quantile definition from the quantile function. Actually, it doesn't even
use quantiles to define the groups. It just says that the groups should be
more or less equally spaced. As the method of Tukey relies on the medians
of the subgroups, it would make sense to pick a method that is
approximately unbiased with regard to the median. That would be type 8
imho.

To get the size of the outer groups, Tukey would've been more than happy
enough with a:
> floor(length(dfr$time) / 3)[1] 6

There you have the size of your left and right group, and now we can
discuss about which median type should be used for the robust fitting.

But I can honestly not understand why anyone in his right mind would defend
a method that is clearly wrong while not working at Microsoft's spreadsheet
department.

Cheers
Joris

On Wed, May 31, 2017 at 4:03 PM, Serguei Sokol <sokol at insa-toulouse.fr>
wrote:
> Le 31/05/2017 ? 15:40, Joris Meys a ?crit :
>
>> OTOH,
>>
>> > sapply(1:9, function(i){
>> +   sum(dfr$time <= quantile(dfr$time, 1./3., type = i))
>> + })
>> [1] 8 8 6 6 6 6 8 6 6
>>
>> Only the default (type = 7) and the first two types give the result
>> lines() gives now. I think there is plenty of reasons to give why any
of
>> the other 6 types might be better suited in Tukey's method.
>>
>> So to my mind, chaning the definition of line() to give sensible output
>> that is in accordance with the theory, does not imply any inconsistency
>> with the quantile definition in R. At least not with 6 out of the 9
>> different ones ;-)
>>
> Nice shot.
> But OTOE (on the other end ;)
> > sapply(1:9, function(i){
> +   sum(dfr$time >= quantile(dfr$time, 2./3., type = i))
> + })
> [1] 8 8 8 8 6 6 8 6 6
>
> Here "8" gains 5 votes against 4 for "6". There were
two defector methods
> that changed the point number and should be discarded. Which leaves us
> with the score 3:4, still in favor of "6" but the default method
should
> prevail
> in my sens.
>
> Serguei.
>

-- 
Joris Meys
Statistical consultant

Ghent University
Faculty of Bioscience Engineering
Department of Mathematical Modelling, Statistics and Bio-Informatics

tel :  +32 (0)9 264 61 79
Joris.Meys at Ugent.be
-------------------------------
Disclaimer : http://helpdesk.ugent.be/e-maildisclaimer.php

	[[alternative HTML version deleted]]

Joris Meys

2017-May-31 14:40 UTC

head link

[Rd] stats::line() does not produce correct Tukey line when n mod 6 is 2 or 3

And with "equally spaced" I obviously meant "of equal size".
It's getting
too hot in the office here...

On Wed, May 31, 2017 at 4:39 PM, Joris Meys <jorismeys at gmail.com>
wrote:
> Seriously, if a method gives a wrong result, it's wrong. line() does
NOT
> implement the algorithm of Tukey, even not after the patch. We're not
> discussing Excel here, are we?
>
> The method of Tukey is rather clear, and it is NOT using the default
> quantile definition from the quantile function. Actually, it doesn't
even
> use quantiles to define the groups. It just says that the groups should be
> more or less equally spaced. As the method of Tukey relies on the medians
> of the subgroups, it would make sense to pick a method that is
> approximately unbiased with regard to the median. That would be type 8
> imho.
>
> To get the size of the outer groups, Tukey would've been more than
happy
> enough with a:
>
> > floor(length(dfr$time) / 3)
> [1] 6
>
> There you have the size of your left and right group, and now we can
> discuss about which median type should be used for the robust fitting.
>
> But I can honestly not understand why anyone in his right mind would
> defend a method that is clearly wrong while not working at Microsoft's
> spreadsheet department.
>
> Cheers
> Joris
>
> On Wed, May 31, 2017 at 4:03 PM, Serguei Sokol <sokol at
insa-toulouse.fr>
> wrote:
>
>> Le 31/05/2017 ? 15:40, Joris Meys a ?crit :
>>
>>> OTOH,
>>>
>>> > sapply(1:9, function(i){
>>> +   sum(dfr$time <= quantile(dfr$time, 1./3., type = i))
>>> + })
>>> [1] 8 8 6 6 6 6 8 6 6
>>>
>>> Only the default (type = 7) and the first two types give the result
>>> lines() gives now. I think there is plenty of reasons to give why
any of
>>> the other 6 types might be better suited in Tukey's method.
>>>
>>> So to my mind, chaning the definition of line() to give sensible
output
>>> that is in accordance with the theory, does not imply any
inconsistency
>>> with the quantile definition in R. At least not with 6 out of the 9
>>> different ones ;-)
>>>
>> Nice shot.
>> But OTOE (on the other end ;)
>> > sapply(1:9, function(i){
>> +   sum(dfr$time >= quantile(dfr$time, 2./3., type = i))
>> + })
>> [1] 8 8 8 8 6 6 8 6 6
>>
>> Here "8" gains 5 votes against 4 for "6". There
were two defector methods
>> that changed the point number and should be discarded. Which leaves us
>> with the score 3:4, still in favor of "6" but the default
method should
>> prevail
>> in my sens.
>>
>> Serguei.
>>
>
>
>
> --
> Joris Meys
> Statistical consultant
>
> Ghent University
> Faculty of Bioscience Engineering
> Department of Mathematical Modelling, Statistics and Bio-Informatics
>
> tel :  +32 (0)9 264 61 79 <+32%209%20264%2061%2079>
> Joris.Meys at Ugent.be
> -------------------------------
> Disclaimer : http://helpdesk.ugent.be/e-maildisclaimer.php
>


-- 
Joris Meys
Statistical consultant

Ghent University
Faculty of Bioscience Engineering
Department of Mathematical Modelling, Statistics and Bio-Informatics

tel :  +32 (0)9 264 61 79
Joris.Meys at Ugent.be
-------------------------------
Disclaimer : http://helpdesk.ugent.be/e-maildisclaimer.php

	[[alternative HTML version deleted]]

peter dalgaard

2017-May-31 14:57 UTC

head link

[Rd] stats::line() does not produce correct Tukey line when n mod 6 is 2 or 3

> On 31 May 2017, at 16:40 , Joris Meys <jorismeys at gmail.com> wrote:
> 
> And with "equally spaced" I obviously meant "of equal
size". It's getting
> too hot in the office here...
We have a fair amount of cool westerly wind up here that I could transfer to you
via  WWTP (Wind and Weather Transport Protocol). If you open up a sufficiently
large pipe, that is.

Anyways, in the past we have tried to follow Tukey's instructions on details
like the definition of the "hinges" on boxplots, so presumably we
should try and do likewise for this case.

I suspect that Tukey would say "divide the data into three roughly
equal-sized groups" or some such. The obvious thing to do would be to
allocate N %/% 3 to each group and then the N %% 3 remaining symmetrically and
as evenly as possible, which in my book would rather be (1,0,1) than (0, 2, 0)
for the case N %% 3 == 2. If  N %% 3 == 1, there is no alternative to (0, 1, 0)
by this logic.
> 
> On Wed, May 31, 2017 at 4:39 PM, Joris Meys <jorismeys at gmail.com>
wrote:
> 
>> Seriously, if a method gives a wrong result, it's wrong. line()
does NOT
>> implement the algorithm of Tukey, even not after the patch. We're
not
>> discussing Excel here, are we?
>> 
>> The method of Tukey is rather clear, and it is NOT using the default
>> quantile definition from the quantile function. Actually, it
doesn't even
>> use quantiles to define the groups. It just says that the groups should
be
>> more or less equally spaced. As the method of Tukey relies on the
medians
>> of the subgroups, it would make sense to pick a method that is
>> approximately unbiased with regard to the median. That would be type 8
>> imho.
>> 
>> To get the size of the outer groups, Tukey would've been more than
happy
>> enough with a:
>> 
>>> floor(length(dfr$time) / 3)
>> [1] 6
>> 
>> There you have the size of your left and right group, and now we can
>> discuss about which median type should be used for the robust fitting.
>> 
>> But I can honestly not understand why anyone in his right mind would
>> defend a method that is clearly wrong while not working at
Microsoft's
>> spreadsheet department.
>> 
>> Cheers
>> Joris
>> 
>> On Wed, May 31, 2017 at 4:03 PM, Serguei Sokol <sokol at
insa-toulouse.fr>
>> wrote:
>> 
>>> Le 31/05/2017 ? 15:40, Joris Meys a ?crit :
>>> 
>>>> OTOH,
>>>> 
>>>>> sapply(1:9, function(i){
>>>> +   sum(dfr$time <= quantile(dfr$time, 1./3., type = i))
>>>> + })
>>>> [1] 8 8 6 6 6 6 8 6 6
>>>> 
>>>> Only the default (type = 7) and the first two types give the
result
>>>> lines() gives now. I think there is plenty of reasons to give
why any of
>>>> the other 6 types might be better suited in Tukey's method.
>>>> 
>>>> So to my mind, chaning the definition of line() to give
sensible output
>>>> that is in accordance with the theory, does not imply any
inconsistency
>>>> with the quantile definition in R. At least not with 6 out of
the 9
>>>> different ones ;-)
>>>> 
>>> Nice shot.
>>> But OTOE (on the other end ;)
>>>> sapply(1:9, function(i){
>>> +   sum(dfr$time >= quantile(dfr$time, 2./3., type = i))
>>> + })
>>> [1] 8 8 8 8 6 6 8 6 6
>>> 
>>> Here "8" gains 5 votes against 4 for "6". There
were two defector methods
>>> that changed the point number and should be discarded. Which leaves
us
>>> with the score 3:4, still in favor of "6" but the default
method should
>>> prevail
>>> in my sens.
>>> 
>>> Serguei.
>>> 
>> 
>> 
>> 
>> --
>> Joris Meys
>> Statistical consultant
>> 
>> Ghent University
>> Faculty of Bioscience Engineering
>> Department of Mathematical Modelling, Statistics and Bio-Informatics
>> 
>> tel :  +32 (0)9 264 61 79 <+32%209%20264%2061%2079>
>> Joris.Meys at Ugent.be
>> -------------------------------
>> Disclaimer : http://helpdesk.ugent.be/e-maildisclaimer.php
>> 
> 
> 
> 
> -- 
> Joris Meys
> Statistical consultant
> 
> Ghent University
> Faculty of Bioscience Engineering
> Department of Mathematical Modelling, Statistics and Bio-Informatics
> 
> tel :  +32 (0)9 264 61 79
> Joris.Meys at Ugent.be
> -------------------------------
> Disclaimer : http://helpdesk.ugent.be/e-maildisclaimer.php
> 
> 	[[alternative HTML version deleted]]
> 
> ______________________________________________
> R-devel at r-project.org mailing list
> https://stat.ethz.ch/mailman/listinfo/r-devel
-- 
Peter Dalgaard, Professor,
Center for Statistics, Copenhagen Business School
Solbjerg Plads 3, 2000 Frederiksberg, Denmark
Phone: (+45)38153501
Office: A 4.23
Email: pd.mes at cbs.dk  Priv: PDalgd at gmail.com

Serguei Sokol

2017-May-31 15:30 UTC

head link

[Rd] stats::line() does not produce correct Tukey line when n mod 6 is 2 or 3

Le 31/05/2017 ? 16:39, Joris Meys a ?crit :> Seriously, if a method gives a wrong result, it's wrong.I did not understand why you and others were using term "wrong"
based on something that I was considering as just "different"
implementation.
More thorough reading revealed that I have overlooked this phrase in the
line's doc: "left and right /thirds/ of the data" (emphasis is
mine).

Should I be exiled to Excel department for this sin? That's tough ;)
Serguei.
> line() does NOT implement the algorithm of Tukey, even not after the patch.
We're not discussing Excel here, are we?
>
> The method of Tukey is rather clear, and it is NOT using the default
quantile definition from the quantile function. Actually, it doesn't even
use quantiles
> to define the groups. It just says that the groups should be more or less
equally spaced. As the method of Tukey relies on the medians of the subgroups,
it
> would make sense to pick a method that is approximately unbiased with
regard to the median. That would be type 8 imho.
>
> To get the size of the outer groups, Tukey would've been more than
happy enough with a:
>
> > floor(length(dfr$time) / 3)
> [1] 6
>
> There you have the size of your left and right group, and now we can
discuss about which median type should be used for the robust fitting.
>
> But I can honestly not understand why anyone in his right mind would defend
a method that is clearly wrong while not working at Microsoft's spreadsheet
> department.
>
> Cheers
> Joris
>
> On Wed, May 31, 2017 at 4:03 PM, Serguei Sokol <sokol at
insa-toulouse.fr <mailto:sokol at insa-toulouse.fr>> wrote:
>
>     Le 31/05/2017 ? 15:40, Joris Meys a ?crit :
>
>         OTOH,
>
>         > sapply(1:9, function(i){
>         +   sum(dfr$time <= quantile(dfr$time, 1./3., type = i))
>         + })
>         [1] 8 8 6 6 6 6 8 6 6
>
>         Only the default (type = 7) and the first two types give the result
lines() gives now. I think there is plenty of reasons to give why any of the
other
>         6 types might be better suited in Tukey's method.
>
>         So to my mind, chaning the definition of line() to give sensible
output that is in accordance with the theory, does not imply any inconsistency
with
>         the quantile definition in R. At least not with 6 out of the 9
different ones ;-)
>
>     Nice shot.
>     But OTOE (on the other end ;)
>     > sapply(1:9, function(i){
>     +   sum(dfr$time >= quantile(dfr$time, 2./3., type = i))
>     + })
>     [1] 8 8 8 8 6 6 8 6 6
>
>     Here "8" gains 5 votes against 4 for "6". There
were two defector methods
>     that changed the point number and should be discarded. Which leaves us
>     with the score 3:4, still in favor of "6" but the default
method should prevail
>     in my sens.
>
>     Serguei.
>
>
>
>
> -- 
> Joris Meys
> Statistical consultant
>
> Ghent University
> Faculty of Bioscience Engineering
> Department of Mathematical Modelling, Statistics and Bio-Informatics
>
> tel :  +32 (0)9 264 61 79
> Joris.Meys at Ugent.be
> -------------------------------
> Disclaimer : http://helpdesk.ugent.be/e-maildisclaimer.php

-- 
Serguei Sokol
Ingenieur de recherche INRA
Metabolisme Integre et Dynamique des Systemes Metaboliques (MetaSys)

LISBP, INSA/INRA UMR 792, INSA/CNRS UMR 5504
135 Avenue de Rangueil
31077 Toulouse Cedex 04

tel: +33 5 6155 9276
fax: +33 5 6704 8825
email: sokol at insa-toulouse.fr
http://metasys.insa-toulouse.fr
http://www.lisbp.fr

Serguei Sokol

2017-May-31 16:46 UTC

head link

[Rd] stats::line() does not produce correct Tukey line when n mod 6 is 2 or 3

Le 31/05/2017 ? 17:30, Serguei Sokol a ?crit :>
> More thorough reading revealed that I have overlooked this phrase in the
> line's doc: "left and right /thirds/ of the data" (emphasis
is mine).Oops. I have read the first ref returned by google and it happened to be
tibco's doc, not the R's one. The layout is very similar hence my
mistake.
The latter does not mention "thirds" but ...
Anyway, here is a new line's patch which still gives a result slightly
different
form MMline(). The slope is the same but not the intercept.
What are the exact terms for intercept calculation that should be implemented?

Serguei.
-------------- next part --------------
--- line.c.orig	2016-03-17 00:03:03.000000000 +0100
+++ line.c	2017-05-31 18:24:55.330030280 +0200
@@ -17,7 +17,7 @@
  *  https://www.R-project.org/Licenses/
  */
 
-#include <R_ext/Utils.h>	/* R_rsort() */
+#include <R_ext/Utils.h>    /* R_rsort() */
 #include <math.h>
 
 #include <Rinternals.h>
@@ -25,8 +25,8 @@
 
 /* Speed up by `inlining' these (as macros) [since R version 1.2] : */
 #if 1
-#define il(n,x)	(int)floor((n - 1) * x)
-#define iu(n,x)	(int) ceil((n - 1) * x)
+#define il(n,x) (int)floor(((n) - 1) * (x))
+#define iu(n,x) (int) ceil(((n) - 1) * (x))
 
 #else
 static int il(int n, double x)
@@ -41,71 +41,68 @@
 #endif
 
 static void line(double *x, double *y, /* input (x[i],y[i])s */
-		 double *z, double *w, /* work and output: resid. & fitted */
-		 /* all the above of length */ int n,
-		 double coef[2])
+         double *z, double *w, /* work and output: resid. & fitted */
+         int *indx, /* to get thirds of points independently of repeated values
*/
+         /* all the above of length */ int n,
+         double coef[2])
 {
     int i, j, k;
     double xb, x1, x2, xt, yt, yb, tmp1, tmp2;
     double slope, yint;
 
     for(i = 0 ; i < n ; i++) {
-	z[i] = x[i];
-	w[i] = y[i];
+        z[i] = x[i];
+        w[i] = y[i];
+        indx[i] = i;
     }
-    R_rsort(z, n);/* z = ordered abscissae */
+    rsort_with_index(z, indx, n);/* z = ordered abscissae */
 
-    tmp1 = z[il(n, 1./6.)];
-    tmp2 = z[iu(n, 1./6.)];	xb = 0.5*(tmp1+tmp2);
-
-    tmp1 = z[il(n, 2./6.)];
-    tmp2 = z[iu(n, 2./6.)];	x1 = 0.5*(tmp1+tmp2);
-
-    tmp1 = z[il(n, 4./6.)];
-    tmp2 = z[iu(n, 4./6.)];	x2 = 0.5*(tmp1+tmp2);
-
-    tmp1 = z[il(n, 5./6.)];
-    tmp2 = z[iu(n, 5./6.)];	xt = 0.5*(tmp1+tmp2);
+    k=(n+1)/3;
+    tmp1 = z[il(k, 0.5)];
+    tmp2 = z[iu(k, 0.5)];
+    /* xb := Median(first third of x) */
+    xb = 0.5*(tmp1+tmp2);
+
+    tmp1 = z[n-k+il(k, 0.5)];
+    tmp2 = z[n-k+iu(k, 0.5)];
+    /* xt := Median(last third of x) */
+    xt = 0.5*(tmp1+tmp2);
 
     slope = 0.;
 
     for(j = 1 ; j <= 1 ; j++) {
-	/* yb := Median(y[i]; x[i] <= quantile(x, 1/3) */
-	k = 0;
-	for(i = 0 ; i < n ; i++)
-	    if(x[i] <= x1)
-		z[k++] = w[i];
-	R_rsort(z, k);
-	yb = 0.5 * (z[il(k, 0.5)] + z[iu(k, 0.5)]);
-
-	/* yt := Median(y[i]; x[i] >= quantile(x, 2/3) */
-	k = 0;
-	for(i = 0 ; i < n ; i++)
-	    if(x[i] >= x2)
-		z[k++] = w[i];
-	R_rsort(z,k);
-	yt = 0.5 * (z[il(k, 0.5)] + z[iu(k, 0.5)]);
-
-	slope += (yt - yb)/(xt - xb);
-	for(i = 0 ; i < n ; i++) {
-	    z[i] = y[i] - slope*x[i];
-	    /* never used: w[i] = z[i]; */
-	}
-	R_rsort(z,n);
-	yint = 0.5 * (z[il(n, 0.5)] + z[iu(n, 0.5)]);
+        /* yb := Median(y[i]; x[i] in first third of x) */
+        for(i = 0 ; i < k ; i++)
+            z[i] = w[indx[i]];
+        R_rsort(z, k);
+        yb = 0.5 * (z[il(k, 0.5)] + z[iu(k, 0.5)]);
+
+        /* yt := Median(y[i]; x[i] in last third of x */
+        for(i = 0 ; i < k ; i++)
+            z[i] = w[indx[n-k+i]];
+        R_rsort(z,k);
+        yt = 0.5 * (z[il(k, 0.5)] + z[iu(k, 0.5)]);
+
+        slope += (yt - yb)/(xt - xb);
+        for(i = 0 ; i < n ; i++) {
+            z[i] = y[i] - slope*x[i];
+            /* never used: w[i] = z[i]; */
+        }
+        R_rsort(z,n);
+        yint = 0.5 * (z[il(n, 0.5)] + z[iu(n, 0.5)]);
     }
     for( i = 0 ; i < n ; i++ ) {
-	w[i] = yint + slope*x[i];
-	z[i] = y[i] - w[i];
+        w[i] = yint + slope*x[i];
+        z[i] = y[i] - w[i];
     }
     coef[0] = yint;
     coef[1] = slope;
 }
 
-void tukeyline0(double *x, double *y, double *z, double *w, int *n,
-	       double *coef)
+void tukeyline0(double *x, double *y, double *z, double *w, int *indx, int *n,
+           double *coef)
 {
-    line(x, y, z, w, *n, coef);
+    line(x, y, z, w, indx, *n, coef);
 }
 
 SEXP tukeyline(SEXP x, SEXP y, SEXP call)
@@ -127,7 +124,8 @@
     SET_VECTOR_ELT(ans, 2, res);
     SEXP fit = allocVector(REALSXP, n);
     SET_VECTOR_ELT(ans, 3, fit);
-    line(REAL(x), REAL(y), REAL(res), REAL(fit), n, REAL(coef));
+    SEXP indx = allocVector(INTSXP, n);
+    line(REAL(x), REAL(y), REAL(res), REAL(fit), INTEGER(indx), n, REAL(coef));
     UNPROTECT(1);
     return ans;
 }

Possibly Parallel Threads

Search for more reasonably related threads

R devel - May 2017 - stats::line() does not produce correct Tukey line when n mod 6 is 2 or 3

[Rd] stats::line() does not produce correct Tukey line when n mod 6 is 2 or 3

[Rd] stats::line() does not produce correct Tukey line when n mod 6 is 2 or 3

[Rd] stats::line() does not produce correct Tukey line when n mod 6 is 2 or 3

[Rd] stats::line() does not produce correct Tukey line when n mod 6 is 2 or 3

[Rd] stats::line() does not produce correct Tukey line when n mod 6 is 2 or 3

Possibly Parallel Threads