thr3ads.net - R devel - [Rd] stats::line() does not produce correct Tukey line when n mod 6 is 2 or 3 [May 2017]

Joris Meys

2017-May-31 13:40 UTC

[Rd] stats::line() does not produce correct Tukey line when n mod 6 is 2 or 3

OTOH,
> sapply(1:9, function(i){
+   sum(dfr$time <= quantile(dfr$time, 1./3., type = i))
+ })
[1] 8 8 6 6 6 6 8 6 6

Only the default (type = 7) and the first two types give the result line()
gives now. I think there are plenty of reasons why any of the other 6 types
might be better suited to Tukey's method.
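The comparison above can be reproduced in outline as follows. This is only a sketch: the dfr data frame from the SO post is not included in this thread, so the vector below is made-up data with 19 observations and ties, and count_lower_third() is a hypothetical helper name. The counts it prints will not necessarily match the ones quoted above.

```r
# Hypothetical data: 19 observations with ties.
# This is NOT the dfr$time vector from the SO post.
time <- c(1, 1, 2, 2, 2, 3, 3, 3, 4, 5, 6, 6, 7, 7, 7, 8, 8, 9, 9)

# For each of the nine quantile types, count the points that lie
# at or below the 1/3 quantile computed with that type.
count_lower_third <- function(x) {
  vapply(1:9, function(i) {
    sum(x <= quantile(x, 1/3, type = i, names = FALSE))
  }, integer(1))
}

counts <- count_lower_third(time)
names(counts) <- paste0("type", 1:9)
counts
```

With ties in the data, different types can place the cut point on either side of a run of equal values, which is how the same sample can yield different group sizes.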

So to my mind, changing the definition of line() to give sensible output
that is in accordance with the theory does not imply any inconsistency
with the quantile definition in R. At least not with 6 out of the 9
different ones ;-)

Cheers
Joris

On Wed, May 31, 2017 at 3:06 PM, Serguei Sokol <sokol at insa-toulouse.fr>
wrote:
> On 30/05/2017 at 18:51, Martin Maechler wrote:
>
>> Serguei Sokol <sokol at insa-toulouse.fr>
>>      on Tue, 30 May 2017 16:01:17 +0200 writes:
>>
>>      > On 30/05/2017 at 09:33, Martin Maechler wrote: ...
>>      >> However, even after the patch, the example from the SO
>>      >> post differs from the result of Richie Cotton's
>>      >> function...
>>      > The explanation is quite simple. In the SO function, the
>>      > first third of the example used contains 6 points (of 19 in
>>      > total), while line()'s definition of the quantile leads to 8
>>      > points. The same numbers (6 and 8) apply at the other end of
>>      > the sample.
>>
>> so the number of obs. for the three thirds for line() are
>>     {8, 3, 8}  in line()  [also, after your patch, right?]
>>
>> whereas in MMline() they are as they should be, namely
>>
>>     {6, 7, 6}
>>
>> But the  {8, 3, 8}  split is not at all what all "the literature",
>> including Tukey himself, says "should" be done.
>> (Other literature on the topic suggests that the optimal sizes
>>   of the split into three groups depend on the distribution of x ..)
>>
>> OTOH, MMline() does exactly what "the literature" and also the
>> reference on the  ?line  help page say.
>>
> Well, what I have seen so far in the "literature" was a mention of 1/3
> quantiles (but yes, I could have overlooked something, as I did not spend
> too much time on it). So the distribution of the sample into three groups
> boils down to which particular quantile definition is used. It turns out
> that line()'s version (you are right, _after_ the patch, but my patch left
> this definition untouched) is consistent with R's own.
> If you run sum(dfr$time <= quantile(dfr$time, 1./3.)) in R, you get 8, not
> 6 (and the same at the 2/3 end).
> To my mind, consistency with the rest of R, namely with the quantile
> definition, is a good enough argument to leave line()'s definition as is.
>
> Serguei.


-- 
Joris Meys
Statistical consultant

Ghent University
Faculty of Bioscience Engineering
Department of Mathematical Modelling, Statistics and Bio-Informatics

tel :  +32 (0)9 264 61 79
Joris.Meys at Ugent.be

Serguei Sokol

2017-May-31 14:03 UTC

On 31/05/2017 at 15:40, Joris Meys wrote:
> OTOH,
>
> > sapply(1:9, function(i){
> +   sum(dfr$time <= quantile(dfr$time, 1./3., type = i))
> + })
> [1] 8 8 6 6 6 6 8 6 6
>
> Only the default (type = 7) and the first two types give the result line()
> gives now. I think there are plenty of reasons why any of the other 6 types
> might be better suited to Tukey's method.
>
> So to my mind, changing the definition of line() to give sensible output
> that is in accordance with the theory does not imply any inconsistency with
> the quantile definition in R. At least not with 6 out of the 9 different
> ones ;-)

Nice shot.
But OTOE (on the other end ;)
> sapply(1:9, function(i){
+   sum(dfr$time >= quantile(dfr$time, 2./3., type = i))
+ })
[1] 8 8 8 8 6 6 8 6 6

Here "8" gains 5 votes against 4 for "6". There were two defector methods
that changed the point count and should be discarded. Which leaves us
with a score of 3:4, still in favor of "6", but the default method should
prevail in my opinion.
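The tally above can be checked directly from the two count vectors quoted in this thread. The sketch below hard-codes those reported counts rather than recomputing them, since dfr itself is not shown here; the variable names are illustrative only.

```r
# Counts reported in this thread, indexed by quantile type 1..9.
lower <- c(8, 8, 6, 6, 6, 6, 8, 6, 6)  # points <= the 1/3 quantile
upper <- c(8, 8, 8, 8, 6, 6, 8, 6, 6)  # points >= the 2/3 quantile

# Votes at the upper end: how many types give 6 and how many give 8.
table(upper)

# Types whose count differs between the two ends (the "defectors").
defectors <- which(lower != upper)
defectors

# Votes among the types that give the same count at both ends.
table(upper[-defectors])
```

The defectors are types 3 and 4. Dropping them leaves three types giving 8 and four giving 6, which is the 3:4 score mentioned above.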

Serguei.

Joris Meys

2017-May-31 14:39 UTC

Seriously, if a method gives a wrong result, it's wrong. line() does NOT
implement the algorithm of Tukey, not even after the patch. We're not
discussing Excel here, are we?

The method of Tukey is rather clear, and it is NOT using the default
quantile definition from the quantile function. Actually, it doesn't even
use quantiles to define the groups. It just says that the groups should be
more or less equally spaced. As the method of Tukey relies on the medians
of the subgroups, it would make sense to pick a method that is
approximately unbiased with regard to the median. That would be type 8
imho.

To get the size of the outer groups, Tukey would've been more than happy
with a:
> floor(length(dfr$time) / 3)
[1] 6

There you have the size of your left and right group, and now we can
discuss which median type should be used for the robust fitting.
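As a minimal sketch of the sizing rule just described, assuming n = 19 as in the example discussed in this thread: tukey_group_sizes() is a hypothetical helper, not part of the stats package, and it simply gives whatever is left over after the two outer groups to the middle group. Other conventions for distributing the remainder exist and are not covered here.

```r
# Hypothetical helper, not part of the stats package.
# Outer groups get floor(n / 3) points each; the middle gets the rest.
tukey_group_sizes <- function(n) {
  outer <- floor(n / 3)
  c(left = outer, middle = n - 2 * outer, right = outer)
}

n <- 19
sizes <- tukey_group_sizes(n)
sizes

# Positions of the outer groups within the sorted x values.
left_idx  <- seq_len(sizes[["left"]])
right_idx <- seq(n - sizes[["right"]] + 1, n)
```

For n = 19 this gives the 6, 7, 6 split mentioned earlier in the thread, with the left group at positions 1 to 6 and the right group at positions 14 to 19 of the sorted data.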

But I can honestly not understand why anyone in his right mind would defend
a method that is clearly wrong while not working at Microsoft's spreadsheet
department.

Cheers
Joris

On Wed, May 31, 2017 at 4:03 PM, Serguei Sokol <sokol at insa-toulouse.fr>
wrote:
> On 31/05/2017 at 15:40, Joris Meys wrote:
>
>> OTOH,
>>
>> > sapply(1:9, function(i){
>> +   sum(dfr$time <= quantile(dfr$time, 1./3., type = i))
>> + })
>> [1] 8 8 6 6 6 6 8 6 6
>>
>> Only the default (type = 7) and the first two types give the result
>> line() gives now. I think there are plenty of reasons why any of
>> the other 6 types might be better suited to Tukey's method.
>>
>> So to my mind, changing the definition of line() to give sensible output
>> that is in accordance with the theory does not imply any inconsistency
>> with the quantile definition in R. At least not with 6 out of the 9
>> different ones ;-)
>>
> Nice shot.
> But OTOE (on the other end ;)
> > sapply(1:9, function(i){
> +   sum(dfr$time >= quantile(dfr$time, 2./3., type = i))
> + })
> [1] 8 8 8 8 6 6 8 6 6
>
> Here "8" gains 5 votes against 4 for "6". There were two defector methods
> that changed the point count and should be discarded. Which leaves us
> with a score of 3:4, still in favor of "6", but the default method should
> prevail in my opinion.
>
> Serguei.
>
