thr3ads.net - R devel - [Rd] Inconsistencies in wilcox.test [Dec 2019]

If this information is useful, please help other people find it:
Share via:

Karolis Koncevičius

2019-Dec-07 21:18 UTC

[Rd] Inconsistencies in wilcox.test

Thank you for a fast response. Nice to see this mailing list being so 
alive.

Regarding Inf issue: I agree with your assessment that Inf should not be 
removed. The code gave me an impression that Inf values were 
intentionally removed (since is.finite() was used everywhere, except for 
paired case). I will try to adjust my patch according to your feedback.

One more thing: it seems like you assumed that issues 2:4 are all 
related to machine precision, which is not the case - only 2nd issue is.
Just wanted to draw this to your attention, in case you might have some 
feedback and guidelines about issues 3 and 4 as well.



On 2019-12-07 21:59, Martin Maechler wrote:>>>>>> Karolis Koncevi?ius
>>>>>>     on Sat, 7 Dec 2019 20:55:36 +0200 writes:
>
>    > Hello,
>    > Writing to share some things I've found about wilcox.test()
that seem a
>    > a bit inconsistent.
>
>    > 1. Inf values are not removed if paired=TRUE
>
>    > # returns different results (Inf is removed):
>    > wilcox.test(c(1,2,3,4), c(0,9,8,7))
>    > wilcox.test(c(1,2,3,4), c(0,9,8,Inf))
>
>    > # returns the same result (Inf is left as value with highest rank):
>    > wilcox.test(c(1,2,3,4), c(0,9,8,7), paired=TRUE)
>    > wilcox.test(c(1,2,3,4), c(0,9,8,Inf), paired=TRUE)
>
>Now which of the two cases do you consider correct ?
>
>IHMO, the 2nd one is correct: it is exactly one property of the
>*robustness* of the wilcoxon test and very desirable that any
>(positive) outlier is treated the same as just "the largest
>value" and the test statistic (and hence the p-value) should not
>change.
>
>So I think the first case is wrong, notably if modified, (such
>that the last  y  is the overall maximal value (slightly larger sample):
>
>> wilcox.test(1:7, 1/8+ c(9:4, 12))
>
>	Wilcoxon rank sum test
>
>data:  1:7 and 1/8 + c(9:4, 12)
>W = 6, p-value = 0.01748
>alternative hypothesis: true location shift is not equal to 0
>
>> wilcox.test(1:7, 1/8+ c(9:4, 10000))
>
>	Wilcoxon rank sum test
>
>data:  1:7 and 1/8 + c(9:4, 10000)
>W = 6, p-value = 0.01748
>alternative hypothesis: true location shift is not equal to 0
>
>> wilcox.test(1:7, 1/8+ c(9:4, Inf))
>
>	Wilcoxon rank sum test
>
>data:  1:7 and 1/8 + c(9:4, Inf)
>W = 6, p-value = 0.03497
>alternative hypothesis: true location shift is not equal to 0
>
>The  Inf  case should definitely give the same as the  10'000 case.
>That's exactly one property of a robust statistic.
>
>Thank you, Karolis, this is pretty embarrassing to only be detected now
after 25+ years of R in use ...
>
>The correct fix starts with replacing the  is.finite()  by   !is.na() and
keep the 'Inf' in the rank computations...
>(but then probably also deal with the case of more than one Inf, notably the
Inf - Inf  "exception" which is not triggered by your example...)
>
>
>---
>
>Ben addressed the "rounding" / numerical issues unavoidable for
>the other problems.
>
>    > 2. tolerance issues with paired=TRUE.
>
>    > wilcox.test(c(4, 3, 2), c(3, 2, 1), paired=TRUE)
>    > # ...
>    > # Warning:  cannot compute exact p-value with ties
>
>    > wilcox.test(c(0.4,0.3,0.2), c(0.3,0.2,0.1), paired=TRUE)
>    > # ...
>    > # no warning
>
>    > 3. Always 'x observations are missing' when paired=TRUE
>
>    > wilcox.test(c(1,2), c(NA_integer_,NA_integer_), paired=TRUE)
>    > # ...
>    > # Error:  not enough (finite) 'x' observations
>
>    > 4. No indication if normal approximation was used:
>
>    > # different numbers, but same "method" name
>    > wilcox.test(rnorm(10), exact=FALSE, correct=FALSE)
>    > wilcox.test(rnorm(10), exact=TRUE, correct=FALSE)
>
>
>    > From all of these I am pretty sure the 1st one is likely
unintended,
>    > so attaching a small patch to adjust it. Can also try patching
others if
>    > consensus is reached that the behavioiur has to be modified.
>
>    > Kind regards,
>    > Karolis Koncevi?ius.
>
>    > ---
>
>    > Index: wilcox.test.R
>    >
==================================================================>    >
--- wilcox.test.R  (revision 77540)
>    > +++ wilcox.test.R  (working copy)
>    > @@ -42,7 +42,7 @@
>    > if(paired) {
>    > if(length(x) != length(y))
>    > stop("'x' and 'y' must have the same
length")
>    > -            OK <- complete.cases(x, y)
>    > +            OK <- is.finite(x) & is.finite(y)
>    > x <- x[OK] - y[OK]
>    > y <- NULL
>    > }
>
>    > ______________________________________________
>    > R-devel at r-project.org mailing list
>    > https://stat.ethz.ch/mailman/listinfo/r-devel

Karolis Koncevičius

2019-Dec-09 21:43 UTC

head link

[Rd] Inconsistencies in wilcox.test

So I tried adding Infinity support for all cases.
And it is (as could be expected) more complicated than I thought.

It is easy to add Inf support for the test. The problems start with
conf.int=TRUE.

Currently confidence intervals are computed via `uniroot()` and, in the 
case of infinities, we are computationally looking for roots over 
infinite interval which results in an error. I suspect this is the 
reason Inf values were removed in the first place.

Just a note, I found a few more errors/inconsistencies when requesting 
confidence intervals with paired=TRUE (due to Infinities being left in).

Current error in Inf-Inf scenario:

wilcox.test(c(1,2,Inf), c(4,8,Inf), paired=TRUE, conf.int=TRUE)
   Error in if (ZEROES) x <- x[x != 0] :
     missing value where TRUE/FALSE needed



NaN confidence intervals:

wilcox.test(c(1:9,Inf), c(21:28,Inf,30), paired=TRUE, conf.int=TRUE)

   Wilcoxon signed rank test with continuity correction

   data:  c(1:9, Inf) and c(21:28, Inf, 30)
   V = 9.5, p-value = 0.0586
   alternative hypothesis: true location shift is not equal to 0
   0 percent confidence interval:
    NaN NaN
    sample estimates:
    midrange
         NaN


The easiest "fix" for consistency would be to simply remove Infinity 
support from the paired=TRUE case.

But going with the more desirable approach of adding Infinity support 
for non-paired cases - it is currently not clear to me what confidence 
intervals and pseudomedian should be. Specially when Infinities are on 
both sides.

Regards,
Karolis Koncevi?ius.


On 2019-12-07 23:18, Karolis Koncevi?ius wrote:>Thank you for a fast response. Nice to see this mailing list being so 
>alive.
>
>Regarding Inf issue: I agree with your assessment that Inf should not 
>be removed. The code gave me an impression that Inf values were 
>intentionally removed (since is.finite() was used everywhere, except 
>for paired case). I will try to adjust my patch according to your 
>feedback.
>
>One more thing: it seems like you assumed that issues 2:4 are all 
>related to machine precision, which is not the case - only 2nd issue 
>is.
>Just wanted to draw this to your attention, in case you might have 
>some feedback and guidelines about issues 3 and 4 as well.
>
>
>
>On 2019-12-07 21:59, Martin Maechler wrote:
>>>>>>>Karolis Koncevi?ius
>>>>>>>    on Sat, 7 Dec 2019 20:55:36 +0200 writes:
>>
>>   > Hello,
>>   > Writing to share some things I've found about wilcox.test()
that seem a
>>   > a bit inconsistent.
>>
>>   > 1. Inf values are not removed if paired=TRUE
>>
>>   > # returns different results (Inf is removed):
>>   > wilcox.test(c(1,2,3,4), c(0,9,8,7))
>>   > wilcox.test(c(1,2,3,4), c(0,9,8,Inf))
>>
>>   > # returns the same result (Inf is left as value with highest
rank):
>>   > wilcox.test(c(1,2,3,4), c(0,9,8,7), paired=TRUE)
>>   > wilcox.test(c(1,2,3,4), c(0,9,8,Inf), paired=TRUE)
>>
>>Now which of the two cases do you consider correct ?
>>
>>IHMO, the 2nd one is correct: it is exactly one property of the
>>*robustness* of the wilcoxon test and very desirable that any
>>(positive) outlier is treated the same as just "the largest
>>value" and the test statistic (and hence the p-value) should not
>>change.
>>
>>So I think the first case is wrong, notably if modified, (such
>>that the last  y  is the overall maximal value (slightly larger sample):
>>
>>>wilcox.test(1:7, 1/8+ c(9:4, 12))
>>
>>	Wilcoxon rank sum test
>>
>>data:  1:7 and 1/8 + c(9:4, 12)
>>W = 6, p-value = 0.01748
>>alternative hypothesis: true location shift is not equal to 0
>>
>>>wilcox.test(1:7, 1/8+ c(9:4, 10000))
>>
>>	Wilcoxon rank sum test
>>
>>data:  1:7 and 1/8 + c(9:4, 10000)
>>W = 6, p-value = 0.01748
>>alternative hypothesis: true location shift is not equal to 0
>>
>>>wilcox.test(1:7, 1/8+ c(9:4, Inf))
>>
>>	Wilcoxon rank sum test
>>
>>data:  1:7 and 1/8 + c(9:4, Inf)
>>W = 6, p-value = 0.03497
>>alternative hypothesis: true location shift is not equal to 0
>>
>>The  Inf  case should definitely give the same as the  10'000 case.
>>That's exactly one property of a robust statistic.
>>
>>Thank you, Karolis, this is pretty embarrassing to only be detected now
after 25+ years of R in use ...
>>
>>The correct fix starts with replacing the  is.finite()  by   !is.na()
and keep the 'Inf' in the rank computations...
>>(but then probably also deal with the case of more than one Inf, notably
the  Inf - Inf  "exception" which is not triggered by your example...)
>>
>>
>>---
>>
>>Ben addressed the "rounding" / numerical issues unavoidable
for
>>the other problems.
>>
>>   > 2. tolerance issues with paired=TRUE.
>>
>>   > wilcox.test(c(4, 3, 2), c(3, 2, 1), paired=TRUE)
>>   > # ...
>>   > # Warning:  cannot compute exact p-value with ties
>>
>>   > wilcox.test(c(0.4,0.3,0.2), c(0.3,0.2,0.1), paired=TRUE)
>>   > # ...
>>   > # no warning
>>
>>   > 3. Always 'x observations are missing' when paired=TRUE
>>
>>   > wilcox.test(c(1,2), c(NA_integer_,NA_integer_), paired=TRUE)
>>   > # ...
>>   > # Error:  not enough (finite) 'x' observations
>>
>>   > 4. No indication if normal approximation was used:
>>
>>   > # different numbers, but same "method" name
>>   > wilcox.test(rnorm(10), exact=FALSE, correct=FALSE)
>>   > wilcox.test(rnorm(10), exact=TRUE, correct=FALSE)
>>
>>
>>   > From all of these I am pretty sure the 1st one is likely
unintended,
>>   > so attaching a small patch to adjust it. Can also try patching
others if
>>   > consensus is reached that the behavioiur has to be modified.
>>
>>   > Kind regards,
>>   > Karolis Koncevi?ius.
>>
>>   > ---
>>
>>   > Index: wilcox.test.R
>>   >
==================================================================>>  
> --- wilcox.test.R  (revision 77540)
>>   > +++ wilcox.test.R  (working copy)
>>   > @@ -42,7 +42,7 @@
>>   > if(paired) {
>>   > if(length(x) != length(y))
>>   > stop("'x' and 'y' must have the same
length")
>>   > -            OK <- complete.cases(x, y)
>>   > +            OK <- is.finite(x) & is.finite(y)
>>   > x <- x[OK] - y[OK]
>>   > y <- NULL
>>   > }
>>
>>   > ______________________________________________
>>   > R-devel at r-project.org mailing list
>>   > https://stat.ethz.ch/mailman/listinfo/r-devel

Martin Maechler

2019-Dec-12 16:20 UTC

head link

[Rd] Inconsistencies in wilcox.test

>>>>> Karolis Koncevi?ius 
>>>>>     on Mon, 9 Dec 2019 23:43:36 +0200 writes:
    > So I tried adding Infinity support for all cases.
    > And it is (as could be expected) more complicated than I thought.

"Of course !"   Thank you, Karolis, in any case!

    > It is easy to add Inf support for the test. The problems start with
conf.int=TRUE.

    > Currently confidence intervals are computed via `uniroot()` and, in the
    > case of infinities, we are computationally looking for roots over 
    > infinite interval which results in an error. I suspect this is the 
    > reason Inf values were removed in the first place.

Maybe. It's still wrong to be done "up front".
I'm sure 98% (or so ;-) of all calls to wilcox.test() do *not*
use  conf.int = TRUE


    > Just a note, I found a few more errors/inconsistencies when requesting 
    > confidence intervals with paired=TRUE (due to Infinities being left
in).

    > Current error in Inf-Inf scenario:

    > wilcox.test(c(1,2,Inf), c(4,8,Inf), paired=TRUE, conf.int=TRUE)
    > Error in if (ZEROES) x <- x[x != 0] :
    > missing value where TRUE/FALSE needed

Good catch .. notably as it also happens when conf.int=FALSE as
by default.
My version of wilcox.test() now does give the same as when the
to 'Inf' are left away.

    > NaN confidence intervals:

    > wilcox.test(c(1:9,Inf), c(21:28,Inf,30), paired=TRUE, conf.int=TRUE)

    > Wilcoxon signed rank test with continuity correction

    > data:  c(1:9, Inf) and c(21:28, Inf, 30)
    > V = 9.5, p-value = 0.0586
    > alternative hypothesis: true location shift is not equal to 0
    > 0 percent confidence interval:
    > NaN NaN
    > sample estimates:
    > midrange
    > NaN

I don't see a big problem here. The NaN's in some sense show the
best that can be computed with this data.  Statistic and
P-value, but no conf.int.


    > The easiest "fix" for consistency would be to simply remove
Infinity
    > support from the paired=TRUE case.

I strongly disagree.  We are not pruning good functionality just
for some definition of consistency.

    > But going with the more desirable approach of adding Infinity support 
    > for non-paired cases - it is currently not clear to me what confidence 
    > intervals and pseudomedian should be. Specially when Infinities are on 
    > both sides.

I deem that not to be a big deal.  They are not defined given
the default formulas and that is reflected by NA / NaN  in those
parts of the result.

    > Regards,
    > Karolis Koncevi?ius.

But I have also spent a few hours now on wilcox.test.default() behavior
notably also looking at the "rounding" / "machine precision"
situation, and also on your remark that the 'method: ...' does
not indicate well enough what was computed.

In my (not yet committed) but hereby proposed  enhancement of wilcox.test(),
I have a new argument,  'digits.rank = Inf'  (the default 'Inf'
corresponding to the current behavior)
with help page documentation:

digits.rank: a number; if finite, ?rank(signif(r, digits.rank))? will
          be used to compute ranks for the test statistic instead of
          (the default) ?rank(r)?.

and then in 'Details :'

     For stability reasons, it may be advisable to use rounded data or
     to set ?digits.rank = 7?, say, such that determination of ties
     does not depend on very small numeric differences (see the
     example).

and then in 'Examples: '

     ## accuracy in ties determination via 'digits.rank':
     wilcox.test( 4:2,      3:1,     paired=TRUE) # Warning:  cannot compute
exact p-value with ties
     wilcox.test((4:2)/10, (3:1)/10, paired=TRUE) # no ties => *no* warning
     wilcox.test((4:2)/10, (3:1)/10, paired=TRUE, digits.rank = 9) # same ties
as (4:2, 3:1)
     
----------------------

Lastly, I propose to replace  "test" by "exact test" in the
'method' component (and print out) in case exact computations
were used.  This information should be part of the returned
"htest" object, and not only visible from the arguments and warnings
that are
printed during the computations.
This last change is in some sense the  "most back-incompatible"
change of these,  because many  wilcox.test()  printouts would
slightly change, e.g., 

  > w0 <- wilcox.test( 1:5, 4*(0:4), paired=TRUE)

	  Wilcoxon signed rank exact test

  data:  1:5 and 4 * (0:4)
  V = 1, p-value = 0.125
  alternative hypothesis: true location shift is not equal to 0

where before (in R <= 3.6.x)  it is just

	  Wilcoxon signed rank test

  data: .........
  ...............
  ...............

but I think R 4.0.0 is a good occasion for such a change.

Constructive feedback on all this is very welcome!
Martin




    > On 2019-12-07 23:18, Karolis Koncevi?ius wrote:
    >> Thank you for a fast response. Nice to see this mailing list being
so
    >> alive.
    >> 
    >> Regarding Inf issue: I agree with your assessment that Inf should
not
    >> be removed. The code gave me an impression that Inf values were 
    >> intentionally removed (since is.finite() was used everywhere,
except
    >> for paired case). I will try to adjust my patch according to your 
    >> feedback.
    >> 
    >> One more thing: it seems like you assumed that issues 2:4 are all 
    >> related to machine precision, which is not the case - only 2nd
issue
    >> is.
    >> Just wanted to draw this to your attention, in case you might have 
    >> some feedback and guidelines about issues 3 and 4 as well.
    >> 
    >> 
    >> 
    >> On 2019-12-07 21:59, Martin Maechler wrote:
    >>>>>>>> Karolis Koncevi?ius
    >>>>>>>> on Sat, 7 Dec 2019 20:55:36 +0200 writes:
    >>> 
    >>> > Hello,
    >>> > Writing to share some things I've found about
wilcox.test() that seem a
    >>> > a bit inconsistent.
    >>> 
    >>> > 1. Inf values are not removed if paired=TRUE
    >>> 
    >>> > # returns different results (Inf is removed):
    >>> > wilcox.test(c(1,2,3,4), c(0,9,8,7))
    >>> > wilcox.test(c(1,2,3,4), c(0,9,8,Inf))
    >>> 
    >>> > # returns the same result (Inf is left as value with
highest rank):
    >>> > wilcox.test(c(1,2,3,4), c(0,9,8,7), paired=TRUE)
    >>> > wilcox.test(c(1,2,3,4), c(0,9,8,Inf), paired=TRUE)
    >>> 
    >>> Now which of the two cases do you consider correct ?
    >>> 
    >>> IHMO, the 2nd one is correct: it is exactly one property of the
    >>> *robustness* of the wilcoxon test and very desirable that any
    >>> (positive) outlier is treated the same as just "the
largest
    >>> value" and the test statistic (and hence the p-value)
should not
    >>> change.
    >>> 
    >>> So I think the first case is wrong, notably if modified, (such
    >>> that the last  y  is the overall maximal value (slightly larger
sample):
    >>> 
    >>>> wilcox.test(1:7, 1/8+ c(9:4, 12))
    >>> 
    >>> Wilcoxon rank sum test
    >>> 
    >>> data:  1:7 and 1/8 + c(9:4, 12)
    >>> W = 6, p-value = 0.01748
    >>> alternative hypothesis: true location shift is not equal to 0
    >>> 
    >>>> wilcox.test(1:7, 1/8+ c(9:4, 10000))
    >>> 
    >>> Wilcoxon rank sum test
    >>> 
    >>> data:  1:7 and 1/8 + c(9:4, 10000)
    >>> W = 6, p-value = 0.01748
    >>> alternative hypothesis: true location shift is not equal to 0
    >>> 
    >>>> wilcox.test(1:7, 1/8+ c(9:4, Inf))
    >>> 
    >>> Wilcoxon rank sum test
    >>> 
    >>> data:  1:7 and 1/8 + c(9:4, Inf)
    >>> W = 6, p-value = 0.03497
    >>> alternative hypothesis: true location shift is not equal to 0
    >>> 
    >>> The  Inf  case should definitely give the same as the 
10'000 case.
    >>> That's exactly one property of a robust statistic.
    >>> 
    >>> Thank you, Karolis, this is pretty embarrassing to only be
detected now after 25+ years of R in use ...
    >>> 
    >>> The correct fix starts with replacing the  is.finite()  by  
!is.na() and keep the 'Inf' in the rank computations...
    >>> (but then probably also deal with the case of more than one
Inf, notably the  Inf - Inf  "exception" which is not triggered by
your example...)
    >>> 
    >>> 
    >>> ---
    >>> 
    >>> Ben addressed the "rounding" / numerical issues
unavoidable for
    >>> the other problems.
    >>> 
    >>> > 2. tolerance issues with paired=TRUE.
    >>> 
    >>> > wilcox.test(c(4, 3, 2), c(3, 2, 1), paired=TRUE)
    >>> > # ...
    >>> > # Warning:  cannot compute exact p-value with ties
    >>> 
    >>> > wilcox.test(c(0.4,0.3,0.2), c(0.3,0.2,0.1), paired=TRUE)
    >>> > # ...
    >>> > # no warning
    >>> 
    >>> > 3. Always 'x observations are missing' when
paired=TRUE
    >>> 
    >>> > wilcox.test(c(1,2), c(NA_integer_,NA_integer_),
paired=TRUE)
    >>> > # ...
    >>> > # Error:  not enough (finite) 'x' observations
    >>> 
    >>> > 4. No indication if normal approximation was used:
    >>> 
    >>> > # different numbers, but same "method" name
    >>> > wilcox.test(rnorm(10), exact=FALSE, correct=FALSE)
    >>> > wilcox.test(rnorm(10), exact=TRUE, correct=FALSE)
    >>> 
    >>> 
    >>> > From all of these I am pretty sure the 1st one is likely
unintended,
    >>> > so attaching a small patch to adjust it. Can also try
patching others if
    >>> > consensus is reached that the behavioiur has to be
modified.
    >>> 
    >>> > Kind regards,
    >>> > Karolis Koncevi?ius.
    >>> 
    >>> > ---
    >>> 
    >>> > Index: wilcox.test.R
    >>> >
==================================================================   
>>> > --- wilcox.test.R  (revision 77540)
    >>> > +++ wilcox.test.R  (working copy)
    >>> > @@ -42,7 +42,7 @@
    >>> > if(paired) {
    >>> > if(length(x) != length(y))
    >>> > stop("'x' and 'y' must have the same
length")
    >>> > -            OK <- complete.cases(x, y)
    >>> > +            OK <- is.finite(x) & is.finite(y)
    >>> > x <- x[OK] - y[OK]
    >>> > y <- NULL
    >>> > }
    >>> 
    >>> > ______________________________________________
    >>> > R-devel at r-project.org mailing list
    >>> > https://stat.ethz.ch/mailman/listinfo/r-devel

Seemingly Similar Threads

Search for more possibly parallel threads

R devel - Dec 2019 - Inconsistencies in wilcox.test

[Rd] Inconsistencies in wilcox.test

[Rd] Inconsistencies in wilcox.test

[Rd] Inconsistencies in wilcox.test

Seemingly Similar Threads