thr3ads.net - R devel - [Rd] cwilcox - new version [Jan 2024]

If this information is useful, please help other people find it:
Share via:

Aidan Lakshman

2024-Jan-16 20:29 UTC

[Rd] cwilcox - new version

I?ve been looking at this for a couple hours--it ended up being trickier than I
expected to implement well.

I?ve attached a new patch here. This version scales significantly better than
the existing method for both `pwilcox` and `dwilcox`. Examples are included
below.

I can?t think of any other ways to improve runtime at this point. That?s not to
say there aren?t any, though?I?m hopeful inspection by someone else could
identify more ways to save some time.

This version uses Andreas?s algorithm for computing the value of `cwilcox`, but
evaluates it lazily and caches results. Memory usage of this should be orders of
magnitude less than the current version, and recursion is eliminated. As far as
I can tell, the numerical accuracy of the results is unchanged.

Performance statistics are interesting. If we assume the two populations have a
total of `m` members, then this implementation runs slightly slower for m <
20, and much slower for 50 < m < 100. However, this implementation works
significantly *faster* for m > 200. The breakpoint is precisely when each
population has a size of 50; `qwilcox(0.5,50,50)` runs in 8 microseconds in the
current version, but `qwilcox(0.5, 50, 51)` runs in 5 milliseconds. The new
version runs in roughly 1 millisecond for both. This is probably because of
internal logic that requires many more `free/calloc` calls if either population
is larger than `WILCOX_MAX`, which is set to 50.

I?m hopeful that this can be optimized to be suitable for inclusion in R. Lower
performance for population sizes below 50 is not ideal, since `wilcox.test`
switches to non-exact testing for population sizes above 50.

-Aidan

Benchmarking results on my machine using `microbenchmark`:
(median runtimes over 100 iterations, us=microseconds, ms=milliseconds,
s=seconds)

`qwilcox(0.5, n, n)`

              n:     10      25     50     100     200
    old version:  1.2us   2.9us  9.0us  87.4ms    2.3s
Andreas version:  2.7us  68.6us  1.1ms  16.9ms 264.3ms

`dwilcox((n*n)%/%2, n, n)`
              n:     10      25     50     100     200
    old version:  1.4us   0.9us  0.9us  43.2ms 851.6ms
Andreas version:  2.3us  53.9us  1.0ms  16.4ms 261.7ms


`pwilcox(1:100, 10, 10)`:
    old version:  62.9us
Andreas version:  22.9us


-----------------------
Aidan Lakshman (he/him)
PhD Candidate, Wright Lab
University of Pittsburgh School of Medicine
Department of Biomedical Informatics
http://www.ahl27.com/
ahl27 at pitt.edu | (724) 612-9940

On 16 Jan 2024, at 9:47, Aidan Lakshman wrote:
> New email client, hoping that Outlook won?t continue to HTMLify my emails
and mess up my attachments without my consent :/
>
> I?m re-attaching the patch file here, compressed as a .gz in case that was
the issue with the previous attachment. Thanks to Ivan for pointing out that my
attachment didn?t go through correctly on the first try.
>
> I?ll be working on seeing if I can optimize this to actually improve
performance during the day today.
>
> -Aidan
>
> On 15 Jan 2024, at 5:51, Andreas L?ffler wrote:
>
>> I am a newbie to C (although I did some programming in Perl and Pascal)
so
>> forgive me for the language that follows. I am writing because I have a
>> question concerning the *implementation of *(the internal function)*
>> cwilcox* in
>>
https://svn.r-project.org/R/!svn/bc/81274/branches/R-DL/src/nmath/wilcox.c.
>> This function is used to determine the test statistics of the
>> Wilcoxon-Mann-Whitney U-test.
>>
>>
_______________________________________________________________________________________________________________________________________
>> The function `cwilcox` has three inputs: k, m and n. To establish the
test
>> statistics `cwilcox` determines the number of possible sums of natural
>> numbers with the following restrictions:
>> - the sum of the elements must be k,
>> - the elements (summands) must be in a decreasing (or increasing)
order,
>> - every element must be less then m and
>> - there must be at most n summands.
>>
>> The problem with the evaluation of this function `cwilcox(k,m,n)` is
that
>> it requires a recursion where in fact a three-dimensional array has to
be
>> evaluated (see the code around line 157). This requires a lot of memory
and
>> takes time and seems still an issue even with newer machines, see the
>> warning in the documentation
>>
https://www.rdocumentation.org/packages/stats/versions/3.6.2/topics/wilcox.test
>> .
>>
>> In an old paper I have written a formula where the evaluation can be
done
>> in a one-dimensional array that uses way less memory and is much
faster.
>> The paper itself is in German (you find a copy here:
>>
https://upload.wikimedia.org/wikipedia/commons/f/f5/LoefflerWilcoxonMannWhitneyTest.pdf),
>> so I uploaded a translation into English (to be found in
>> https://upload.wikimedia.org/wikipedia/de/1/19/MannWhitney_151102.pdf).
>>
>> I also looked at the code in R and wrote a new version of the code that
>> uses my formulae so that a faster implementation is possible (code with
>> comments is below). I have several questions regarding the code:
>>
>>    1. A lot of commands in the original R code are used to handle the
>>    memory. I do not know how to do this so I skipped memory handling
>>    completely and simply allocated space to an array (see below int
>>    cwilcox_distr[(m*n+1)/2];). Since my array is 1-dimensional instead
of
>>    3-dimensional I think as a first guess that will be ok.
>>    2. I read the documentation when and why code in R should be
changed. I
>>    am not familiar enough with R to understand how this applies here.
My code
>>    uses less memory - is that enough? Or should another function be
defined
>>    that uses my code? Or shall it be disregarded?
>>    3. I was not able to file my ideas in bugzilla.r-project or the like
and
>>    I hope that this mailing list is a good address for my question.
>>
>> I also have written an example of a main function where the original
code
>> from R is compared to my code. I do not attach this example because
this
>> email is already very long but I can provide that example if needed.
>>
>> Maybe we can discuss this further. Best wishes, Andreas L?ffler.
>>
>>
>> ```
>> #include <stdio.h>
>> #include <stdlib.h>
>>
>> /* The traditional approch to determine the Mann-Whitney statistics
uses a
>> recursion formular for partitions of natural numbers, namely in the
line
>> w[i][j][k] = cwilcox(k - j, i - 1, j) + cwilcox(k, i, j - 1);
>> (taken from
>>
https://svn.r-project.org/R/!svn/bc/81274/branches/R-DL/src/nmath/wilcox.c).
>> This approach requires a huge number of partitions to be evaluated
because
>> the second variable (j in the left term) and the third variable (k in
the
>> left term) in this recursion are not constant but change as well.
Hence, a
>> three dimensional array is evaluated.
>>
>> In an old paper a recursion equations was shown that avoids this
>> disadvantage. The recursion equation of that paper uses only an array
where
>> the second as well as the third variable remain constant. This implies
>> faster evaluation and less memory used. The original paper is in German
and
>> can be found in
>>
https://upload.wikimedia.org/wikipedia/commons/f/f5/LoefflerWilcoxonMannWhitneyTest.pdf
>> and the author has uploaded a translation into English in
>> https://upload.wikimedia.org/wikipedia/de/1/19/MannWhitney_151102.pdf.
This
>> function uses this approach. */
>>
>> static int
>> cwilcox_sigma(int k, int m, int n) { /* this relates to the sigma
function
>> below */
>>   int s, d;
>>
>>   s=0;
>>   for (d = 1; d <= m; d++) {
>>   if (k%d == 0) {
>>   s=s+d;
>>   }
>>   }
>>   for (d = n+1; d <= n+m; d++) {
>>   if (k%d == 0) {
>>   s=s-d;
>>   }
>>   }
>>   return s;
>> }
>>
>> /* this can replace cwilcox. It runs faster and uses way less memory */
>> static double
>> cwilcox2(int k, int m, int n){
>>
>> int cwilcox_distr[(m*n+1)/2]; /* will store (one half of the)
distribution
>> */
>> int s, i, kk;
>>
>> if (2*k>m*n){
>> k=m*n-k; /* permutation function is symmetric */
>> }
>>
>> for (kk=0; 2*kk<=m*n+1; kk++){
>> if (kk==0){
>> cwilcox_distr[0]=1; /* by definition 0 has only 1 partition */
>> } else {
>> s=0;
>> for (i = 0; i<kk; i++){
>> s=s+cwilcox_distr[i]*cwilcox_sigma(kk-i,m,n); /* recursion formula */
>>    }
>>    cwilcox_distr[kk]=s/kk;
>>    }
>> }
>>
>> return (double) cwilcox_distr[k];
>> }
>>
>>     [[alternative HTML version deleted]]
>>
>> ______________________________________________
>> R-devel at r-project.org mailing list
>> https://stat.ethz.ch/mailman/listinfo/r-devel-------------- next part --------------
A non-text attachment was scrubbed...
Name: wilcox_patch_draft_v2.gz
Type: application/x-gzip
Size: 2652 bytes
Desc: not available
URL:
<https://stat.ethz.ch/pipermail/r-devel/attachments/20240116/51c0bd7c/attachment.bin>

Andreas Löffler

2024-Jan-17 10:55 UTC

head link

[Rd] cwilcox - new version

>
>
> Performance statistics are interesting. If we assume the two populations
> have a total of `m` members, then this implementation runs slightly slower
> for m < 20, and much slower for 50 < m < 100. However, this
implementation
> works significantly *faster* for m > 200. The breakpoint is precisely
when
> each population has a size of 50; `qwilcox(0.5,50,50)` runs in 8
> microseconds in the current version, but `qwilcox(0.5, 50, 51)` runs in 5
> milliseconds. The new version runs in roughly 1 millisecond for both. This
> is probably because of internal logic that requires many more `free/calloc`
> calls if either population is larger than `WILCOX_MAX`, which is set to 50.
>Also because cwilcox_sigma has to be evaluated, and this is slightly more
demanding since it uses k%d.

There is a tradeoff here between memory usage and time of execution. I am
not a heavy user of the U test but I think the typical use case does not
involve several hundreds of tests in a session so execution time (my 2
cents) is less important. But if R crashes one execution is already
problematic.

But the takeaway is  probably: we should implement both approaches in the
code and leave it to the user which one she prefers. If time is important
and memory not an issue and if m, n are low go for the "traditional
approach". Otherwise, use my formula?

PS (@Aidan): I have applied for an bugzilla account two days ago and heard
not back from them. Also Spam is empty. Is that ok or shall I do something?

	[[alternative HTML version deleted]]

Seemingly Similar Threads

Search for more seemingly similar threads

R devel - Jan 2024 - cwilcox - new version

[Rd] cwilcox - new version

[Rd] cwilcox - new version

Seemingly Similar Threads