thr3ads.net - R devel - [Rd] max.col oddity [Dec 2006]

If this information is useful, please help other people find it:
Share via:

John Zedlewski

2006-Dec-16 00:09 UTC

[Rd] max.col oddity

I've noticed that the max.col function with the default "random"
option often gives unexpected results. For instance, in this test, it
seems clear what the answer should be:
> # second col should always be max
> x1 = cbind(1:10, 2:11, -Inf)
>
> # this works fine
> max.col(x1, "first")
 [1] 2 2 2 2 2 2 2 2 2 2>
> # this gives random answers
> max.col(x1)
> [1] 3 1 1 2 3 3 1 3 1 1
Ouch! max.col is randomizing across all values.
Even without infinite values, something similar can happen:
> # test 2 --- tolerance problems
>
> # clearly column 3 is the max
> x1 = cbind(-1e9 * 1:10, 1:10, 2:11)
>
> # again, first method works:
> max.col(x1, "first")
 [1] 3 3 3 3 3 3 3 3 3 3>
> # but random doesn't
> max.col(x1)
 [1] 2 3 2 3 3 2 2 2 3 2>
The max.col docs say " there is a relative tolerance of 1e-5, relative
to the largest entry in the row", but it's really using the maximum
absolute value entry in the row (appl/maxcol.c, line 35 in R 2.4.0).
Is this necessary for some sort of S-plus compatibility? If so, I
think it would be good to make this absolute value property very clear
in the docs, since it can cause subtle bugs (and did for me).

Personally, I think the behavior is much nicer with the following patch:

--- rplain/R-2.4.0/src/appl/maxcol.c    2006-04-09 18:19:58.000000000 -0400
+++ R-2.4.0/src/appl/maxcol.c   2006-12-14 15:30:56.000000000 -0500
@@ -26,13 +26,14 @@
       do_rand = *ties_meth == 1;

    for (r = 0; r < n_r; r++) {
-       /* first check row for any NAs and find the largest abs(entry) */
+       /* first check row for any NAs and find the largest entry */
       large = 0.0;
       isna = FALSE;
       for (c = 0; c < *nc; c++) {
           a = matrix[r + c * n_r];
           if (ISNAN(a)) { isna = TRUE; break; }
-           if (do_rand) large = fmax2(large, fabs(a));
+           if (!R_FINITE(a)) continue;
+           if (do_rand) large = fmax2(large, a);
       }
       if (isna) { maxes[r] = NA_INTEGER; continue; }

---------------- END   ----------------------

This gives the expected behavior in the two examples above.

(Sorry to crosspost to both this list and R-help, but it was suggested
that R-devel would be a more appropriate forum for this.)

Duncan Murdoch

2006-Dec-16 11:37 UTC

head link

[Rd] max.col oddity

On 12/15/2006 7:09 PM, John Zedlewski wrote:> I've noticed that the max.col function with the default
"random"
> option often gives unexpected results. For instance, in this test, it
> seems clear what the answer should be:
> 
>> # second col should always be max
>> x1 = cbind(1:10, 2:11, -Inf)
>>
>> # this works fine
>> max.col(x1, "first")
>  [1] 2 2 2 2 2 2 2 2 2 2
>> # this gives random answers
>> max.col(x1)
>> [1] 3 1 1 2 3 3 1 3 1 1
> 
> Ouch! max.col is randomizing across all values.
> Even without infinite values, something similar can happen:
> 
>> # test 2 --- tolerance problems
>>
>> # clearly column 3 is the max
>> x1 = cbind(-1e9 * 1:10, 1:10, 2:11)
>>
>> # again, first method works:
>> max.col(x1, "first")
>  [1] 3 3 3 3 3 3 3 3 3 3
>> # but random doesn't
>> max.col(x1)
>  [1] 2 3 2 3 3 2 2 2 3 2
> 
> The max.col docs say " there is a relative tolerance of 1e-5, relative
> to the largest entry in the row", but it's really using the
maximum
> absolute value entry in the row (appl/maxcol.c, line 35 in R 2.4.0).
> Is this necessary for some sort of S-plus compatibility? If so, I
> think it would be good to make this absolute value property very clear
> in the docs, since it can cause subtle bugs (and did for me).
> 
> Personally, I think the behavior is much nicer with the following patch:
> 
> --- rplain/R-2.4.0/src/appl/maxcol.c    2006-04-09 18:19:58.000000000 -0400
> +++ R-2.4.0/src/appl/maxcol.c   2006-12-14 15:30:56.000000000 -0500
> @@ -26,13 +26,14 @@
>        do_rand = *ties_meth == 1;
> 
>     for (r = 0; r < n_r; r++) {
> -       /* first check row for any NAs and find the largest abs(entry) */
> +       /* first check row for any NAs and find the largest entry */
>        large = 0.0;
>        isna = FALSE;
>        for (c = 0; c < *nc; c++) {
>            a = matrix[r + c * n_r];
>            if (ISNAN(a)) { isna = TRUE; break; }
> -           if (do_rand) large = fmax2(large, fabs(a));
> +           if (!R_FINITE(a)) continue;
> +           if (do_rand) large = fmax2(large, a);
>        }
>        if (isna) { maxes[r] = NA_INTEGER; continue; }
> 
> ---------------- END   ----------------------
> 
> This gives the expected behavior in the two examples above.
I think I'd prefer to be clear about what "largest" means in the
docs
rather than dropping the absolute value, because if all entries are 
negative, your version sets the tolerance to 0.

But ignoring infinite values when calculating the largest absolute value 
might be a good idea:  I can't see why someone would want an infinite 
tolerance.

So I'd propose to add your R_FINITE check, and make this change to the docs:

" there is a relative tolerance of 1e-5, relative
   to the largest finite absolute value in the row"

The copyright notice in the file says this function is from MASS/MASS.c 
by W. N. Venables and B. D. Ripley, so I'd like to hear from at least 
one of them before making a change.

Duncan Murdoch
> 
> (Sorry to crosspost to both this list and R-help, but it was suggested
> that R-devel would be a more appropriate forum for this.)
> 
> ______________________________________________
> R-devel at r-project.org mailing list
> https://stat.ethz.ch/mailman/listinfo/r-devel

Possibly Parallel Threads

Search for more possibly parallel threads

R devel - Dec 2006 - max.col oddity

[Rd] max.col oddity

[Rd] max.col oddity

Possibly Parallel Threads