thr3ads.net - R devel - [Rd] ifelse() woes ... can we agree on a ifelse2() ? [Nov 2016]

If this information is useful, please help other people find it:
Share via:

Hadley Wickham

2016-Aug-15 12:51 UTC

[Rd] ifelse() woes ... can we agree on a ifelse2() ?

On Fri, Aug 12, 2016 at 11:31 AM, Hadley Wickham <h.wickham at gmail.com>
wrote:>>     >> One possibility would also be to consider  a
"numbers-only" or
>>     >> rather "same type"-only {e.g., would also work
for characters}
>>     >> version.
>>
>>     > I don't know what you mean by these.
>>
>> In the mean time, Bob Rudis mentioned   dplyr::if_else(),
>> which is very relevant, thank you Bob!
>>
>> As I have found, that actually works in such a "same
type"-only way:
>> It does not try to coerce, but gives an error when the classes
>> differ, even in this somewhat debatable case :
>>
>>    > dplyr::if_else(c(TRUE, FALSE), 2:3, 0+10:11)
>>    Error: `false` has type 'double' not 'integer'
>>    >
>>
>> As documented, if_else() is clearly stricter than ifelse()
>> and e.g., also does no recycling (but of length() 1).
>
> I agree that if_else() is currently too strict - it's particularly
> annoying if you want to replace some values with a missing:
>
> x <- sample(10)
> if_else(x > 5, NA, x)
> #  Error: `false` has type 'integer' not 'logical'
>
> But I would like to make sure that this remains an error:
>
> if_else(x > 5, x, "BLAH")
>
> Because that seems more likely to be a user error (but reasonable
> people might certainly believe that it should just work)
>
> dplyr is more accommodating in other places (i.e. in bind_rows(),
> collapse() and the joins) but it's surprisingly hard to get all the
> details right. For example, what should the result of this call be?
>
> if_else(c(TRUE, FALSE), factor(c("a", "b")),
factor(c("c", "b"))
>
> Strictly speaking I think you could argue it's an error, but that's
> not very user-friendly. Should it be a factor with the union of the
> levels? Should it be a character vector + warning? Should the
> behaviour change if one set of levels is a subset of the other set?
>
> There are similar issues for POSIXct (if the time zones are different,
> which should win?), and difftimes (similarly for units).  Ideally
> you'd like the behaviour to be extensible for new S3 classes, which
> suggests it should be a generic (and for the most general case, it
> would need to dispatch on both arguments).
One possible principle would be to use c() - i.e. construct out as

out <- c(yes[0], no[0]
length(out) <- max(length(yes), length(no))

But of course that wouldn't help with factor responses.

Also, if you're considering an improved ifelse(), I'd strongly urge
you to consider adding an `na` argument, so that you can use ifelse()
to transform all three possible values in a logical vector.

Hadley

-- 
http://hadley.nz

Martin Maechler

2016-Nov-15 11:58 UTC

head link

[Rd] ifelse() woes ... can we agree on a ifelse2() ?

Finally getting back to this :
>>>>> Hadley Wickham <h.wickham at gmail.com>
>>>>>     on Mon, 15 Aug 2016 07:51:35 -0500 writes:
    > On Fri, Aug 12, 2016 at 11:31 AM, Hadley Wickham
    > <h.wickham at gmail.com> wrote:
    >>> >> One possibility would also be to consider a
    >>> "numbers-only" or >> rather "same
type"-only {e.g.,
    >>> would also work for characters} >> version.
    >>>
    >>> > I don't know what you mean by these.
    >>>
    >>> In the mean time, Bob Rudis mentioned dplyr::if_else(),
    >>> which is very relevant, thank you Bob!
    >>>
    >>> As I have found, that actually works in such a "same
    >>> type"-only way: It does not try to coerce, but gives an
    >>> error when the classes differ, even in this somewhat
    >>> debatable case :
    >>>
    >>> > dplyr::if_else(c(TRUE, FALSE), 2:3, 0+10:11) Error:
    >>> `false` has type 'double' not 'integer'
    >>> >
    >>>
    >>> As documented, if_else() is clearly stricter than
    >>> ifelse() and e.g., also does no recycling (but of
    >>> length() 1).
    >>
    >> I agree that if_else() is currently too strict - it's
    >> particularly annoying if you want to replace some values
    >> with a missing:
    >>
    >> x <- sample(10) if_else(x > 5, NA, x) # Error: `false`
    >> has type 'integer' not 'logical'
    >>
    >> But I would like to make sure that this remains an error:
    >>
    >> if_else(x > 5, x, "BLAH")
    >>
    >> Because that seems more likely to be a user error (but
    >> reasonable people might certainly believe that it should
    >> just work)
    >>
    >> dplyr is more accommodating in other places (i.e. in
    >> bind_rows(), collapse() and the joins) but it's
    >> surprisingly hard to get all the details right. For
    >> example, what should the result of this call be?
    >>
    >> if_else(c(TRUE, FALSE), factor(c("a", "b")),
    >> factor(c("c", "b"))
    >>
    >> Strictly speaking I think you could argue it's an error,
    >> but that's not very user-friendly. Should it be a factor
    >> with the union of the levels? Should it be a character
    >> vector + warning? Should the behaviour change if one set
    >> of levels is a subset of the other set?
    >>
    >> There are similar issues for POSIXct (if the time zones
    >> are different, which should win?), and difftimes
    >> (similarly for units).  Ideally you'd like the behaviour
    >> to be extensible for new S3 classes, which suggests it
    >> should be a generic (and for the most general case, it
    >> would need to dispatch on both arguments).

    > One possible principle would be to use c() -
    > i.e. construct out as

    > out <- c(yes[0], no[0]
    > length(out) <- max(length(yes), length(no))

yes; this would require that a  `length<-` method works for the
class of the result.

Duncan Murdoch mentioned a version of this, in his very
first reply:

    ans <- c(yes, no)[seq_along(test)]
    ans <- ans[seq_along(test)]

which is less efficient for atomic vectors, but requires
less from the class: it "only" needs `c` and `[` to work

and a mixture of your two proposals would be possible too:

    ans <- c(yes[0], no[0])
    ans <- ans[seq_along(test)]

which does *not* work for my "mpfr" numbers (CRAN package
'Rmpfr'),
but that's a buglet in the  c.mpfr() implementation of my Rmpfr
package... (which has already been fixed in the development version on R-forge,
	    https://r-forge.r-project.org/R/?group_id=386)

    > But of course that wouldn't help with factor responses.

Yes.  However, a version of Duncan's suggestion -- of treating 'yes'
first
-- does help in that case.

For once, mainly as "feasability experiment",
I have created a github gist to make my current ifelse2() proposal available
for commenting, cloning, pullrequesting, etc:

Consisting of 2 files
- ifelse-def.R :  Functions definitions only, basically all the current
	        proposals, called  ifelse*()
- ifelse-checks.R : A simplistic checking function
 	and examples calling it, notably demonstrating that my
	ifelse2()  does work with
	"Date", <dateTime> (i.e. "POSIXct" and
"POSIXlt"), factors,
	and "mpfr" (the arbitrary-precision numbers in my package
"Rmpfr")

Also if you are not on github, you can quickly get to the ifelse2()
definition :

https://gist.github.com/mmaechler/9cfc3219c4b89649313bfe6853d87894#file-ifelse-def-r-L168

    > Also, if you're considering an improved ifelse(), I'd
    > strongly urge you to consider adding an `na` argument,

I now did (called it 'NA.').

    > so that you can use ifelse() to transform all three
    > possible values in a logical vector.

    > Hadley
    > -- http://hadley.nz

For those who really hate GH (and don't want or cannot easily follow the
above URL), here's my current definition: 


##' Martin Maechler, 14. Nov 2016 --- taking into account Duncan M. and
Hadley's
##' ideas in the R-devel thread starting at (my mom's 86th birthday):
##' https://stat.ethz.ch/pipermail/r-devel/2016-August/072970.html
ifelse2 <- function (test, yes, no, NA. = NA) {
    if(!is.logical(test)) {
        if(is.atomic(test))
            storage.mode(test) <- "logical"
        else ## typically a "class"; storage.mode<-() typically
fails
            test <- if(isS4(test)) methods::as(test, "logical")
else as.logical(test)
    }

    ## No longer optimize the  "if (a) x else y"  cases:
    ## Only "non-good" R users use ifelse(.) instead of if(.) in these
cases.

    ans <-
	tryCatch(rep(if(is.object(yes) && identical(class(yes), class(no)))
			 ## as c(o) or o[0] may not work for the class
			 yes else c(yes[0], no[0]), length.out = length(test)),
		 error = function(e) { ## try asymmetric, yes-leaning
		     r <- yes
		     r[!test] <- no[!test]
		     r
		 })
    ok <- !(nas <- is.na(test))
    if (any(test[ok]))
	ans[test & ok] <- rep(yes, length.out = length(ans))[test & ok]
    if (any(!test[ok]))
	ans[!test & ok] <- rep(no, length.out = length(ans))[!test & ok]
    ans[nas] <- NA. # possibly coerced to class(ans)
    ans
}

Gabriel Becker

2016-Nov-15 19:56 UTC

head link

[Rd] ifelse() woes ... can we agree on a ifelse2() ?

All,

Martin: Thanks for this and all the other things you are doing to both
drive R forward and engage more with the community about things like this.

Apologies for missing this discussion the first time it came around and if
anything here has already been brought up, but I wonder what exactly you
mean when you want recycling behavior.

Specifically, based on an unrelated discussion with Henrik Bengtsson on
Twitter, I wonder if preserving the recycling behavior test is longer than
yes, no, but making the case where

length( test ) < max(length( yes ), length( no ))

would simplify usage for userRs in a useful way. I suspect it's easy to
forget that the result is not guaranteed to be the length of  test, even
for intermediate and advanced users familiar with ifelse and it's
strengths/weaknesses.

I certainly agree (for what that's worth...) that

x = rnorm(100)

y = ifelse2(x > 0, 1L, 2L)

should continue to work.

Also, If we combine a stricter contract that the output will always be of
length with the suggestion of a specified output class the pseudo code
could be

ifelse2 = function(test, yes, no, outclass) {

lenout  = length(test)

out = as( rep(yes, length.out = lenout), outclass)

out[!test] = as(rep(no, length.out = lenout)[!test], outclass)

#handle NA stuff

out

}


NAs could be tricky if outclass were allowed to be completely general, but
doable, I think? Another approach  if we ARE fast-passing while leaving
ifelse intact is that maybe NA's in test just aren't allowed in ifelse2.
I'm not saying we should definitely do that, but it's possible and would
make things faster.

Finally, In terms of efficiency, with the stuff that Luke and I are working
on, the NA detection could be virtually free in certain cases, which could
give a nice boost for long vectors  that don't have any NAs (and
'know'
that they don't).

Best,
~G

On Tue, Nov 15, 2016 at 3:58 AM, Martin Maechler <maechler at
stat.math.ethz.ch> wrote:
> Finally getting back to this :
>
> >>>>> Hadley Wickham <h.wickham at gmail.com>
> >>>>>     on Mon, 15 Aug 2016 07:51:35 -0500 writes:
>
>     > On Fri, Aug 12, 2016 at 11:31 AM, Hadley Wickham
>     > <h.wickham at gmail.com> wrote:
>     >>> >> One possibility would also be to consider a
>     >>> "numbers-only" or >> rather "same
type"-only {e.g.,
>     >>> would also work for characters} >> version.
>     >>>
>     >>> > I don't know what you mean by these.
>     >>>
>     >>> In the mean time, Bob Rudis mentioned dplyr::if_else(),
>     >>> which is very relevant, thank you Bob!
>     >>>
>     >>> As I have found, that actually works in such a "same
>     >>> type"-only way: It does not try to coerce, but gives
an
>     >>> error when the classes differ, even in this somewhat
>     >>> debatable case :
>     >>>
>     >>> > dplyr::if_else(c(TRUE, FALSE), 2:3, 0+10:11) Error:
>     >>> `false` has type 'double' not 'integer'
>     >>> >
>     >>>
>     >>> As documented, if_else() is clearly stricter than
>     >>> ifelse() and e.g., also does no recycling (but of
>     >>> length() 1).
>     >>
>     >> I agree that if_else() is currently too strict - it's
>     >> particularly annoying if you want to replace some values
>     >> with a missing:
>     >>
>     >> x <- sample(10) if_else(x > 5, NA, x) # Error: `false`
>     >> has type 'integer' not 'logical'
>     >>
>     >> But I would like to make sure that this remains an error:
>     >>
>     >> if_else(x > 5, x, "BLAH")
>     >>
>     >> Because that seems more likely to be a user error (but
>     >> reasonable people might certainly believe that it should
>     >> just work)
>     >>
>     >> dplyr is more accommodating in other places (i.e. in
>     >> bind_rows(), collapse() and the joins) but it's
>     >> surprisingly hard to get all the details right. For
>     >> example, what should the result of this call be?
>     >>
>     >> if_else(c(TRUE, FALSE), factor(c("a",
"b")),
>     >> factor(c("c", "b"))
>     >>
>     >> Strictly speaking I think you could argue it's an error,
>     >> but that's not very user-friendly. Should it be a factor
>     >> with the union of the levels? Should it be a character
>     >> vector + warning? Should the behaviour change if one set
>     >> of levels is a subset of the other set?
>     >>
>     >> There are similar issues for POSIXct (if the time zones
>     >> are different, which should win?), and difftimes
>     >> (similarly for units).  Ideally you'd like the behaviour
>     >> to be extensible for new S3 classes, which suggests it
>     >> should be a generic (and for the most general case, it
>     >> would need to dispatch on both arguments).
>
>     > One possible principle would be to use c() -
>     > i.e. construct out as
>
>     > out <- c(yes[0], no[0]
>     > length(out) <- max(length(yes), length(no))
>
> yes; this would require that a  `length<-` method works for the
> class of the result.
>
> Duncan Murdoch mentioned a version of this, in his very
> first reply:
>
>     ans <- c(yes, no)[seq_along(test)]
>     ans <- ans[seq_along(test)]
>
> which is less efficient for atomic vectors, but requires
> less from the class: it "only" needs `c` and `[` to work
>
> and a mixture of your two proposals would be possible too:
>
>     ans <- c(yes[0], no[0])
>     ans <- ans[seq_along(test)]
>
> which does *not* work for my "mpfr" numbers (CRAN package
'Rmpfr'),
> but that's a buglet in the  c.mpfr() implementation of my Rmpfr
> package... (which has already been fixed in the development version on
> R-forge,
>             https://r-forge.r-project.org/R/?group_id=386)
>
>     > But of course that wouldn't help with factor responses.
>
> Yes.  However, a version of Duncan's suggestion -- of treating
'yes' first
> -- does help in that case.
>
> For once, mainly as "feasability experiment",
> I have created a github gist to make my current ifelse2() proposal
> available
> for commenting, cloning, pullrequesting, etc:
>
> Consisting of 2 files
> - ifelse-def.R :  Functions definitions only, basically all the current
>                 proposals, called  ifelse*()
> - ifelse-checks.R : A simplistic checking function
>         and examples calling it, notably demonstrating that my
>         ifelse2()  does work with
>         "Date", <dateTime> (i.e. "POSIXct" and
"POSIXlt"), factors,
>         and "mpfr" (the arbitrary-precision numbers in my package
"Rmpfr")
>
> Also if you are not on github, you can quickly get to the ifelse2()
> definition :
>
> https://gist.github.com/mmaechler/9cfc3219c4b89649313bfe6853d878
> 94#file-ifelse-def-r-L168
>
>     > Also, if you're considering an improved ifelse(), I'd
>     > strongly urge you to consider adding an `na` argument,
>
> I now did (called it 'NA.').
>
>     > so that you can use ifelse() to transform all three
>     > possible values in a logical vector.
>
>     > Hadley
>     > -- http://hadley.nz
>
> For those who really hate GH (and don't want or cannot easily follow
the
> above URL), here's my current definition:
>
>
> ##' Martin Maechler, 14. Nov 2016 --- taking into account Duncan M. and
> Hadley's
> ##' ideas in the R-devel thread starting at (my mom's 86th
birthday):
> ##' https://stat.ethz.ch/pipermail/r-devel/2016-August/072970.html
> ifelse2 <- function (test, yes, no, NA. = NA) {
>     if(!is.logical(test)) {
>         if(is.atomic(test))
>             storage.mode(test) <- "logical"
>         else ## typically a "class"; storage.mode<-()
typically fails
>             test <- if(isS4(test)) methods::as(test,
"logical") else
> as.logical(test)
>     }
>
>     ## No longer optimize the  "if (a) x else y"  cases:
>     ## Only "non-good" R users use ifelse(.) instead of if(.) in
these
> cases.
>
>     ans <-
>         tryCatch(rep(if(is.object(yes) && identical(class(yes),
class(no)))
>                          ## as c(o) or o[0] may not work for the class
>                          yes else c(yes[0], no[0]), length.out >
length(test)),
>                  error = function(e) { ## try asymmetric, yes-leaning
>                      r <- yes
>                      r[!test] <- no[!test]
>                      r
>                  })
>     ok <- !(nas <- is.na(test))
>     if (any(test[ok]))
>         ans[test & ok] <- rep(yes, length.out = length(ans))[test
& ok]
>     if (any(!test[ok]))
>         ans[!test & ok] <- rep(no, length.out = length(ans))[!test
& ok]
>     ans[nas] <- NA. # possibly coerced to class(ans)
>     ans
> }
>
> ______________________________________________
> R-devel at r-project.org mailing list
> https://stat.ethz.ch/mailman/listinfo/r-devel
>


-- 
Gabriel Becker, PhD
Associate Scientist (Bioinformatics)
Genentech Research

	[[alternative HTML version deleted]]

Possibly Parallel Threads

Search for more maybe matching threads

R devel - Nov 2016 - ifelse() woes ... can we agree on a ifelse2() ?

[Rd] ifelse() woes ... can we agree on a ifelse2() ?

[Rd] ifelse() woes ... can we agree on a ifelse2() ?

[Rd] ifelse() woes ... can we agree on a ifelse2() ?

Possibly Parallel Threads