Hugh,
(Note: I speak for myself only, not for R-core.) Thanks for looking into
this. I think it's great to have community members who are interested in
contributing to R and helping it continue to get better.
I think, and my local experiments bear out, that using anyNA as a
fast-path condition does allow a significant speedup over what's in there
now. To do so, though, I took a somewhat different approach from your
proposal:
ifelse2 <- function(test, yes, no) {
    if (is.atomic(test)) {
        if (typeof(test) != "logical")
            storage.mode(test) <- "logical"
        if (length(test) == 1 && is.null(attributes(test))) {
            if (is.na(test))
                return(NA)
            else if (test) {
                if (length(yes) == 1) {
                    yat <- attributes(yes)
                    if (is.null(yat) ||
                        (is.function(yes) && identical(names(yat), "srcref")))
                        return(yes)
                }
            }
            else if (length(no) == 1) {
                nat <- attributes(no)
                if (is.null(nat) ||
                    (is.function(no) && identical(names(nat), "srcref")))
                    return(no)
            }
        }
    }
    else test <- if (isS4(test))
        methods::as(test, "logical")
    else as.logical(test)
    ## this is to ensure the documented behavior re: attributes of the result
    ans <- test
    len <- length(ans)
    if (nonas <- !anyNA(test)) {
        ypos <- test
        npos <- !test
    } else {
        ok <- !(nas <- is.na(test))
        ypos <- test & ok
        npos <- !test & ok
    }
    if (any(ypos, na.rm = TRUE))    ## equivalent to any(test[ok])
        ans[ypos] <- rep(yes, length.out = len)[ypos]
    if (any(npos, na.rm = TRUE))    ## equivalent to any(!test[ok])
        ans[npos] <- rep(no, length.out = len)[npos]
    ## This is in the original, but I don't see why it's necessary,
    ## since ans is initialized to test; the NAs should already be there.
    if (!nonas)
        ans[nas] <- NA
    ans
}
On my machine, after an initial call to invoke the JIT and get the
function compiled, this is faster than the current ifelse at test lengths
of 100 and 10000 (with the lengths of yes and no at 10% of the length of
test): by ~1.7x and ~2x respectively with no NAs, and by ~1.3x and ~1.6x
respectively with 10% NAs.
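For the record, those numbers came from a harness roughly like the sketch
below. The helper time_ifelse and its defaults are my own illustrative
names for this message, not code from the patch itself:

```r
## Illustrative timing harness (hypothetical helper, not from the patch):
## times an ifelse-style function f over reps calls, with yes/no recycled
## from vectors at 10% of length(test), optionally injecting NAs.
time_ifelse <- function(f, n, na_frac = 0, reps = 100L) {
    test <- sample(c(TRUE, FALSE), n, replace = TRUE)
    if (na_frac > 0)
        test[sample(n, ceiling(na_frac * n))] <- NA
    yes <- seq_len(max(1L, n %/% 10L))
    no  <- -seq_len(max(1L, n %/% 10L))
    system.time(for (i in seq_len(reps)) f(test, yes, no))[["elapsed"]]
}

## e.g., after defining ifelse2 as above:
## time_ifelse(ifelse, 10000, na_frac = 0.1)
## time_ifelse(ifelse2, 10000, na_frac = 0.1)
```

This follows Radford's suggestion of system.time over microbenchmark, so
each variant's garbage collection cost is attributed to its own loop.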
The key, from what I saw, is to avoid as much &-ing and subsetting as we
can. If there are no NAs, none of the test & ok or test[ok] operations
does anything, because ok contains only TRUEs. Even when NAs are present,
we want to do the & only once and avoid test[ok] entirely.
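A toy example of that masking step (my own illustration, not code from the
patch):

```r
## A single & pass yields index masks that are FALSE at NA positions, so
## they can be used for logical subsetting without ever building test[ok].
test <- c(TRUE, FALSE, NA, TRUE)
ok   <- !is.na(test)

ypos <- test & ok     # TRUE FALSE FALSE TRUE
npos <- !test & ok    # FALSE TRUE FALSE FALSE

## With no NAs, ok is all TRUE, so test & ok would just recompute test --
## which is why the anyNA fast path skips the & entirely:
no_na <- c(TRUE, FALSE)
identical(no_na & !is.na(no_na), no_na)   # TRUE
```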
There are further savings in the NAs-present case if I'm correct that the
ans[nas] <- NA step is redundant and we're able to remove that as well.
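A small check of that claim (again my own toy example): since ans starts
as a copy of test, the NA positions are never written to, and they survive
the coercion when the yes/no values are assigned in:

```r
## ans inherits its NAs from test; masked assignment never touches them.
test <- c(TRUE, NA, FALSE)
ans  <- test                       # NA already sits at position 2
ok   <- !is.na(test)
ans[test & ok]  <- "yes"           # coerces ans to character; NA survives
ans[!test & ok] <- "no"
ans                                # "yes" NA "no" -- no ans[nas] <- NA needed
```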
I'm happy to submit this as a patch and share credit if that is ok with
you. Let me know.
Best,
On Thu, May 3, 2018 at 9:58 PM, Hugh Parsonage <hugh.parsonage at gmail.com> wrote:
> Thanks Radford. I concur with all your points. I've attempted to address
> the issues you raised through the github.io post. The new method appears
> to be slower for test lengths < 100 and possibly longer lengths (not just
> < 10). Of course length(test) < 100 is very quick, so I simply added this
> to the conditions that cause the old ifelse method to be invoked. I'll
> leave it to R-core to decide whether or not the benefits for longer
> vectors are worth it.
>
> On Fri, 4 May 2018 at 01:01 Radford Neal <radford at cs.toronto.edu> wrote:
>
> > > I propose a patch to ifelse that leverages anyNA(test) to achieve an
> > > improvement in performance. For a test vector of length 10, the change
> > > nearly halves the time taken, and for a test of length 1 million, there
> > > is a tenfold increase in speed. Even for small vectors, the
> > > distributions of timings between the old and the proposed ifelse do
> > > not intersect.
> >
> > For smaller vectors, your results are significantly affected by your
> > invoking the old version via base::ifelse. You could try defining
> > your new version as new_ifelse, and invoking the old version as just
> > ifelse. There might still be some issues with the two versions having
> > different context w.r.t environments, and hence looking up functions
> > in different ways. You could copy the code of the old version and
> > define it in the global environment just like new_ifelse.
> >
> > When using ifelse rather than base::ifelse, it seems the new version
> > is slower for vectors of length 10, but faster for long vectors.
> >
> > Also, I'd use system.time rather than microbenchmark. The latter will
> > mix invocations of the two functions in a way where it is unclear that
> > garbage collection time will be fairly attributed. Also, it's a bit
> > silly to plot the distributions of times, which will mostly reflect
> > variations in when garbage collections at various levels occur - just
> > the mean is what is relevant.
> >
> > Regards,
> >
> > Radford Neal
> >
>
>
> ______________________________________________
> R-devel at r-project.org mailing list
> https://stat.ethz.ch/mailman/listinfo/r-devel
>
>
--
Gabriel Becker, Ph.D
Scientist
Bioinformatics and Computational Biology
Genentech Research