thr3ads.net - R help - [R] Possible Improvement to sapply [Mar 2018]

William Dunlap

2018-Mar-13 16:14 UTC

[R] Possible Improvement to sapply

Could your code use vapply instead of sapply?  vapply forces you to declare
the type and dimensions of FUN's output and stops if any call to FUN does
not match the declaration.  It can use much less memory and time than sapply
because it fills in the output array as it goes instead of calling lapply()
and seeing how it could be simplified.
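
To illustrate the point about declaring FUN's output, here is a minimal sketch. The list `myList` mirrors the toy data from the original post; the object names `n`, `rng`, and `bad` are illustrative only and do not come from the thread.

```r
# Toy data, same shape as the list in the original post
myList <- list(a = rnorm(100), b = rnorm(100))

# Declare that FUN returns exactly one integer per element
n <- vapply(myList, length, FUN.VALUE = integer(1))

# Declare a length-2 double result; vapply returns a 2 x length(myList) matrix
rng <- vapply(myList, range, FUN.VALUE = numeric(2))

# If FUN's result does not match the declaration, vapply stops with an error
bad <- tryCatch(
  vapply(myList, range, FUN.VALUE = numeric(1)),
  error = function(e) conditionMessage(e)
)
```

Whether this is faster than sapply in a given workload is something to measure rather than assume; the type check alone can be a reason to prefer it.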

Bill Dunlap
TIBCO Software
wdunlap tibco.com

On Tue, Mar 13, 2018 at 7:06 AM, Doran, Harold <HDoran at air.org> wrote:
> Martin
>
> In terms of context of the actual problem, sapply is called millions of
> times because the work involves scoring individual students who took a
> test. A score for student A is generated and then student B and such and
> there are millions of students. The psychometric process of scoring
> students is complex and our code makes use of sapply many times for each
> student.
>
> The toy example used length just to illustrate; our actual code doesn't do
> that. But your point is well taken: there may be a very good counterexample
> showing why my proposal doesn't achieve the goal in a generalizable way.
>
>
>
> -----Original Message-----
> From: Martin Morgan [mailto:martin.morgan at roswellpark.org]
> Sent: Tuesday, March 13, 2018 9:43 AM
> To: Doran, Harold <HDoran at air.org>; 'r-help at r-project.org' <r-help at r-project.org>
> Subject: Re: [R] Possible Improvement to sapply
>
>
>
> On 03/13/2018 09:23 AM, Doran, Harold wrote:
> > While working with sapply, the documentation states that the simplify
> > argument will yield a vector, matrix etc "when possible". I was
> > curious how the code actually defined "as possible" and see this
> > within the function
> >
> > if (!identical(simplify, FALSE) && length(answer))
> >
> > This seems superfluous to me, in particular this part:
> >
> > !identical(simplify, FALSE)
> >
> > The preceding code could be reduced to
> >
> > if (simplify && length(answer))
> >
> > and it would not need to execute the call to identical in order to
> > trigger the conditional execution, which is known from the user's
> > simplify = TRUE or FALSE inputs. I *think* the extra call to identical
> > is just unnecessary overhead in this instance.
> >
> > Take for example, the following toy example code and benchmark results
> > and a small modification to sapply:
> >
> > myList <- list(a = rnorm(100), b = rnorm(100))
> >
> > answer <- lapply(X = myList, FUN = length)
> > simplify = TRUE
> >
> > library(microbenchmark)
> >
> > mySapply <- function (X, FUN, ..., simplify = TRUE, USE.NAMES = TRUE){
> >      FUN <- match.fun(FUN)
> >      answer <- lapply(X = X, FUN = FUN, ...)
> >      if (USE.NAMES && is.character(X) && is.null(names(answer)))
> >          names(answer) <- X
> >      if (simplify && length(answer))
> >          simplify2array(answer, higher = (simplify == "array"))
> >      else answer
> > }
> >
> >
> >> microbenchmark(sapply(myList, length), times = 10000L)
> > Unit: microseconds
> >                     expr    min     lq     mean median     uq    max neval
> >   sapply(myList, length) 14.156 15.572 16.67603 15.926 16.634 650.46 10000
> >> microbenchmark(mySapply(myList, length), times = 10000L)
> > Unit: microseconds
> >                       expr    min     lq     mean median     uq      max neval
> >   mySapply(myList, length) 13.095 14.864 16.02964 15.218 15.573 1671.804 10000
> >
> > My benchmark timings show an improvement from that small change alone,
> > though it is seemingly nominal. In my actual work, the sapply function
> > is called millions of times and this additional overhead adds up to
> > some overall additional computing time.
> >
> > I have done some limited testing on various real data to verify that
> > both variants of sapply (base R and my modified version) yield
> > identical objects when simplify is either TRUE or FALSE.
> >
> > Perhaps someone else sees a counterexample where my proposed fix
> > causes sapply not to behave as expected.
> >
>
> Check out ?sapply for possible values of `simplify=` to see why your
> proposal is not adequate.
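
One such value is `simplify = "array"`, which ?sapply documents alongside TRUE and FALSE. The sketch below copies the proposed `mySapply` from the original post and calls it with that value; the helper `make_mat` and the result names are hypothetical. Because `&&` does not accept a character operand, the proposed condition is expected to fail where base sapply succeeds.

```r
# Proposed variant, copied from the original post
mySapply <- function(X, FUN, ..., simplify = TRUE, USE.NAMES = TRUE) {
  FUN <- match.fun(FUN)
  answer <- lapply(X = X, FUN = FUN, ...)
  if (USE.NAMES && is.character(X) && is.null(names(answer)))
    names(answer) <- X
  if (simplify && length(answer))
    simplify2array(answer, higher = (simplify == "array"))
  else answer
}

# Hypothetical helper: returns a 2 x 2 matrix filled with i
make_mat <- function(i) matrix(i, 2, 2)

# Base sapply stacks the matrices into a 2 x 2 x 3 array
base_res <- sapply(1:3, make_mat, simplify = "array")

# The proposed variant evaluates "array" && ..., which signals an error
my_res <- tryCatch(
  mySapply(1:3, make_mat, simplify = "array"),
  error = function(e) e
)
```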
>
> For your example, lengths() is an order of magnitude faster than sapply(.,
> length). This is an example of the advantages of vectorization (a single
> call to an R function implemented in C) versus iteration (`for` loops, but
> also the *apply family calling an R function many times).
> vapply() might also be relevant.
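
As a small sketch of the three approaches mentioned here, applied to the toy list from the original post (the result names are illustrative; no timings are claimed, since they depend on the machine and R version):

```r
myList <- list(a = rnorm(100), b = rnorm(100))

# Iteration: FUN is called once per list element
by_sapply <- sapply(myList, length)

# Iteration with a declared result type
by_vapply <- vapply(myList, length, FUN.VALUE = integer(1))

# Vectorized: one call, with the loop done in C
by_lengths <- lengths(myList)

# All three give the same element counts
same <- all(by_sapply == by_lengths) && all(by_vapply == by_lengths)
```

To compare speed, each expression could be wrapped in microbenchmark() in the same way as the benchmarks in the original post.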
>
> Often performance improvements come from looking one layer up from where
> the problem occurs and re-thinking the algorithm. Why would one need to
> call sapply() millions of times, in a situation where this becomes
> rate-limiting? Can the algorithm be re-implemented to avoid this step?
>
> Martin Morgan
>
> > Harold
> >
> > ______________________________________________
> > R-help at r-project.org mailing list -- To UNSUBSCRIBE and more, see
> > https://stat.ethz.ch/mailman/listinfo/r-help
> > PLEASE do read the posting guide
> > http://www.R-project.org/posting-guide.html
> > and provide commented, minimal, self-contained, reproducible code.
> >
>
>

Henrik Bengtsson

2018-Mar-13 17:12 UTC

[R] Possible Improvement to sapply

FYI, in R devel (to become 3.5.0), there's isFALSE() which will cut
some corners compared to identical():
> microbenchmark::microbenchmark(identical(FALSE, FALSE), isFALSE(FALSE))
Unit: nanoseconds
                    expr min   lq    mean median     uq   max neval
 identical(FALSE, FALSE) 984 1138 1694.13 1218.0 1337.5 13584   100
          isFALSE(FALSE) 713  761 1133.53  809.5  871.5 18619   100
> microbenchmark::microbenchmark(identical(TRUE, FALSE), isFALSE(TRUE))
Unit: nanoseconds
                   expr  min     lq    mean median   uq   max neval
 identical(TRUE, FALSE) 1009 1103.5 2228.20 1170.5 1357 14346   100
          isFALSE(TRUE)  718  760.0 1298.98  798.0  898 17782   100
> microbenchmark::microbenchmark(identical("array", FALSE), isFALSE("array"))
Unit: nanoseconds
                      expr min     lq    mean median     uq  max neval
 identical("array", FALSE) 975 1058.5 1257.95 1119.5 1250.0 9299   100
          isFALSE("array") 409  433.5  658.76  446.0  476.5 9383   100

That could probably also be used in sapply().  The difference is that
isFALSE() is a bit more liberal than identical(x, FALSE), e.g.
> isFALSE(c(a = FALSE))
[1] TRUE
> identical(c(a = FALSE), FALSE)
[1] FALSE
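
A rough sketch of what using isFALSE() in an sapply-like wrapper could look like is given below. The name `sapply2` is hypothetical, the body follows the function shown in the original post with only the condition changed, and the fallback definition of isFALSE() is an assumption for R versions older than 3.5.0. It is not presented as the actual base R source.

```r
# Fallback for R < 3.5.0, where isFALSE() is not available (assumed definition)
if (!exists("isFALSE", mode = "function")) {
  isFALSE <- function(x) is.logical(x) && length(x) == 1L && !is.na(x) && !x
}

# Hypothetical wrapper: same structure as sapply, testing simplify via isFALSE()
sapply2 <- function(X, FUN, ..., simplify = TRUE, USE.NAMES = TRUE) {
  FUN <- match.fun(FUN)
  answer <- lapply(X = X, FUN = FUN, ...)
  if (USE.NAMES && is.character(X) && is.null(names(answer)))
    names(answer) <- X
  # Any value other than a single FALSE (including "array") takes this branch
  if (!isFALSE(simplify) && length(answer))
    simplify2array(answer, higher = (simplify == "array"))
  else answer
}

x <- list(a = 1:3, b = 1:5)
res_default <- sapply2(x, length)
res_list    <- sapply2(x, length, simplify = FALSE)
res_array   <- sapply2(1:3, function(i) matrix(i, 2, 2), simplify = "array")
```

Unlike a bare `simplify &&` test, this keeps `simplify = "array"` working, at the cost of treating a named `c(a = FALSE)` the same as FALSE.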

Assuming the latter is not an issue, there are 69 places in base R
where isFALSE() could be used:

$ grep -E "identical[(][^,]+,[ ]*FALSE[)]" -r --include="*.R" | grep -F "/R/" | wc
     69     326    5472

and another 59 where isTRUE() can be used:

$ grep -E "identical[(][^,]+,[ ]*TRUE[)]" -r --include="*.R" | grep -F "/R/" | wc
     59     307    5021

/Henrik

On Tue, Mar 13, 2018 at 9:21 AM, Doran, Harold <HDoran at air.org> wrote:
> Quite possibly, and I'll look into that. Aside from the work I was doing,
> however, I wonder if there is a way such that sapply could avoid the overhead
> of having to call the identical function to determine the conditional path.
