thr3ads.net - R help - [R] Possible Improvement to sapply [Mar 2018]

Doran, Harold

2018-Mar-13 13:23 UTC

[R] Possible Improvement to sapply

While working with sapply, I noticed that the documentation states that the
simplify argument will yield a vector, matrix etc "when possible". I was curious
how the code actually defines "when possible" and found this within the function:

if (!identical(simplify, FALSE) && length(answer))

This seems superfluous to me, in particular this part:

!identical(simplify, FALSE)

The preceding code could be reduced to 

if (simplify && length(answer))

and it would not need to execute the call to identical in order to trigger the
conditional execution, which is known from the user's simplify = TRUE or
FALSE inputs. I *think* the extra call to identical is just unnecessary overhead
in this instance.

Take for example, the following toy example code and benchmark results and a
small modification to sapply:

myList <- list(a = rnorm(100), b = rnorm(100))

answer <- lapply(X = myList, FUN = length)
simplify = TRUE

library(microbenchmark)

mySapply <- function (X, FUN, ..., simplify = TRUE, USE.NAMES = TRUE){
    FUN <- match.fun(FUN)
    answer <- lapply(X = X, FUN = FUN, ...)
    if (USE.NAMES && is.character(X) && is.null(names(answer)))
        names(answer) <- X
    if (simplify && length(answer))
        simplify2array(answer, higher = (simplify == "array"))
    else answer
}

> microbenchmark(sapply(myList, length), times = 10000L)
Unit: microseconds
                   expr    min     lq     mean median     uq    max neval
 sapply(myList, length) 14.156 15.572 16.67603 15.926 16.634 650.46 10000
> microbenchmark(mySapply(myList, length), times = 10000L)
Unit: microseconds
                     expr    min     lq     mean median     uq      max neval
 mySapply(myList, length) 13.095 14.864 16.02964 15.218 15.573 1671.804 10000

My benchmark timings show an improvement from that small change alone, although
the gain appears to be small. In my actual work, the sapply function is called
millions of times, and this additional overhead accumulates into additional
overall computing time.

I have done some limited testing on various real data to verify that both
variants of sapply (base R and my modified version) yield identical objects when
simplify is either TRUE or FALSE.

Perhaps someone else sees a counterexample where my proposed change causes
sapply to behave unexpectedly.

Harold

Martin Morgan

2018-Mar-13 13:43 UTC

Check out ?sapply for possible values of `simplify=` to see why your 
proposal is not adequate.
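
A minimal sketch of the difference, assuming only the two conditions quoted in the original post: `simplify` may also be the string "array", and `&&` does not accept a character operand. The helper names `current_test` and `proposed_test` are hypothetical and exist only for this illustration.

```r
# Condition quoted from sapply() in the original post
current_test <- function(simplify) !identical(simplify, FALSE)

# Proposed shortcut; NA marks the case where `&&` raises an error
# (the TRUE on the right stands in for a non-zero length(answer))
proposed_test <- function(simplify) {
  tryCatch(simplify && TRUE, error = function(e) NA)
}

# The values of `simplify` discussed in this thread
values <- list(TRUE, FALSE, "array")

current  <- vapply(values, current_test, logical(1))
proposed <- vapply(values, proposed_test, logical(1))

# One row per value of `simplify`
print(data.frame(simplify = c("TRUE", "FALSE", "\"array\""),
                 current = current, proposed = proposed))
```

The two tests agree for TRUE and FALSE and differ only for the character value, which is the case that limited testing with logical inputs would not exercise.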

For your example, lengths() is an order of magnitude faster than
sapply(., length). This is an example of the advantages of vectorization
(a single call to an R function implemented in C) versus iteration (`for`
loops but also the *apply family calling an R function many times).
vapply() might also be relevant.
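
As a rough illustration of these alternatives, the sketch below applies all three to the toy data from the original post and checks that they agree. The fixed seed is an assumption added for reproducibility, and no timings are claimed here.

```r
set.seed(1)  # assumption: seed added only so the toy data is reproducible
myList <- list(a = rnorm(100), b = rnorm(100))

# Iteration: the R function length() is called once per list element
via_sapply <- sapply(myList, length)

# Vectorized: a single call that returns the length of every element
via_lengths <- lengths(myList)

# Iteration with a declared return type (one integer per element)
via_vapply <- vapply(myList, length, integer(1))

# All three should produce the same named integer vector
print(identical(via_sapply, via_lengths))
print(identical(via_sapply, via_vapply))
```

The relative speed on a given machine can be checked with microbenchmark(), as in the original post. vapply() also states the expected result type up front, so there is no simplification step to decide on.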

Often performance improvements come from looking one layer up from where 
the problem occurs and re-thinking the algorithm. Why would one need to 
call sapply() millions of times, in a situation where this becomes 
rate-limiting? Can the algorithm be re-implemented to avoid this step?

Martin Morgan

Doran, Harold

2018-Mar-13 14:06 UTC

Martin

As for the context of the actual problem, sapply is called millions of times
because the work involves scoring individual students who took a test. A score
is generated for student A, then for student B, and so on, and there are
millions of students. The psychometric process of scoring students is complex
and our code makes use of sapply many times for each student.

The toy example used length just to illustrate; our actual code doesn't do
that. But your point is well taken: there may be a very good counterexample
showing why my proposal doesn't achieve the goal in a generalizable way.

William Dunlap

2018-Mar-13 16:10 UTC

Wouldn't that change how simplify='array' is handled?
> str(sapply(1:3, function(x)diag(x,5,2), simplify="array"))
 int [1:5, 1:2, 1:3] 1 0 0 0 0 0 1 0 0 0 ...
> str(sapply(1:3, function(x)diag(x,5,2), simplify=TRUE))
 int [1:10, 1:3] 1 0 0 0 0 0 1 0 0 0 ...
> str(sapply(1:3, function(x)diag(x,5,2), simplify=FALSE))
List of 3
 $ : int [1:5, 1:2] 1 0 0 0 0 0 1 0 0 0
 $ : int [1:5, 1:2] 2 0 0 0 0 0 2 0 0 0
 $ : int [1:5, 1:2] 3 0 0 0 0 0 3 0 0 0


Bill Dunlap
TIBCO Software
wdunlap tibco.com

Doran, Harold

2018-Mar-13 16:14 UTC

You're right, it sure does. My suggestion causes it to fail when simplify =
"array".
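
To make the failure concrete, here is a small sketch that runs the modified function from the original post on Bill's example. The helper `f` and the use of tryCatch() to capture the error message are additions for illustration only.

```r
# Modified sapply, copied from the original post
mySapply <- function(X, FUN, ..., simplify = TRUE, USE.NAMES = TRUE) {
  FUN <- match.fun(FUN)
  answer <- lapply(X = X, FUN = FUN, ...)
  if (USE.NAMES && is.character(X) && is.null(names(answer)))
    names(answer) <- X
  if (simplify && length(answer))
    simplify2array(answer, higher = (simplify == "array"))
  else answer
}

# Bill's example function: a 5 x 2 matrix with x on the diagonal
f <- function(x) diag(x, 5, 2)

# Base sapply accepts simplify = "array" and returns a 3-d array
base_result <- sapply(1:3, f, simplify = "array")
print(dim(base_result))

# The modified version stops at `simplify && length(answer)` because
# `&&` is given a character operand; capture the message instead of halting
my_result <- tryCatch(
  mySapply(1:3, f, simplify = "array"),
  error = function(e) conditionMessage(e)
)
print(my_result)

# With logical input the two versions still agree
print(identical(mySapply(1:3, f), sapply(1:3, f)))
```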
