thr3ads.net - R help - [R] dramatic speed difference in lapply [Feb 2010]

If this information is useful, please help other people find it:
Share via:

Rob Forler

2010-Feb-26 19:58 UTC

[R] dramatic speed difference in lapply

So I have a function that does lapply's for me based on dimension. Currently
only works for length(pivotColumns)=2 because I haven't fixed the rbinds. I
have two versions. One runs WAYYY faster than the other. And I'm not sure
why.

Fast Version:

fedb.ddplyWrapper2Fast <- function(data, pivotColumns, listNameFunctions,
...){
    lapplyFunctionRecurse <- function(cdata, level=1, ...){
        if(level==1){

return(lapply(split(seq(nrow(cdata)),cdata[,pivotColumns[level]], drop=T),
function(x) lapplyFunctionRecurse(x, level+1, ...)))
        } else if (level==length(pivotColumns)) {
            #
return(lapply(split(cdata,data[cdata,pivotColumns[level]], drop=T),
function(x, ...) listNameFunctions(data[x,], ...)))
            return(lapply(split(cdata,data[cdata,pivotColumns[level]],
drop=T), function(x, ...) c(data[cdata[1],pivotColumns[2]],
data[cdata[1],pivotColumns[1]], sum(data[cdata,"A"], na.rm=T),
sum(data[cdata,"A"], na.rm=T))))
        } else {
            return(lapply(split(cdata,data[cdata,pivotColumns[level]],
drop=T), function(x) lapplyFunctionRecurse(x, level+1, ...)))
        }
    }
    result = lapplyFunctionRecurse(data, ...)
    matrix2 <- do.call('rbind', lapply(result, function(x)
do.call('rbind',x)))
    return(matrix2)
}


dat <- data.frame(D=sample(32000:33000, 666000,
T),Fid=sample(1:10,666000,T), A=sample(1:5,666000,T))> temp = proc.time(); ret = fedb.ddplyWrapper2(dat, c("D",
"Fid"),function(x) c(sum(x[,"A"], na.rm=T), sum(x[,"A"],
na.rm=T)));
proc.time()-temp
   user  system elapsed
 4.616   0.006   4.630
#note in thie case the anonymous function I pass in isn't used because I
hardcode the function into the lapply.

approx 4 seconds

This runs very fast. This runs very slow:

fedb.ddplyWrapper2 <- function(data, pivotColumns, listNameFunctions, ...){
    lapplyFunctionRecurse <- function(cdata, level=1, ...){
        if(level==1){

return(lapply(split(seq(nrow(cdata)),cdata[,pivotColumns[level]], drop=T),
function(x) lapplyFunctionRecurse(x, level+1, ...)))
        } else if (level==length(pivotColumns)) {
            #this line is different. it essentially calls the function you
pass in
            return(lapply(split(cdata,data[cdata,pivotColumns[level]],
drop=T), function(x, ...) listNameFunctions(data[x,], ...)))
        } else {
            return(lapply(split(cdata,data[cdata,pivotColumns[level]],
drop=T), function(x) lapplyFunctionRecurse(x, level+1, ...)))
        }
    }
    result = lapplyFunctionRecurse(data, ...)
    matrix2 <- do.call('rbind', lapply(result, function(x)
do.call('rbind',x)))
    return(matrix2)
}

dat <- data.frame(D=sample(32000:33000, 666000,
T),Fid=sample(1:10,666000,T), A=sample(1:5,666000,T))> temp = proc.time(); ret = fedb.ddplyWrapper2(dat, c("D",
"Fid"),function(x) c(sum(x[,"A"], na.rm=T), sum(x[,"A"],
na.rm=T)));
proc.time()-temp
   user  system elapsed
 16.346  65.059  81.680



Can anyone explain to me why there is a 4x time difference? I don't want to
have to hardcore into the recursion function, but if I have to I will.

Thanks,
Rob

	[[alternative HTML version deleted]]

jim holtman

2010-Feb-26 21:18 UTC

head link

[R] dramatic speed difference in lapply

On my computer your two examples seem to execute about the
same:> fedb.ddplyWrapper2Fast <- function(data, pivotColumns,
listNameFunctions,+ ...){
+         lapplyFunctionRecurse <- function(cdata, level=1, ...){
+            if(level==1){
+
+
return(lapply(split(seq(nrow(cdata)),cdata[,pivotColumns[level]],
drop=T),
+                 function(x) lapplyFunctionRecurse(x, level+1, ...)))
+            } else if (level==length(pivotColumns)) {
+                            #
+
return(lapply(split(cdata,data[cdata,pivotColumns[level]], drop=T),
+                 function(x, ...) listNameFunctions(data[x,], ...)))
+
return(lapply(split(cdata,data[cdata,pivotColumns[level]],
+                 drop=T), function(x, ...) c(data[cdata[1],pivotColumns[2]],
+                 data[cdata[1],pivotColumns[1]], sum(data[cdata,"A"],
na.rm=T),
+                 sum(data[cdata,"A"], na.rm=T))))
+            } else {
+                return(lapply(split(cdata,data[cdata,pivotColumns[level]],
+                 drop=T), function(x) lapplyFunctionRecurse(x, level+1, ...)))
+            }
+         }
+         result = lapplyFunctionRecurse(data, ...)
+         matrix2 <- do.call('rbind', lapply(result, function(x)
+         do.call('rbind',x)))
+         return(matrix2)
+ }>
> Rprof()
> dat <- data.frame(D=sample(32000:33000, 666000,+     T),Fid=sample(1:10,666000,T),
A=sample(1:5,666000,T))> temp = proc.time(); ret = fedb.ddplyWrapper2Fast(dat, c("D",
"Fid"),+     function(x) c(sum(x[,"A"], na.rm=T), sum(x[,"A"],
na.rm=T)));> proc.time()-temp   user  system elapsed
  23.44    7.37   30.86

> fedb.ddplyWrapper2 <- function(data, pivotColumns, listNameFunctions,
...){+    lapplyFunctionRecurse <- function(cdata, level=1, ...){
+        if(level==1){
+
+ return(lapply(split(seq(nrow(cdata)),cdata[,pivotColumns[level]], drop=T),
+ function(x) lapplyFunctionRecurse(x, level+1, ...)))
+        } else if (level==length(pivotColumns)) {
+            #this line is different. it essentially calls the
function you pass in
+            return(lapply(split(cdata,data[cdata,pivotColumns[level]],
+ drop=T), function(x, ...) listNameFunctions(data[x,], ...)))
+        } else {
+            return(lapply(split(cdata,data[cdata,pivotColumns[level]],
+ drop=T), function(x) lapplyFunctionRecurse(x, level+1, ...)))
+        }
+    }
+    result = lapplyFunctionRecurse(data, ...)
+    matrix2 <- do.call('rbind', lapply(result, function(x)
+ do.call('rbind',x)))
+    return(matrix2)
+ }>
> dat <- data.frame(D=sample(32000:33000, 666000,
+ T),Fid=sample(1:10,666000,T), A=sample(1:5,666000,T))> temp = proc.time(); ret = fedb.ddplyWrapper2(dat, c("D",
"Fid"),+ function(x) c(sum(x[,"A"], na.rm=T), sum(x[,"A"],
na.rm=T)));> proc.time()-temp   user  system elapsed
  24.06    7.38   31.50


If you run Rprof, most of the time is being spent accessing the
dataframe.  I would suggest that you convert the dataframe to a matrix
to get better performance.  Here is what I saw in the Rprof of the
first example:

  0  19.9 root
  1.   19.7 fedb.ddplyWrapper2Fast
  2. .   19.7 lapplyFunctionRecurse
  3. . .   19.7 lapply
  4. . . .   19.4 FUN
  5. . . . .   19.4 lapplyFunctionRecurse
  6. . . . . .   19.3 lapply
  7. . . . . . .   18.6 FUN
  8. . . . . . . .   18.6 listNameFunctions
  9. . . . . . . . .   18.5 [
 10. . . . . . . . . .   18.3 [.data.frame  <<- most of the time in
accessing the data within a data frame.
 11. . . . . . . . . . .   14.6 attr
 11. . . . . . . . . . .    0.5 %in%
 12. . . . . . . . . . . .    0.4 match
 13. . . . . . . . . . . . .    0.4 is.factor
 14. . . . . . . . . . . . . .    0.3 inherits
 11. . . . . . . . . . .    0.5 [[
 12. . . . . . . . . . . .    0.5 [[.data.frame
 13. . . . . . . . . . . . .    0.2 %in%
 14. . . . . . . . . . . . . .    0.2 match
 15. . . . . . . . . . . . . . .    0.1 is.factor
 16. . . . . . . . . . . . . . . .    0.1 inherits
 11. . . . . . . . . . .    0.4 anyDuplicated
 12. . . . . . . . . . . .    0.2 anyDuplicated.default
 11. . . . . . . . . . .    0.2 names
 12. . . . . . . . . . . .    0.2 names
 11. . . . . . . . . . .    0.1 vector
 12. . . . . . . . . . . .    0.1 length
 13. . . . . . . . . . . . .    0.1 length
  7. . . . . . .    0.7 is.vector
  8. . . . . . . .    0.7 split
  9. . . . . . . . .    0.6 split.default
 10. . . . . . . . . .    0.5 factor
 11. . . . . . . . . . .    0.2 as.character
 11. . . . . . . . . . .    0.1 unique
 12. . . . . . . . . . . .    0.1 unique.default
 10. . . . . . . . . .    0.2 [
 11. . . . . . . . . . .    0.1 [.data.frame
  4. . . .    0.4 is.vector
  5. . . . .    0.4 split
  6. . . . . .    0.4 split.default
  7. . . . . . .    0.4 factor
  8. . . . . . . .    0.3 as.character
  1.    0.1 data.frame



On Fri, Feb 26, 2010 at 2:58 PM, Rob Forler <rforler at uchicago.edu>
wrote:> So I have a function that does lapply's for me based on dimension.
Currently
> only works for length(pivotColumns)=2 because I haven't fixed the
rbinds. I
> have two versions. One runs WAYYY faster than the other. And I'm not
sure
> why.
>
> Fast Version:
>
> fedb.ddplyWrapper2Fast <- function(data, pivotColumns,
listNameFunctions,
> ...){
> ? ?lapplyFunctionRecurse <- function(cdata, level=1, ...){
> ? ? ? ?if(level==1){
>
> return(lapply(split(seq(nrow(cdata)),cdata[,pivotColumns[level]], drop=T),
> function(x) lapplyFunctionRecurse(x, level+1, ...)))
> ? ? ? ?} else if (level==length(pivotColumns)) {
> ? ? ? ? ? ?#
> return(lapply(split(cdata,data[cdata,pivotColumns[level]], drop=T),
> function(x, ...) listNameFunctions(data[x,], ...)))
> ? ? ? ? ? ?return(lapply(split(cdata,data[cdata,pivotColumns[level]],
> drop=T), function(x, ...) c(data[cdata[1],pivotColumns[2]],
> data[cdata[1],pivotColumns[1]], sum(data[cdata,"A"], na.rm=T),
> sum(data[cdata,"A"], na.rm=T))))
> ? ? ? ?} else {
> ? ? ? ? ? ?return(lapply(split(cdata,data[cdata,pivotColumns[level]],
> drop=T), function(x) lapplyFunctionRecurse(x, level+1, ...)))
> ? ? ? ?}
> ? ?}
> ? ?result = lapplyFunctionRecurse(data, ...)
> ? ?matrix2 <- do.call('rbind', lapply(result, function(x)
> do.call('rbind',x)))
> ? ?return(matrix2)
> }
>
>
> dat <- data.frame(D=sample(32000:33000, 666000,
> T),Fid=sample(1:10,666000,T), A=sample(1:5,666000,T))
>> temp = proc.time(); ret = fedb.ddplyWrapper2(dat, c("D",
"Fid"),
> function(x) c(sum(x[,"A"], na.rm=T), sum(x[,"A"],
na.rm=T)));
> proc.time()-temp
> ? user ?system elapsed
> ?4.616 ? 0.006 ? 4.630
> #note in thie case the anonymous function I pass in isn't used because
I
> hardcode the function into the lapply.
>
> approx 4 seconds
>
> This runs very fast. This runs very slow:
>
> fedb.ddplyWrapper2 <- function(data, pivotColumns, listNameFunctions,
...){
> ? ?lapplyFunctionRecurse <- function(cdata, level=1, ...){
> ? ? ? ?if(level==1){
>
> return(lapply(split(seq(nrow(cdata)),cdata[,pivotColumns[level]], drop=T),
> function(x) lapplyFunctionRecurse(x, level+1, ...)))
> ? ? ? ?} else if (level==length(pivotColumns)) {
> ? ? ? ? ? ?#this line is different. it essentially calls the function you
> pass in
> ? ? ? ? ? ?return(lapply(split(cdata,data[cdata,pivotColumns[level]],
> drop=T), function(x, ...) listNameFunctions(data[x,], ...)))
> ? ? ? ?} else {
> ? ? ? ? ? ?return(lapply(split(cdata,data[cdata,pivotColumns[level]],
> drop=T), function(x) lapplyFunctionRecurse(x, level+1, ...)))
> ? ? ? ?}
> ? ?}
> ? ?result = lapplyFunctionRecurse(data, ...)
> ? ?matrix2 <- do.call('rbind', lapply(result, function(x)
> do.call('rbind',x)))
> ? ?return(matrix2)
> }
>
> dat <- data.frame(D=sample(32000:33000, 666000,
> T),Fid=sample(1:10,666000,T), A=sample(1:5,666000,T))
>> temp = proc.time(); ret = fedb.ddplyWrapper2(dat, c("D",
"Fid"),
> function(x) c(sum(x[,"A"], na.rm=T), sum(x[,"A"],
na.rm=T)));
> proc.time()-temp
> ? user ?system elapsed
> ?16.346 ?65.059 ?81.680
>
>
>
> Can anyone explain to me why there is a 4x time difference? I don't
want to
> have to hardcore into the recursion function, but if I have to I will.
>
> Thanks,
> Rob
>
> ? ? ? ?[[alternative HTML version deleted]]
>
> ______________________________________________
> R-help at r-project.org mailing list
> https://stat.ethz.ch/mailman/listinfo/r-help
> PLEASE do read the posting guide
http://www.R-project.org/posting-guide.html
> and provide commented, minimal, self-contained, reproducible code.
>


-- 
Jim Holtman
Cincinnati, OH
+1 513 646 9390

What is the problem that you are trying to solve?

Tom Short

2010-Feb-26 21:21 UTC

head link

[R] dramatic speed difference in lapply

I'm sorry, Rob, but that code is dense enough and formatted badly
enough that it's hard to dig through.

You may want to try the data.table package. The development version on
R-forge is pretty fast for grouping operations like this. I'm not sure
if this is what you're really after. It's hard to tell from your
example.

Compare some speeds:
> dat <- data.frame(D=sample(32000:33000, 666000,T),+                   Fid=sample(1:10,666000,T),
+                   A=sample(1:5,666000,T))>
> ####### one of your examples
> system.time(ret <- fedb.ddplyWrapper2(dat, c("D",
"Fid"),+                     function(x) c(sum(x[,"A"], na.rm=T),
sum(x[,"A"], na.rm=T))))
   user  system elapsed
  21.78   14.42   36.35>
>
> ####### data.table
>
install.packages("data.table",repos="http://R-Forge.R-project.org")
> library(data.table)
> dt <- as.data.table(dat)
> system.time(ret2 <- dt[, sum(A, na.rm=T), by = "D,Fid"])   user  system elapsed
   0.27    0.00    0.28>
>
> ####### plyr for comparison, too
> library(plyr)
> system.time(ret3 <- ddply(dat, .(D,Fid), function(x) sum(x$A, na.rm=T)))   user  system elapsed
  28.94   12.16   41.23
> head(ret)  [,1] [,2]
1  175  175
2  222  222
3  221  221
4  134  134
5  253  253
6  194  194
> head(ret2)         D Fid  V1
[1,] 32000   1 228
[2,] 32000   2 209
[3,] 32000   3 182
[4,] 32000   4 180
[5,] 32000   5 181
[6,] 32000   6 222
> head(ret3)      D Fid  V1
1 32000   1 175
2 32000   2 222
3 32000   3 221
4 32000   4 134
5 32000   5 253
6 32000   6 194


- Tom


On Fri, Feb 26, 2010 at 2:58 PM, Rob Forler <rforler at uchicago.edu>
wrote:> So I have a function that does lapply's for me based on dimension.
Currently
> only works for length(pivotColumns)=2 because I haven't fixed the
rbinds. I
> have two versions. One runs WAYYY faster than the other. And I'm not
sure
> why.
>
> Fast Version:
>
> fedb.ddplyWrapper2Fast <- function(data, pivotColumns,
listNameFunctions,
> ...){
> ? ?lapplyFunctionRecurse <- function(cdata, level=1, ...){
> ? ? ? ?if(level==1){
>
> return(lapply(split(seq(nrow(cdata)),cdata[,pivotColumns[level]], drop=T),
> function(x) lapplyFunctionRecurse(x, level+1, ...)))
> ? ? ? ?} else if (level==length(pivotColumns)) {
> ? ? ? ? ? ?#
> return(lapply(split(cdata,data[cdata,pivotColumns[level]], drop=T),
> function(x, ...) listNameFunctions(data[x,], ...)))
> ? ? ? ? ? ?return(lapply(split(cdata,data[cdata,pivotColumns[level]],
> drop=T), function(x, ...) c(data[cdata[1],pivotColumns[2]],
> data[cdata[1],pivotColumns[1]], sum(data[cdata,"A"], na.rm=T),
> sum(data[cdata,"A"], na.rm=T))))
> ? ? ? ?} else {
> ? ? ? ? ? ?return(lapply(split(cdata,data[cdata,pivotColumns[level]],
> drop=T), function(x) lapplyFunctionRecurse(x, level+1, ...)))
> ? ? ? ?}
> ? ?}
> ? ?result = lapplyFunctionRecurse(data, ...)
> ? ?matrix2 <- do.call('rbind', lapply(result, function(x)
> do.call('rbind',x)))
> ? ?return(matrix2)
> }
>
>
> dat <- data.frame(D=sample(32000:33000, 666000,
> T),Fid=sample(1:10,666000,T), A=sample(1:5,666000,T))
>> temp = proc.time(); ret = fedb.ddplyWrapper2(dat, c("D",
"Fid"),
> function(x) c(sum(x[,"A"], na.rm=T), sum(x[,"A"],
na.rm=T)));
> proc.time()-temp
> ? user ?system elapsed
> ?4.616 ? 0.006 ? 4.630
> #note in thie case the anonymous function I pass in isn't used because
I
> hardcode the function into the lapply.
>
> approx 4 seconds
>
> This runs very fast. This runs very slow:
>
> fedb.ddplyWrapper2 <- function(data, pivotColumns, listNameFunctions,
...){
> ? ?lapplyFunctionRecurse <- function(cdata, level=1, ...){
> ? ? ? ?if(level==1){
>
> return(lapply(split(seq(nrow(cdata)),cdata[,pivotColumns[level]], drop=T),
> function(x) lapplyFunctionRecurse(x, level+1, ...)))
> ? ? ? ?} else if (level==length(pivotColumns)) {
> ? ? ? ? ? ?#this line is different. it essentially calls the function you
> pass in
> ? ? ? ? ? ?return(lapply(split(cdata,data[cdata,pivotColumns[level]],
> drop=T), function(x, ...) listNameFunctions(data[x,], ...)))
> ? ? ? ?} else {
> ? ? ? ? ? ?return(lapply(split(cdata,data[cdata,pivotColumns[level]],
> drop=T), function(x) lapplyFunctionRecurse(x, level+1, ...)))
> ? ? ? ?}
> ? ?}
> ? ?result = lapplyFunctionRecurse(data, ...)
> ? ?matrix2 <- do.call('rbind', lapply(result, function(x)
> do.call('rbind',x)))
> ? ?return(matrix2)
> }
>
> dat <- data.frame(D=sample(32000:33000, 666000,
> T),Fid=sample(1:10,666000,T), A=sample(1:5,666000,T))
>> temp = proc.time(); ret = fedb.ddplyWrapper2(dat, c("D",
"Fid"),
> function(x) c(sum(x[,"A"], na.rm=T), sum(x[,"A"],
na.rm=T)));
> proc.time()-temp
> ? user ?system elapsed
> ?16.346 ?65.059 ?81.680
>
> head(ret3)      D Fid  V1
1 32000   1 175
2 32000   2 222
3 32000   3 221
4 32000   4 134
5 32000   5 253
6 32000   6 194>
>
> Can anyone explain to me why there is a 4x time difference? I don't
want to
> have to hardcore into the recursion function, but if I have to I will.
>
> Thanks,
> Rob
>
> ? ? ? ?[[alternative HTML version deleted]]
>
> ______________________________________________
> R-help at r-project.org mailing list
> https://stat.ethz.ch/mailman/listinfo/r-help
> PLEASE do read the posting guide
http://www.R-project.org/posting-guide.html
> and provide commented, minimal, self-contained, reproducible code.
>

Maybe Matching Threads

Search for more possibly parallel threads

R help - Feb 2010 - dramatic speed difference in lapply

[R] dramatic speed difference in lapply

[R] dramatic speed difference in lapply

[R] dramatic speed difference in lapply

Maybe Matching Threads