So I have a function that does lapply's for me based on dimension. Currently only works for length(pivotColumns)=2 because I haven't fixed the rbinds. I have two versions. One runs WAYYY faster than the other. And I'm not sure why. Fast Version: fedb.ddplyWrapper2Fast <- function(data, pivotColumns, listNameFunctions, ...){ lapplyFunctionRecurse <- function(cdata, level=1, ...){ if(level==1){ return(lapply(split(seq(nrow(cdata)),cdata[,pivotColumns[level]], drop=T), function(x) lapplyFunctionRecurse(x, level+1, ...))) } else if (level==length(pivotColumns)) { # return(lapply(split(cdata,data[cdata,pivotColumns[level]], drop=T), function(x, ...) listNameFunctions(data[x,], ...))) return(lapply(split(cdata,data[cdata,pivotColumns[level]], drop=T), function(x, ...) c(data[cdata[1],pivotColumns[2]], data[cdata[1],pivotColumns[1]], sum(data[cdata,"A"], na.rm=T), sum(data[cdata,"A"], na.rm=T)))) } else { return(lapply(split(cdata,data[cdata,pivotColumns[level]], drop=T), function(x) lapplyFunctionRecurse(x, level+1, ...))) } } result = lapplyFunctionRecurse(data, ...) matrix2 <- do.call('rbind', lapply(result, function(x) do.call('rbind',x))) return(matrix2) } dat <- data.frame(D=sample(32000:33000, 666000, T),Fid=sample(1:10,666000,T), A=sample(1:5,666000,T))> temp = proc.time(); ret = fedb.ddplyWrapper2(dat, c("D", "Fid"),function(x) c(sum(x[,"A"], na.rm=T), sum(x[,"A"], na.rm=T))); proc.time()-temp user system elapsed 4.616 0.006 4.630 #note in thie case the anonymous function I pass in isn't used because I hardcode the function into the lapply. approx 4 seconds This runs very fast. This runs very slow: fedb.ddplyWrapper2 <- function(data, pivotColumns, listNameFunctions, ...){ lapplyFunctionRecurse <- function(cdata, level=1, ...){ if(level==1){ return(lapply(split(seq(nrow(cdata)),cdata[,pivotColumns[level]], drop=T), function(x) lapplyFunctionRecurse(x, level+1, ...))) } else if (level==length(pivotColumns)) { #this line is different. it essentially calls the function you pass in return(lapply(split(cdata,data[cdata,pivotColumns[level]], drop=T), function(x, ...) listNameFunctions(data[x,], ...))) } else { return(lapply(split(cdata,data[cdata,pivotColumns[level]], drop=T), function(x) lapplyFunctionRecurse(x, level+1, ...))) } } result = lapplyFunctionRecurse(data, ...) matrix2 <- do.call('rbind', lapply(result, function(x) do.call('rbind',x))) return(matrix2) } dat <- data.frame(D=sample(32000:33000, 666000, T),Fid=sample(1:10,666000,T), A=sample(1:5,666000,T))> temp = proc.time(); ret = fedb.ddplyWrapper2(dat, c("D", "Fid"),function(x) c(sum(x[,"A"], na.rm=T), sum(x[,"A"], na.rm=T))); proc.time()-temp user system elapsed 16.346 65.059 81.680 Can anyone explain to me why there is a 4x time difference? I don't want to have to hardcore into the recursion function, but if I have to I will. Thanks, Rob [[alternative HTML version deleted]]
On my computer your two examples seem to execute about the same:> fedb.ddplyWrapper2Fast <- function(data, pivotColumns, listNameFunctions,+ ...){ + lapplyFunctionRecurse <- function(cdata, level=1, ...){ + if(level==1){ + + return(lapply(split(seq(nrow(cdata)),cdata[,pivotColumns[level]], drop=T), + function(x) lapplyFunctionRecurse(x, level+1, ...))) + } else if (level==length(pivotColumns)) { + # + return(lapply(split(cdata,data[cdata,pivotColumns[level]], drop=T), + function(x, ...) listNameFunctions(data[x,], ...))) + return(lapply(split(cdata,data[cdata,pivotColumns[level]], + drop=T), function(x, ...) c(data[cdata[1],pivotColumns[2]], + data[cdata[1],pivotColumns[1]], sum(data[cdata,"A"], na.rm=T), + sum(data[cdata,"A"], na.rm=T)))) + } else { + return(lapply(split(cdata,data[cdata,pivotColumns[level]], + drop=T), function(x) lapplyFunctionRecurse(x, level+1, ...))) + } + } + result = lapplyFunctionRecurse(data, ...) + matrix2 <- do.call('rbind', lapply(result, function(x) + do.call('rbind',x))) + return(matrix2) + }> > Rprof() > dat <- data.frame(D=sample(32000:33000, 666000,+ T),Fid=sample(1:10,666000,T), A=sample(1:5,666000,T))> temp = proc.time(); ret = fedb.ddplyWrapper2Fast(dat, c("D", "Fid"),+ function(x) c(sum(x[,"A"], na.rm=T), sum(x[,"A"], na.rm=T)));> proc.time()-tempuser system elapsed 23.44 7.37 30.86> fedb.ddplyWrapper2 <- function(data, pivotColumns, listNameFunctions, ...){+ lapplyFunctionRecurse <- function(cdata, level=1, ...){ + if(level==1){ + + return(lapply(split(seq(nrow(cdata)),cdata[,pivotColumns[level]], drop=T), + function(x) lapplyFunctionRecurse(x, level+1, ...))) + } else if (level==length(pivotColumns)) { + #this line is different. it essentially calls the function you pass in + return(lapply(split(cdata,data[cdata,pivotColumns[level]], + drop=T), function(x, ...) listNameFunctions(data[x,], ...))) + } else { + return(lapply(split(cdata,data[cdata,pivotColumns[level]], + drop=T), function(x) lapplyFunctionRecurse(x, level+1, ...))) + } + } + result = lapplyFunctionRecurse(data, ...) + matrix2 <- do.call('rbind', lapply(result, function(x) + do.call('rbind',x))) + return(matrix2) + }> > dat <- data.frame(D=sample(32000:33000, 666000,+ T),Fid=sample(1:10,666000,T), A=sample(1:5,666000,T))> temp = proc.time(); ret = fedb.ddplyWrapper2(dat, c("D", "Fid"),+ function(x) c(sum(x[,"A"], na.rm=T), sum(x[,"A"], na.rm=T)));> proc.time()-tempuser system elapsed 24.06 7.38 31.50 If you run Rprof, most of the time is being spent accessing the dataframe. I would suggest that you convert the dataframe to a matrix to get better performance. Here is what I saw in the Rprof of the first example: 0 19.9 root 1. 19.7 fedb.ddplyWrapper2Fast 2. . 19.7 lapplyFunctionRecurse 3. . . 19.7 lapply 4. . . . 19.4 FUN 5. . . . . 19.4 lapplyFunctionRecurse 6. . . . . . 19.3 lapply 7. . . . . . . 18.6 FUN 8. . . . . . . . 18.6 listNameFunctions 9. . . . . . . . . 18.5 [ 10. . . . . . . . . . 18.3 [.data.frame <<- most of the time in accessing the data within a data frame. 11. . . . . . . . . . . 14.6 attr 11. . . . . . . . . . . 0.5 %in% 12. . . . . . . . . . . . 0.4 match 13. . . . . . . . . . . . . 0.4 is.factor 14. . . . . . . . . . . . . . 0.3 inherits 11. . . . . . . . . . . 0.5 [[ 12. . . . . . . . . . . . 0.5 [[.data.frame 13. . . . . . . . . . . . . 0.2 %in% 14. . . . . . . . . . . . . . 0.2 match 15. . . . . . . . . . . . . . . 0.1 is.factor 16. . . . . . . . . . . . . . . . 0.1 inherits 11. . . . . . . . . . . 0.4 anyDuplicated 12. . . . . . . . . . . . 0.2 anyDuplicated.default 11. . . . . . . . . . . 0.2 names 12. . . . . . . . . . . . 0.2 names 11. . . . . . . . . . . 0.1 vector 12. . . . . . . . . . . . 0.1 length 13. . . . . . . . . . . . . 0.1 length 7. . . . . . . 0.7 is.vector 8. . . . . . . . 0.7 split 9. . . . . . . . . 0.6 split.default 10. . . . . . . . . . 0.5 factor 11. . . . . . . . . . . 0.2 as.character 11. . . . . . . . . . . 0.1 unique 12. . . . . . . . . . . . 0.1 unique.default 10. . . . . . . . . . 0.2 [ 11. . . . . . . . . . . 0.1 [.data.frame 4. . . . 0.4 is.vector 5. . . . . 0.4 split 6. . . . . . 0.4 split.default 7. . . . . . . 0.4 factor 8. . . . . . . . 0.3 as.character 1. 0.1 data.frame On Fri, Feb 26, 2010 at 2:58 PM, Rob Forler <rforler at uchicago.edu> wrote:> So I have a function that does lapply's for me based on dimension. Currently > only works for length(pivotColumns)=2 because I haven't fixed the rbinds. I > have two versions. One runs WAYYY faster than the other. And I'm not sure > why. > > Fast Version: > > fedb.ddplyWrapper2Fast <- function(data, pivotColumns, listNameFunctions, > ...){ > ? ?lapplyFunctionRecurse <- function(cdata, level=1, ...){ > ? ? ? ?if(level==1){ > > return(lapply(split(seq(nrow(cdata)),cdata[,pivotColumns[level]], drop=T), > function(x) lapplyFunctionRecurse(x, level+1, ...))) > ? ? ? ?} else if (level==length(pivotColumns)) { > ? ? ? ? ? ?# > return(lapply(split(cdata,data[cdata,pivotColumns[level]], drop=T), > function(x, ...) listNameFunctions(data[x,], ...))) > ? ? ? ? ? ?return(lapply(split(cdata,data[cdata,pivotColumns[level]], > drop=T), function(x, ...) c(data[cdata[1],pivotColumns[2]], > data[cdata[1],pivotColumns[1]], sum(data[cdata,"A"], na.rm=T), > sum(data[cdata,"A"], na.rm=T)))) > ? ? ? ?} else { > ? ? ? ? ? ?return(lapply(split(cdata,data[cdata,pivotColumns[level]], > drop=T), function(x) lapplyFunctionRecurse(x, level+1, ...))) > ? ? ? ?} > ? ?} > ? ?result = lapplyFunctionRecurse(data, ...) > ? ?matrix2 <- do.call('rbind', lapply(result, function(x) > do.call('rbind',x))) > ? ?return(matrix2) > } > > > dat <- data.frame(D=sample(32000:33000, 666000, > T),Fid=sample(1:10,666000,T), A=sample(1:5,666000,T)) >> temp = proc.time(); ret = fedb.ddplyWrapper2(dat, c("D", "Fid"), > function(x) c(sum(x[,"A"], na.rm=T), sum(x[,"A"], na.rm=T))); > proc.time()-temp > ? user ?system elapsed > ?4.616 ? 0.006 ? 4.630 > #note in thie case the anonymous function I pass in isn't used because I > hardcode the function into the lapply. > > approx 4 seconds > > This runs very fast. This runs very slow: > > fedb.ddplyWrapper2 <- function(data, pivotColumns, listNameFunctions, ...){ > ? ?lapplyFunctionRecurse <- function(cdata, level=1, ...){ > ? ? ? ?if(level==1){ > > return(lapply(split(seq(nrow(cdata)),cdata[,pivotColumns[level]], drop=T), > function(x) lapplyFunctionRecurse(x, level+1, ...))) > ? ? ? ?} else if (level==length(pivotColumns)) { > ? ? ? ? ? ?#this line is different. it essentially calls the function you > pass in > ? ? ? ? ? ?return(lapply(split(cdata,data[cdata,pivotColumns[level]], > drop=T), function(x, ...) listNameFunctions(data[x,], ...))) > ? ? ? ?} else { > ? ? ? ? ? ?return(lapply(split(cdata,data[cdata,pivotColumns[level]], > drop=T), function(x) lapplyFunctionRecurse(x, level+1, ...))) > ? ? ? ?} > ? ?} > ? ?result = lapplyFunctionRecurse(data, ...) > ? ?matrix2 <- do.call('rbind', lapply(result, function(x) > do.call('rbind',x))) > ? ?return(matrix2) > } > > dat <- data.frame(D=sample(32000:33000, 666000, > T),Fid=sample(1:10,666000,T), A=sample(1:5,666000,T)) >> temp = proc.time(); ret = fedb.ddplyWrapper2(dat, c("D", "Fid"), > function(x) c(sum(x[,"A"], na.rm=T), sum(x[,"A"], na.rm=T))); > proc.time()-temp > ? user ?system elapsed > ?16.346 ?65.059 ?81.680 > > > > Can anyone explain to me why there is a 4x time difference? I don't want to > have to hardcore into the recursion function, but if I have to I will. > > Thanks, > Rob > > ? ? ? ?[[alternative HTML version deleted]] > > ______________________________________________ > R-help at r-project.org mailing list > stat.ethz.ch/mailman/listinfo/r-help > PLEASE do read the posting guide R-project.org/posting-guide.html > and provide commented, minimal, self-contained, reproducible code. >-- Jim Holtman Cincinnati, OH +1 513 646 9390 What is the problem that you are trying to solve?
I'm sorry, Rob, but that code is dense enough and formatted badly enough that it's hard to dig through. You may want to try the data.table package. The development version on R-forge is pretty fast for grouping operations like this. I'm not sure if this is what you're really after. It's hard to tell from your example. Compare some speeds:> dat <- data.frame(D=sample(32000:33000, 666000,T),+ Fid=sample(1:10,666000,T), + A=sample(1:5,666000,T))> > ####### one of your examples > system.time(ret <- fedb.ddplyWrapper2(dat, c("D", "Fid"),+ function(x) c(sum(x[,"A"], na.rm=T), sum(x[,"A"], na.rm=T)))) user system elapsed 21.78 14.42 36.35> > > ####### data.table > install.packages("data.table",repos="R-Forge.R-project.org") > library(data.table) > dt <- as.data.table(dat) > system.time(ret2 <- dt[, sum(A, na.rm=T), by = "D,Fid"])user system elapsed 0.27 0.00 0.28> > > ####### plyr for comparison, too > library(plyr) > system.time(ret3 <- ddply(dat, .(D,Fid), function(x) sum(x$A, na.rm=T)))user system elapsed 28.94 12.16 41.23> head(ret)[,1] [,2] 1 175 175 2 222 222 3 221 221 4 134 134 5 253 253 6 194 194> head(ret2)D Fid V1 [1,] 32000 1 228 [2,] 32000 2 209 [3,] 32000 3 182 [4,] 32000 4 180 [5,] 32000 5 181 [6,] 32000 6 222> head(ret3)D Fid V1 1 32000 1 175 2 32000 2 222 3 32000 3 221 4 32000 4 134 5 32000 5 253 6 32000 6 194 - Tom On Fri, Feb 26, 2010 at 2:58 PM, Rob Forler <rforler at uchicago.edu> wrote:> So I have a function that does lapply's for me based on dimension. Currently > only works for length(pivotColumns)=2 because I haven't fixed the rbinds. I > have two versions. One runs WAYYY faster than the other. And I'm not sure > why. > > Fast Version: > > fedb.ddplyWrapper2Fast <- function(data, pivotColumns, listNameFunctions, > ...){ > ? ?lapplyFunctionRecurse <- function(cdata, level=1, ...){ > ? ? ? ?if(level==1){ > > return(lapply(split(seq(nrow(cdata)),cdata[,pivotColumns[level]], drop=T), > function(x) lapplyFunctionRecurse(x, level+1, ...))) > ? ? ? ?} else if (level==length(pivotColumns)) { > ? ? ? ? ? ?# > return(lapply(split(cdata,data[cdata,pivotColumns[level]], drop=T), > function(x, ...) listNameFunctions(data[x,], ...))) > ? ? ? ? ? ?return(lapply(split(cdata,data[cdata,pivotColumns[level]], > drop=T), function(x, ...) c(data[cdata[1],pivotColumns[2]], > data[cdata[1],pivotColumns[1]], sum(data[cdata,"A"], na.rm=T), > sum(data[cdata,"A"], na.rm=T)))) > ? ? ? ?} else { > ? ? ? ? ? ?return(lapply(split(cdata,data[cdata,pivotColumns[level]], > drop=T), function(x) lapplyFunctionRecurse(x, level+1, ...))) > ? ? ? ?} > ? ?} > ? ?result = lapplyFunctionRecurse(data, ...) > ? ?matrix2 <- do.call('rbind', lapply(result, function(x) > do.call('rbind',x))) > ? ?return(matrix2) > } > > > dat <- data.frame(D=sample(32000:33000, 666000, > T),Fid=sample(1:10,666000,T), A=sample(1:5,666000,T)) >> temp = proc.time(); ret = fedb.ddplyWrapper2(dat, c("D", "Fid"), > function(x) c(sum(x[,"A"], na.rm=T), sum(x[,"A"], na.rm=T))); > proc.time()-temp > ? user ?system elapsed > ?4.616 ? 0.006 ? 4.630 > #note in thie case the anonymous function I pass in isn't used because I > hardcode the function into the lapply. > > approx 4 seconds > > This runs very fast. This runs very slow: > > fedb.ddplyWrapper2 <- function(data, pivotColumns, listNameFunctions, ...){ > ? ?lapplyFunctionRecurse <- function(cdata, level=1, ...){ > ? ? ? ?if(level==1){ > > return(lapply(split(seq(nrow(cdata)),cdata[,pivotColumns[level]], drop=T), > function(x) lapplyFunctionRecurse(x, level+1, ...))) > ? ? ? ?} else if (level==length(pivotColumns)) { > ? ? ? ? ? ?#this line is different. it essentially calls the function you > pass in > ? ? ? ? ? ?return(lapply(split(cdata,data[cdata,pivotColumns[level]], > drop=T), function(x, ...) listNameFunctions(data[x,], ...))) > ? ? ? ?} else { > ? ? ? ? ? ?return(lapply(split(cdata,data[cdata,pivotColumns[level]], > drop=T), function(x) lapplyFunctionRecurse(x, level+1, ...))) > ? ? ? ?} > ? ?} > ? ?result = lapplyFunctionRecurse(data, ...) > ? ?matrix2 <- do.call('rbind', lapply(result, function(x) > do.call('rbind',x))) > ? ?return(matrix2) > } > > dat <- data.frame(D=sample(32000:33000, 666000, > T),Fid=sample(1:10,666000,T), A=sample(1:5,666000,T)) >> temp = proc.time(); ret = fedb.ddplyWrapper2(dat, c("D", "Fid"), > function(x) c(sum(x[,"A"], na.rm=T), sum(x[,"A"], na.rm=T))); > proc.time()-temp > ? user ?system elapsed > ?16.346 ?65.059 ?81.680 > > head(ret3)D Fid V1 1 32000 1 175 2 32000 2 222 3 32000 3 221 4 32000 4 134 5 32000 5 253 6 32000 6 194> > > Can anyone explain to me why there is a 4x time difference? I don't want to > have to hardcore into the recursion function, but if I have to I will. > > Thanks, > Rob > > ? ? ? ?[[alternative HTML version deleted]] > > ______________________________________________ > R-help at r-project.org mailing list > stat.ethz.ch/mailman/listinfo/r-help > PLEASE do read the posting guide R-project.org/posting-guide.html > and provide commented, minimal, self-contained, reproducible code. >