Caroline Faisst
2011-Aug-03 13:25 UTC
[R] slow computation of functions over large datasets
Hello there,

I'm computing the total value of an order from the price of the order items using a "for" loop and the "ifelse" function. I do this on a large dataframe (close to 1m lines). The computation of this function is painfully slow: in 1 min only about 90 rows are calculated.

The computation time taken for a given number of rows increases with the size of the dataset, see the example with my function below:

# small dataset: function performs well
exampledata <- data.frame(orderID = c(1,1,1,2,2,3,3,3,4),
                          itemPrice = c(10,17,9,12,25,10,1,9,7))
exampledata[1, "orderAmount"] <- exampledata[1, "itemPrice"]
system.time(for (i in 2:length(exampledata[,1])) {
  exampledata[i, "orderAmount"] <- ifelse(exampledata[i, "orderID"] == exampledata[i-1, "orderID"],
    exampledata[i-1, "orderAmount"] + exampledata[i, "itemPrice"],
    exampledata[i, "itemPrice"])
})

# large dataset: the very same computational task takes much longer
exampledata2 <- data.frame(orderID = c(1,1,1,2,2,3,3,3,4,5:2000000),
                           itemPrice = c(10,17,9,12,25,10,1,9,7,25:2000020))
exampledata2[1, "orderAmount"] <- exampledata2[1, "itemPrice"]
system.time(for (i in 2:9) {
  exampledata2[i, "orderAmount"] <- ifelse(exampledata2[i, "orderID"] == exampledata2[i-1, "orderID"],
    exampledata2[i-1, "orderAmount"] + exampledata2[i, "itemPrice"],
    exampledata2[i, "itemPrice"])
})

Does someone know a way to increase the speed?

Thank you very much!

Caroline
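The growth in per-row cost comes from each exampledata2[i, "orderAmount"] <- ... assignment copying the whole data frame. A minimal sketch of the same running total kept on plain vectors, which avoids the repeated copies (names follow the example above; this is an illustration, not code from the original post):

# Extract the columns once, loop over plain vectors, write back a single time.
orderID     <- exampledata2$orderID
itemPrice   <- exampledata2$itemPrice
orderAmount <- itemPrice                  # default: each row starts a new order
for (i in 2:length(orderID)) {
  if (orderID[i] == orderID[i - 1]) {     # same order as the previous row:
    orderAmount[i] <- orderAmount[i - 1] + itemPrice[i]   # accumulate
  }
}
exampledata2$orderAmount <- orderAmount   # one data-frame assignment at the end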
ONKELINX, Thierry
2011-Aug-03 13:59 UTC
[R] slow computation of functions over large datasets
Dear Caroline,

Here is a faster and more elegant solution:

> n <- 10000
> exampledata <- data.frame(orderID = sample(floor(n / 5), n, replace = TRUE),
+                           itemPrice = rpois(n, 10))
> library(plyr)
> system.time({
+   ddply(exampledata, .(orderID), function(x) {
+     data.frame(itemPrice = x$itemPrice, orderAmount = cumsum(x$itemPrice))
+   })
+ })
   user  system elapsed
   1.67    0.00    1.69
> exampledata[1, "orderAmount"] <- exampledata[1, "itemPrice"]
> system.time(for (i in 2:length(exampledata[,1]))
+   {exampledata[i, "orderAmount"] <- ifelse(exampledata[i, "orderID"] == exampledata[i-1, "orderID"],
+     exampledata[i-1, "orderAmount"] + exampledata[i, "itemPrice"],
+     exampledata[i, "itemPrice"])})
   user  system elapsed
  11.94    0.02   11.97

Best regards,

Thierry

> -----Original Message-----
> From: r-help-bounces at r-project.org [mailto:r-help-bounces at r-project.org]
> On behalf of Caroline Faisst
> Sent: Wednesday, 3 August 2011 15:26
> To: r-help at r-project.org
> Subject: [R] slow computation of functions over large datasets
>
> [original message snipped]
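If only the final total per order is needed, rather than the running amount, plyr's summarise helper shortens this further. A minimal sketch reusing the simulated exampledata above (orderTotal is just an illustrative column name):

library(plyr)
# One row per orderID, holding that order's total.
orderTotals <- ddply(exampledata, .(orderID), summarise,
                     orderTotal = sum(itemPrice))
head(orderTotals)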
David Winsemius
2011-Aug-03 14:26 UTC
[R] slow computation of functions over large datasets
On Aug 3, 2011, at 9:25 AM, Caroline Faisst wrote:

> I'm computing the total value of an order from the price of the order
> items using a "for" loop and the "ifelse" function.

Ouch. Schools really should stop teaching SAS and BASIC as a first language.

> I do this on a large dataframe (close to 1m lines). The computation of
> this function is painfully slow: in 1min only about 90 rows are calculated.
>
> # small dataset: function performs well
> [small-dataset example snipped]

Try instead using 'ave' to calculate a cumulative 'sum' within "orderID":

exampledata$orderAmt <- with(exampledata, ave(itemPrice, orderID, FUN=cumsum) )

I assure you this will be more reproducible, faster, and understandable.

> # large dataset:

"medium" dataset really. Barely nudges the RAM dial on my machine.

> the very same computational task takes much longer
> [large-dataset example snipped]

> system.time( exampledata2$orderAmt <- with(exampledata2,
+                                            ave(itemPrice, orderID, FUN=cumsum) ) )
   user  system elapsed
 35.106   0.811  35.822

On a three year-old machine. Not as fast as I expected, but not long enough to require refilling the coffee cup either.

-- David.

> Does someone know a way to increase the speed?

--
David Winsemius, MD
West Hartford, CT
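As a quick cross-check on the small example from the original post, the ave() call reproduces the loop's running totals row for row:

exampledata <- data.frame(orderID   = c(1,1,1,2,2,3,3,3,4),
                          itemPrice = c(10,17,9,12,25,10,1,9,7))
exampledata$orderAmt <- with(exampledata, ave(itemPrice, orderID, FUN = cumsum))
exampledata$orderAmt
# [1] 10 27 36 12 37 10 11 20  7   -- the same values the for/ifelse loop produces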
This takes about 2 secs for 1M rows:

> n <- 1000000
> exampledata <- data.frame(orderID = sample(floor(n / 5), n, replace = TRUE),
+                           itemPrice = rpois(n, 10))
> require(data.table)
> # convert to data.table
> ed.dt <- data.table(exampledata)
> system.time(result <- ed.dt[
+   , list(total = sum(itemPrice))
+   , by = orderID
+ ])
   user  system elapsed
   1.30    0.05    1.34
> str(result)
Classes 'data.table' and 'data.frame': 198708 obs. of 2 variables:
 $ orderID: int 1 2 3 4 5 6 8 9 10 11 ...
 $ total  : num 49 37 72 92 50 76 34 22 65 39 ...
> head(result)
     orderID total
[1,]       1    49
[2,]       2    37
[3,]       3    72
[4,]       4    92
[5,]       5    50
[6,]       6    76

On Wed, Aug 3, 2011 at 9:25 AM, Caroline Faisst <caroline.faisst at gmail.com> wrote:
> [original message snipped]

--
Jim Holtman
Data Munger Guru

What is the problem that you are trying to solve?
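Note that this returns one total per order rather than the running orderAmount column from the original post. A sketch of the cumulative version, reusing the same ed.dt and assuming a reasonably recent data.table (a grouped j expression returning a list):

# Running per-order amount, analogous to the original loop's orderAmount:
result2 <- ed.dt[
  , list(itemPrice, orderAmount = cumsum(itemPrice))
  , by = orderID
]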
Hi all,

After reading this interesting discussion I delved a bit deeper into the subject matter. The following snippet of code (see the end of my mail) compares three ways of performing this task: ddply, ave, and one yet unmentioned option, the data.table package. The code generates mock datasets which vary in size and in the number of levels of the grouping factor. The results look like this (there is also a ggplot plot in the script that summarises the table):

> res
   datsize noClasses   tave  tddply tdata.table
...note that I cut out part of the table for readability...
17   1e+07        10  9.160   3.500       1.064
18   1e+07        50 10.126   4.483       1.364
19   1e+07       100 10.485   5.016       1.407
20   1e+07       200 10.680   6.901       1.435
21   1e+07       500 10.801  12.569       1.474
22   1e+07      1000 10.923  21.001       1.540
23   1e+07      2500 11.514  51.020       1.622
24   1e+07     10000 12.158 182.752       1.737

It is clear that the option of using data.table is by far the fastest of the three and scales quite nicely with the number of factor levels, in contrast to ddply. It is also faster than ave by up to a factor of 10.

cheers,
Paul

library(ggplot2)
library(plyr)      # ddply; also attached by ggplot2 0.8.9, made explicit here
library(reshape)   # melt; also attached by ggplot2 0.8.9, made explicit here
library(data.table)
theme_set(theme_bw())

datsize = c(10e4, 10e5, 10e6)
noClasses = c(10, 50, 100, 200, 500, 1000, 2500, 10e3)
comb = expand.grid(datsize = datsize, noClasses = noClasses)
res = ddply(comb, .(datsize, noClasses), function(x) {
  expdata = data.frame(value = runif(x$datsize),
                       cat = round(runif(x$datsize, min = 0, max = x$noClasses)))
  expdataDT = data.table(expdata)
  t1 = system.time(res1 <- with(expdata, ave(value, cat, FUN = sum)))
  t2 = system.time(res2 <- ddply(expdata, .(cat), summarise, val = sum(value)))
  t3 = system.time(res3 <- expdataDT[, sum(value), by = cat])
  return(data.frame(tave = t1[3], tddply = t2[3], tdata.table = t3[3]))
}, .progress = 'text')
res
ggplot(aes(x = noClasses, y = log(value), color = variable),
       data = melt(res, id.vars = c("datsize", "noClasses"))) +
  facet_wrap(~ datsize) + geom_line()

> sessionInfo()
R version 2.13.0 (2011-04-13)
Platform: i686-pc-linux-gnu (32-bit)

locale:
 [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C
 [3] LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8
 [5] LC_MONETARY=C              LC_MESSAGES=en_US.UTF-8
 [7] LC_PAPER=en_US.UTF-8       LC_NAME=C
 [9] LC_ADDRESS=C               LC_TELEPHONE=C
[11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C

attached base packages:
[1] grid      stats     graphics  grDevices utils     datasets  methods
[8] base

other attached packages:
[1] data.table_1.6.3 ggplot2_0.8.9    proto_0.3-8      reshape_0.8.4
[5] plyr_1.5.2       fortunes_1.4-1

loaded via a namespace (and not attached):
[1] digest_0.4.2 tcltk_2.13.0 tools_2.13.0

On 08/03/2011 01:25 PM, Caroline Faisst wrote:
> [original message snipped]
--
Paul Hiemstra, Ph.D.
Global Climate Division
Royal Netherlands Meteorological Institute (KNMI)
Wilhelminalaan 10 | 3732 GK | De Bilt | Kamer B 3.39
P.O. Box 201 | 3730 AE | De Bilt
tel: +31 30 2206 494
http://intamap.geo.uu.nl/~paul
http://nl.linkedin.com/pub/paul-hiemstra/20/30b/770