Caroline Faisst
2011-Aug-03 13:25 UTC
[R] slow computation of functions over large datasets
Hello there,

I'm computing the total value of an order from the price of the order items using a "for" loop and the "ifelse" function. I do this on a large dataframe (close to 1m lines). The computation of this function is painfully slow: in 1 min only about 90 rows are calculated.

The computation time taken for a given number of rows increases with the size of the dataset, see the example with my function below:

# small dataset: function performs well
exampledata <- data.frame(orderID = c(1,1,1,2,2,3,3,3,4),
                          itemPrice = c(10,17,9,12,25,10,1,9,7))
exampledata[1, "orderAmount"] <- exampledata[1, "itemPrice"]
system.time(for (i in 2:length(exampledata[,1])) {
  exampledata[i, "orderAmount"] <- ifelse(exampledata[i, "orderID"] == exampledata[i-1, "orderID"],
    exampledata[i-1, "orderAmount"] + exampledata[i, "itemPrice"],
    exampledata[i, "itemPrice"])
})

# large dataset: the very same computational task takes much longer
exampledata2 <- data.frame(orderID = c(1,1,1,2,2,3,3,3,4,5:2000000),
                           itemPrice = c(10,17,9,12,25,10,1,9,7,25:2000020))
exampledata2[1, "orderAmount"] <- exampledata2[1, "itemPrice"]
system.time(for (i in 2:9) {
  exampledata2[i, "orderAmount"] <- ifelse(exampledata2[i, "orderID"] == exampledata2[i-1, "orderID"],
    exampledata2[i-1, "orderAmount"] + exampledata2[i, "itemPrice"],
    exampledata2[i, "itemPrice"])
})

Does someone know a way to increase the speed?

Thank you very much!

Caroline
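The growth in per-row cost comes from each exampledata2[i, "orderAmount"] <- ... assignment copying the whole data frame. A minimal sketch of the same running total kept on plain vectors, which avoids the repeated copies (names follow the example above; this is an illustration, not code from the original post):

# Extract the columns once, loop over plain vectors, write back a single time.
orderID     <- exampledata2$orderID
itemPrice   <- exampledata2$itemPrice
orderAmount <- itemPrice                  # default: each row starts a new order
for (i in 2:length(orderID)) {
  if (orderID[i] == orderID[i - 1]) {     # same order as the previous row:
    orderAmount[i] <- orderAmount[i - 1] + itemPrice[i]   # accumulate
  }
}
exampledata2$orderAmount <- orderAmount   # one data-frame assignment at the end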
ONKELINX, Thierry
2011-Aug-03 13:59 UTC
[R] slow computation of functions over large datasets
Dear Caroline,

Here is a faster and more elegant solution:

> n <- 10000
> exampledata <- data.frame(orderID = sample(floor(n / 5), n, replace = TRUE),
+                           itemPrice = rpois(n, 10))
> library(plyr)
> system.time({
+   ddply(exampledata, .(orderID), function(x) {
+     data.frame(itemPrice = x$itemPrice, orderAmount = cumsum(x$itemPrice))
+   })
+ })
   user  system elapsed
   1.67    0.00    1.69
> exampledata[1, "orderAmount"] <- exampledata[1, "itemPrice"]
> system.time(for (i in 2:length(exampledata[,1]))
+   {exampledata[i, "orderAmount"] <- ifelse(exampledata[i, "orderID"] == exampledata[i-1, "orderID"],
+     exampledata[i-1, "orderAmount"] + exampledata[i, "itemPrice"],
+     exampledata[i, "itemPrice"])})
   user  system elapsed
  11.94    0.02   11.97

Best regards,

Thierry

> -----Original Message-----
> From: r-help-bounces at r-project.org [mailto:r-help-bounces at r-project.org]
> On behalf of Caroline Faisst
> Sent: Wednesday, 3 August 2011 15:26
> To: r-help at r-project.org
> Subject: [R] slow computation of functions over large datasets
>
> [original message snipped]
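If only the final total per order is needed, rather than the running amount, plyr's summarise helper shortens this further. A minimal sketch reusing the simulated exampledata above (orderTotal is just an illustrative column name):

library(plyr)
# One row per orderID, holding that order's total.
orderTotals <- ddply(exampledata, .(orderID), summarise,
                     orderTotal = sum(itemPrice))
head(orderTotals)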
David Winsemius
2011-Aug-03 14:26 UTC
[R] slow computation of functions over large datasets
On Aug 3, 2011, at 9:25 AM, Caroline Faisst wrote:

> I'm computing the total value of an order from the price of the order
> items using a "for" loop and the "ifelse" function.

Ouch. Schools really should stop teaching SAS and BASIC as a first language.

> I do this on a large dataframe (close to 1m lines). The computation of
> this function is painfully slow: in 1min only about 90 rows are calculated.
>
> # small dataset: function performs well
> [small-dataset example snipped]

Try instead using 'ave' to calculate a cumulative 'sum' within "orderID":

exampledata$orderAmt <- with(exampledata, ave(itemPrice, orderID, FUN=cumsum) )

I assure you this will be more reproducible, faster, and understandable.

> # large dataset:

"medium" dataset really. Barely nudges the RAM dial on my machine.

> the very same computational task takes much longer
> [large-dataset example snipped]

> system.time( exampledata2$orderAmt <- with(exampledata2,
+                                            ave(itemPrice, orderID, FUN=cumsum) ) )
   user  system elapsed
 35.106   0.811  35.822

On a three year-old machine. Not as fast as I expected, but not long enough to require refilling the coffee cup either.

-- David.

> Does someone know a way to increase the speed?

--
David Winsemius, MD
West Hartford, CT
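As a quick cross-check on the small example from the original post, the ave() call reproduces the loop's running totals row for row:

exampledata <- data.frame(orderID   = c(1,1,1,2,2,3,3,3,4),
                          itemPrice = c(10,17,9,12,25,10,1,9,7))
exampledata$orderAmt <- with(exampledata, ave(itemPrice, orderID, FUN = cumsum))
exampledata$orderAmt
# [1] 10 27 36 12 37 10 11 20  7   -- the same values the for/ifelse loop produces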
This takes about 2 secs for 1M rows:

> n <- 1000000
> exampledata <- data.frame(orderID = sample(floor(n / 5), n, replace = TRUE),
+                           itemPrice = rpois(n, 10))
> require(data.table)
> # convert to data.table
> ed.dt <- data.table(exampledata)
> system.time(result <- ed.dt[
+   , list(total = sum(itemPrice))
+   , by = orderID
+ ])
   user  system elapsed
   1.30    0.05    1.34
> str(result)
Classes 'data.table' and 'data.frame': 198708 obs. of 2 variables:
 $ orderID: int 1 2 3 4 5 6 8 9 10 11 ...
 $ total  : num 49 37 72 92 50 76 34 22 65 39 ...
> head(result)
     orderID total
[1,]       1    49
[2,]       2    37
[3,]       3    72
[4,]       4    92
[5,]       5    50
[6,]       6    76

On Wed, Aug 3, 2011 at 9:25 AM, Caroline Faisst <caroline.faisst at gmail.com> wrote:
> [original message snipped]

--
Jim Holtman
Data Munger Guru

What is the problem that you are trying to solve?
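Note that this returns one total per order rather than the running orderAmount column from the original post. A sketch of the cumulative version, reusing the same ed.dt and assuming a reasonably recent data.table (a grouped j expression returning a list):

# Running per-order amount, analogous to the original loop's orderAmount:
result2 <- ed.dt[
  , list(itemPrice, orderAmount = cumsum(itemPrice))
  , by = orderID
]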
Hi all,

After reading this interesting discussion I delved a bit deeper into the subject matter. The following snippet of code (see the end of my mail) compares three ways of performing this task: ddply, ave, and one yet unmentioned option, the data.table package. The code generates mock datasets which vary in size and in the number of levels of the grouping factor. The results look like this (there is also a ggplot plot in the script that summarises the table):

> res
   datsize noClasses   tave  tddply tdata.table
...note that I cut out part of the table for readability...
17   1e+07        10  9.160   3.500       1.064
18   1e+07        50 10.126   4.483       1.364
19   1e+07       100 10.485   5.016       1.407
20   1e+07       200 10.680   6.901       1.435
21   1e+07       500 10.801  12.569       1.474
22   1e+07      1000 10.923  21.001       1.540
23   1e+07      2500 11.514  51.020       1.622
24   1e+07     10000 12.158 182.752       1.737

It is clear that the option of using data.table is by far the fastest of the three and scales quite nicely with the number of factor levels, in contrast to ddply. It is also faster than ave by up to a factor of 10.

cheers,
Paul

library(ggplot2)
library(plyr)      # ddply; also attached by ggplot2 0.8.9, made explicit here
library(reshape)   # melt; also attached by ggplot2 0.8.9, made explicit here
library(data.table)
theme_set(theme_bw())

datsize = c(10e4, 10e5, 10e6)
noClasses = c(10, 50, 100, 200, 500, 1000, 2500, 10e3)
comb = expand.grid(datsize = datsize, noClasses = noClasses)
res = ddply(comb, .(datsize, noClasses), function(x) {
  expdata = data.frame(value = runif(x$datsize),
                       cat = round(runif(x$datsize, min = 0, max = x$noClasses)))
  expdataDT = data.table(expdata)
  t1 = system.time(res1 <- with(expdata, ave(value, cat, FUN = sum)))
  t2 = system.time(res2 <- ddply(expdata, .(cat), summarise, val = sum(value)))
  t3 = system.time(res3 <- expdataDT[, sum(value), by = cat])
  return(data.frame(tave = t1[3], tddply = t2[3], tdata.table = t3[3]))
}, .progress = 'text')
res
ggplot(aes(x = noClasses, y = log(value), color = variable),
       data = melt(res, id.vars = c("datsize", "noClasses"))) +
  facet_wrap(~ datsize) + geom_line()

> sessionInfo()
R version 2.13.0 (2011-04-13)
Platform: i686-pc-linux-gnu (32-bit)

locale:
 [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C
 [3] LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8
 [5] LC_MONETARY=C              LC_MESSAGES=en_US.UTF-8
 [7] LC_PAPER=en_US.UTF-8       LC_NAME=C
 [9] LC_ADDRESS=C               LC_TELEPHONE=C
[11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C

attached base packages:
[1] grid      stats     graphics  grDevices utils     datasets  methods
[8] base

other attached packages:
[1] data.table_1.6.3 ggplot2_0.8.9    proto_0.3-8      reshape_0.8.4
[5] plyr_1.5.2       fortunes_1.4-1

loaded via a namespace (and not attached):
[1] digest_0.4.2 tcltk_2.13.0 tools_2.13.0

On 08/03/2011 01:25 PM, Caroline Faisst wrote:
> [original message snipped]
--
Paul Hiemstra, Ph.D.
Global Climate Division
Royal Netherlands Meteorological Institute (KNMI)
Wilhelminalaan 10 | 3732 GK | De Bilt | Kamer B 3.39
P.O. Box 201 | 3730 AE | De Bilt
tel: +31 30 2206 494
http://intamap.geo.uu.nl/~paul
http://nl.linkedin.com/pub/paul-hiemstra/20/30b/770