Tahir Butt
2009-Nov-19 22:19 UTC
[R] Performance of 'by' and 'ddply' on a large data frame
I've only recently started using R. One of the problems I come up against is that after having extracted a large dataset (>5M rows) out of a database, I realize I need another variable. In this case I have a data frame with dates. I want to find the minimum date for each value of x1 and add that minimum date to my data frame.

> randomdf <- function(p) {
+   data.frame(x1 = sample(1:10^4, 10^p, replace = TRUE),
+              x2 = sample(seq.Date(Sys.Date() - 356*3, Sys.Date(), by = "day"),
+                          10^p, replace = TRUE),
+              y1 = sample(1:100, 10^p, replace = TRUE))
+ }
> testby <- function(p) {
+   df <- randomdf(p)
+   system.time(by(df, df$x1, function(dfi) { min(dfi$x2) }))
+ }
> lapply(c(1, 2, 3, 4, 5), testby)
[[1]]
   user  system elapsed
  0.006   0.000   0.006

[[2]]
   user  system elapsed
  0.024   0.000   0.025

[[3]]
   user  system elapsed
  0.233   0.000   0.234

[[4]]
   user  system elapsed
  1.996   0.026   2.022

[[5]]
   user  system elapsed
 11.030   0.000  11.032

Strangely enough, and I'm not sure why, the result of by() with min is not Date objects but integers representing days from an origin. Is there a min function that would return me a date instead of an integer? Or is this a result of using by()?

I also wanted to see how ddply compares.

> testddply <- function(p) {
+   pdf <- randomdf(p)
+   system.time(ddply(pdf, .(x1), function(df) { data.frame(min(df$x2)) }))
+ }
> lapply(c(1, 2, 3, 4, 5), testddply)
[[1]]
   user  system elapsed
  0.020   0.000   0.021

[[2]]
   user  system elapsed
  0.119   0.000   0.119

[[3]]
   user  system elapsed
  1.008   0.000   1.008

[[4]]
   user  system elapsed
  8.425   0.001   8.428

[[5]]
   user  system elapsed
 23.070   0.000  23.075

Once the data frame gets above 1M rows, the timings are a bit too long (on a previous run the user time went up to 8000s). This seems quite a bit slower than I expected. Maybe there's a better and faster way to add variables to a data frame that are derived using some aggregation.

Also, ddply seems to take about twice as long as by. Are these two operations not equivalent?

Thanks,
Tahir
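A likely explanation for the integers: min() on a Date vector does return a Date, but by() collects its results via tapply(), which (with the default simplify = TRUE) unlists scalar results into a plain array and drops the "Date" class. The class can be restored with as.Date() and R's standard origin of 1970-01-01. A minimal sketch with toy data rather than the benchmark data:

> x <- Sys.Date() + 0:9                    # ten consecutive dates
> min(x)                                   # still a Date: min() keeps the class
> m <- tapply(x, rep(1:2, each = 5), min)  # per-group minima
> m                                        # bare day counts: tapply() dropped the class
> as.Date(m, origin = "1970-01-01")        # back to Dates (days since R's Date epoch)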
Tahir Butt
2009-Nov-20 20:04 UTC
[R] Performance of 'by' and 'ddply' on a large data frame
A faster solution using tapply was sent to me via email:

testtapply <- function(p) {
  df <- randomdf(p)
  system.time({
    res <- tapply(df$x2, df$x1, min)
    res <- as.Date(res, origin = as.Date("1970-01-01"))
    df$mindate <- res[as.character(df$x1)]
  })
}

Thanks Phil!

Tahir

On Thu, Nov 19, 2009 at 5:19 PM, Tahir Butt <tahir.butt at gmail.com> wrote:
> [snip]
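For the broader goal of adding a group-wise aggregate as a new column, base R's ave() may also be worth timing; a sketch, assuming the same randomdf() data and untested at the sizes above. It computes the per-row group minimum in one step, and because it assigns results back into a copy of the input vector, the Date class should survive without an explicit as.Date():

> df <- randomdf(4)
> df$mindate <- ave(df$x2, df$x1, FUN = min)  # per-row minimum of x2 within each x1 group
> class(df$mindate)                           # still "Date"

Whether it beats the tapply() approach past 10^5 rows is something system.time() would have to settle.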