Christoph Lehmann
2005-Apr-15 23:22 UTC
[R] aggregate slow with variables of type 'dates' - how to solve
Dear all

I use aggregate with variables of type numeric and dates. For type numeric functions, such as sum() are very fast, but similar simple functions, such as min() are much slower for the variables of type 'dates'. The difference gets bigger the larger the 'id' var is - but see this sample code:

dts <- dates(c("02/27/92", "02/27/92", "01/14/92",
               "02/28/92", "02/01/92"))
ntimes <- 700000
dts <- data.frame(rep(c(1:40), ntimes/8),
                  chron(rep(dts, ntimes), format = c(dates = "m/d/y")),
                  rep(c(0.123, 0.245, 0.423, 0.634, 0.256), ntimes))
names(dts) <- c("id", "date", "tbs")

date()
dat.1st <- aggregate(dts$date, list(id = dts$id), min)$x
dat.1st <- chron(dat.1st, format = c(dates = "m/d/y"))
dat.1st
date() #82 seconds

date()
tbs.s <- aggregate(as.numeric(dts$tbs),list(id = dts$id), sum)
tbs.s
date() #17 seconds

--- is it a problem of data-type 'dates' ? if yes, is there any solution to solve this, since for huge data-sets, this can be a problem...

as I mentioned, e.g. if we have for variable 'id' eg just 5 levels, the two times are roughly the same, but with the 40 different ids, we have this big difference

thanks a lot

Christoph
Gabor Grothendieck
2005-Apr-16 04:13 UTC
[R] aggregate slow with variables of type 'dates' - how to solve
On 4/15/05, Christoph Lehmann <christoph.lehmann at gmx.ch> wrote:
> Dear all
> I use aggregate with variables of type numeric and dates. For type numeric
> functions, such as sum() are very fast, but similar simple functions, such
> as min() are much slower for the variables of type 'dates'. The difference
> gets bigger the larger the 'id' var is - but see this sample code:
>
> dts <- dates(c("02/27/92", "02/27/92", "01/14/92",
> "02/28/92", "02/01/92"))
> ntimes <- 700000
> dts <- data.frame(rep(c(1:40), ntimes/8),
> chron(rep(dts, ntimes), format = c(dates = "m/d/y")),
> rep(c(0.123, 0.245, 0.423, 0.634, 0.256), ntimes))
> names(dts) <- c("id", "date", "tbs")
>
> date()
> dat.1st <- aggregate(dts$date, list(id = dts$id), min)$x
> dat.1st <- chron(dat.1st, format = c(dates = "m/d/y"))
> dat.1st
> date() #82 seconds
>
> date()
> tbs.s <- aggregate(as.numeric(dts$tbs),list(id = dts$id), sum)
> tbs.s
> date() #17 seconds
>
> --- is it a problem of data-type 'dates' ? if yes, is there any solution
> to solve this, since for huge data-sets, this can be a problem...
>
> as I mentioned, e.g. if we have for variable 'id' eg just 5 levels, the
> two times are roughly the same, but with the 40 different ids, we have
> this big difference

Just convert the dates to numeric first. You are converting them back anyways.

> system.time({
+ dat.1st <- chron(aggregate(dts$date, list(id = dts$id), min)$x)
+ }, TRUE)
[1] 0.86 0.00 0.86 NA NA
> system.time({
+ dat.1st.2 <- chron(aggregate(as.numeric(dts$date), list(id = dts$id), min)$x)
+ }, TRUE)
[1] 0.12 0.00 0.12 NA NA
>
> identical(dat.1st, dat.1st.2)
[1] TRUE
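
For readers without the chron package, the same idea can be sketched with base R's Date class as a stand-in for chron 'dates'. This is an illustration, not the code from the thread. The data frame `df`, its size, and the names `first_num` and `first_cls` are made up for the example. The thread does not say why the classed column is slower; a plausible guess is that each group goes through the class's methods, which is the overhead the numeric conversion avoids.

```r
# Hypothetical data: 40 ids, dates stored as base R Date (stand-in for chron dates)
set.seed(1)
n <- 1e5
df <- data.frame(
  id   = rep(1:40, length.out = n),
  date = as.Date("1992-01-01") + sample(0:59, n, replace = TRUE)
)

# Aggregate the plain day counts, then restore the Date class once at the end
first_num <- aggregate(as.numeric(df$date), list(id = df$id), min)
first_num$x <- as.Date(first_num$x, origin = "1970-01-01")

# Reference: aggregate the classed column directly
first_cls <- aggregate(df$date, list(id = df$id), min)

# Both routes should give the same earliest date per id
stopifnot(all(as.numeric(first_num$x) == as.numeric(first_cls$x)))
```

Wrapping each aggregate() call in system.time() shows whether the conversion pays off for a given data size; the gain will depend on the R version and the number of groups.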