thr3ads.net - R help - [R] aggregate slow with variables of type 'dates'

Christoph Lehmann

2005-Apr-15 23:22 UTC

[R] aggregate slow with variables of type 'dates' - how to solve

Dear all
I use aggregate() with variables of type numeric and dates. For numeric
variables, functions such as sum() are very fast, but similarly simple
functions such as min() are much slower for variables of type 'dates'.
The difference grows with the number of distinct values in the 'id'
variable; see this sample code:

library(chron)  # provides dates() and chron()

dts <- dates(c("02/27/92", "02/27/92", "01/14/92",
               "02/28/92", "02/01/92"))
ntimes <- 700000
dts <- data.frame(rep(c(1:40), ntimes/8),
                  chron(rep(dts, ntimes), format = c(dates = "m/d/y")),
                  rep(c(0.123, 0.245, 0.423, 0.634, 0.256), ntimes))
names(dts) <- c("id", "date", "tbs")


date()
dat.1st <- aggregate(dts$date, list(id = dts$id), min)$x
dat.1st <- chron(dat.1st, format = c(dates = "m/d/y"))     
dat.1st
date() #82 seconds


date()
tbs.s <- aggregate(as.numeric(dts$tbs),list(id = dts$id), sum)
tbs.s
date() #17 seconds

Is this a problem with the 'dates' data type? If so, is there any way
to solve it? For huge data sets this can be a problem.

As I mentioned, if the variable 'id' has just 5 levels, the two times
are roughly the same, but with the 40 different ids we get this big
difference.

thanks a lot

Christoph

--

Gabor Grothendieck

2005-Apr-16 04:13 UTC

[R] aggregate slow with variables of type 'dates' - how to solve

On 4/15/05, Christoph Lehmann <christoph.lehmann at gmx.ch> wrote:
> Dear all
> I use aggregate with variables of type numeric and dates. For type numeric
> functions, such as sum() are very fast, but similar simple functions, such
> as min() are much slower for the variables of type 'dates'. The difference
> gets bigger the larger the 'id' var is - but see this sample code:
>
> dts <- dates(c("02/27/92", "02/27/92", "01/14/92",
>                "02/28/92", "02/01/92"))
> ntimes <- 700000
> dts <- data.frame(rep(c(1:40), ntimes/8),
>                   chron(rep(dts, ntimes), format = c(dates = "m/d/y")),
>                   rep(c(0.123, 0.245, 0.423, 0.634, 0.256), ntimes))
> names(dts) <- c("id", "date", "tbs")
>
> date()
> dat.1st <- aggregate(dts$date, list(id = dts$id), min)$x
> dat.1st <- chron(dat.1st, format = c(dates = "m/d/y"))
> dat.1st
> date() #82 seconds
>
> date()
> tbs.s <- aggregate(as.numeric(dts$tbs),list(id = dts$id), sum)
> tbs.s
> date() #17 seconds
>
> --- is it a problem of data-type 'dates' ? if yes, is there any solution
> to solve this, since for huge data-sets, this can be a problem...
>
> as I mentioned, e.g. if we have for variable 'id' eg just 5 levels, the
> two times are roughly the same, but with the 40 different ids, we have
> this big difference

Just convert the dates to numeric first. You are converting them back
anyway.

> system.time({
+ dat.1st <- chron(aggregate(dts$date, list(id = dts$id), min)$x)
+ }, TRUE)
[1] 0.86 0.00 0.86   NA   NA

> system.time({
+ dat.1st.2 <- chron(aggregate(as.numeric(dts$date), list(id = dts$id), min)$x)
+ }, TRUE)
[1] 0.12 0.00 0.12   NA   NA

> identical(dat.1st, dat.1st.2)
[1] TRUE
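
The advice above (strip the date class, aggregate plain numbers, then restore the class) can be sketched without the chron package. The example below is only an illustration under stated assumptions: it uses base R's Date class instead of chron so that it runs without extra packages, and the data set, its size, and the names dat, first_direct and first_num are made up for the sketch rather than taken from the thread.

```r
# Illustrative data only: 1000 rows, 40 ids, random dates in early 1992.
set.seed(1)
n   <- 1000
dat <- data.frame(
  id   = rep(1:40, length.out = n),
  date = as.Date("1992-01-01") + sample(0:59, n, replace = TRUE)
)

# Direct approach: min() is dispatched on the Date class for every group.
first_direct <- aggregate(dat$date, list(id = dat$id), min)$x

# Numeric approach: drop the class, aggregate plain numbers,
# then restore the Date class once at the end.
first_num <- aggregate(as.numeric(dat$date), list(id = dat$id), min)$x
first_num <- as.Date(first_num, origin = "1970-01-01")

# Both routes should describe the same per-id earliest dates.
same <- identical(as.numeric(first_direct), as.numeric(first_num))
```

Wrapping each aggregate() call in system.time(), as in the reply above, is one way to check whether the numeric route is actually faster on a given data set and R version; the size of the gain is not something this sketch asserts.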
