Sklyar, Oleg (MI London)
2008-Apr-10 13:51 UTC
[Rd] ISOdate/ISOdatetime performance suggestions, other date/time questions
Dear list,

Working with date/times, I have run into the problem that ISOdate and ISOdatetime are too slow on large vectors of data. I was surprised until I looked at the implementation and the man page: "ISOdatetime and ISOdate are convenience wrappers for strptime". In other words, they first convert the data to a character representation in order to create a POSIXlt object, which is then converted to POSIXct. And POSIXct, i.e. the number of seconds since 1970, is really what one wants most often.

Obviously this is not a bug, but it is a suboptimal implementation of a pretty important function, as the example below shows.

Now my questions are:

- Is there any chance that the implementation can be changed in R (as suggested below; tz still needs to be added)?
- Is there a better pure-R (no C) way than the one shown below to convert to POSIXct?
- Any idea why, in the example below, fooling R into thinking a list is POSIXlt is faster than just creating a POSIXlt by rep or seq? It's not a huge difference, but still. Unfortunately seq on POSIXlt returns POSIXct anyway, so the class of 'origin' is set correctly.
- Any idea why seq is faster than rep when applied to POSIXct? There is hardly anything simpler than operating on double values...

Thanks in advance for your comments,
Oleg

It's common in finance to work with time stamps stored in a form like %Y%m%d.%H%M%OS, e.g. 20080410.140444 for now; this is what 'ts' in the example below is:

ts = 1e4*trunc(rnorm(50000,2008,2)) + 1e2*trunc(runif(50000,1,12)) +
    trunc(runif(50000,1,28)) + 1e-2*trunc(runif(50000,1,24)) +
    1e-4*trunc(runif(50000,1,60)) + 1e-6*runif(50000,1,60)

posix.viaISOdate = function(x) {
    date = trunc(x@.Data)
    time = round(1e6*x@.Data%%1,2)
    rtime = round(time)
    z = list(sec=rtime%%1e2 + time%%1,
             min=(rtime%/%1e2)%%1e2,
             hour=rtime%/%1e4,
             mday=date%%100,
             mon=(date%/%100)%%100,
             year=date%/%10000)
    ISOdate(z$year,z$mon,z$mday,z$hour,z$min,z$sec) # to POSIXct
}

## This is just a test of which way is faster to create a long POSIXlt object,
## before the other implementations are given

origin = as.POSIXct("1970-01-01")

mean(sapply(1:25,function(i) system.time(
    as.POSIXlt(rep(origin,600000))
))[1,])
# [1] 0.3972

mean(sapply(1:25,function(i) system.time(
    as.POSIXlt(seq(origin, origin, length.out=600000))
))[1,])
# [1] 0.30528

posix.viaPOSIXlt1 = function(x) {
    origin = as.POSIXct("1970-01-01")
    z = as.POSIXlt(seq(origin, origin, length.out=length(x)))
    date = trunc(x@.Data)
    time = round(1e6*x@.Data%%1,2)
    rtime = round(time)
    z$sec=rtime%%1e2 + time%%1
    z$min=(rtime%/%1e2)%%1e2
    z$hour=rtime%/%1e4
    z$mday=date%%100
    z$mon=(date%/%100)%%100-1
    z$year=date%/%10000-1900
    as.double(z) # to POSIXct
}

posix.vialist = function(x) {
    date = trunc(x@.Data)
    time = round(1e6*x@.Data%%1,2)
    rtime = round(time)
    na = rep(0.0,length(x))
    z = list(sec=rtime%%1e2 + time%%1,
             min=(rtime%/%1e2)%%1e2,
             hour=rtime%/%1e4,
             mday=date%%100,
             mon=(date%/%100)%%100-1,
             year=date%/%10000-1900,
             wday=na,yday=na,isdst=na)
    class(z) = c("POSIXt","POSIXlt")
    as.double(z) # to POSIXct
}

v1 = posix.viaISOdate(ts)
v2 = posix.viaPOSIXlt1(ts)
v3 = posix.vialist(ts)

all(v1==v2 & v2==v3)
# [1] TRUE

mean(sapply(1:25,function(i) system.time(
    system.time(posix.viaISOdate(ts))
))[1,])
# [1] 1.54244

mean(sapply(1:25,function(i) system.time(
    system.time(posix.viaPOSIXlt1(ts))
))[1,])
# [1] 0.37624

mean(sapply(1:25,function(i) system.time(
    system.time(posix.vialist(ts))
))[1,])
# [1] 0.35488

sessionInfo()
R version 2.6.2 (2008-02-08)
x86_64-unknown-linux-gnu

locale:
LC_CTYPE=en_GB.UTF-8;LC_NUMERIC=C;LC_TIME=en_GB.UTF-8;LC_COLLATE=C;LC_MONETARY=en_GB.UTF-8;LC_MESSAGES=en_GB.UTF-8;LC_PAPER=en_GB.UTF-8;LC_NAME=C;LC_ADDRESS=C;LC_TELEPHONE=C;LC_MEASUREMENT=en_GB.UTF-8;LC_IDENTIFICATION=C

attached base packages:
[1] stats graphics grDevices utils datasets methods base

loaded via a namespace (and not attached):
[1] rcompgen_0.1-17

Dr Oleg Sklyar
Technology Group
Man Investments Ltd
+44 (0)20 7144 3803
osklyar at maninvestments.com

**********************************************************************
The contents of this email are for the named addressee(s) only. It contains information which may be confidential and privileged. If you are not the intended recipient, please notify the sender immediately, destroy this email and any attachments and do not otherwise disclose or use them. Email transmission is not a secure method of communication and Man Investments cannot accept responsibility for the completeness or accuracy of this email or any attachments. Whilst Man Investments makes every effort to keep its network free from viruses, it does not accept responsibility for any computer virus which might be transferred by way of this email or any attachments. This email does not constitute a request, offer, recommendation or solicitation of any kind to buy, subscribe, sell or redeem any investment instruments or to perform other such transactions of any kind. Man Investments reserves the right to monitor, record and retain all electronic communications through its network to ensure the integrity of its systems, for record keeping and regulatory purposes.

Visit us at: maninvestments.com
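On the question of a pure-R route that avoids both strptime and POSIXlt, one possibility is to compute the seconds since 1970 by plain vectorised arithmetic. The following is only a sketch under stated assumptions: it treats every input as UTC (no time zones, no DST), it uses the proleptic Gregorian calendar, and the name utc_seconds is hypothetical, not a base R function. It takes the year, month, day, hour, minute and second components that the functions above already extract from the time stamp.

```r
# Sketch only: seconds since 1970-01-01 00:00:00 UTC by integer arithmetic.
# Assumptions: inputs are UTC, proleptic Gregorian calendar, no DST handling.
# 'utc_seconds' is a hypothetical helper name, not part of base R.
utc_seconds <- function(year, month, day, hour = 0, min = 0, sec = 0) {
  y   <- year - (month <= 2)             # count Jan/Feb as part of the previous year
  era <- y %/% 400                       # 400-year Gregorian cycle (floor division)
  yoe <- y - era * 400                   # year within the cycle, 0..399
  mp  <- (month + 9) %% 12               # March = 0, ..., February = 11
  doy <- (153 * mp + 2) %/% 5 + day - 1  # day within the March-based year
  doe <- yoe * 365 + yoe %/% 4 - yoe %/% 100 + doy  # day within the cycle
  days <- era * 146097 + doe - 719468    # days since 1970-01-01
  secs <- 86400 * days + 3600 * hour + 60 * min + sec
  structure(secs, class = c("POSIXct", "POSIXt"), tzone = "UTC")
}

# Usage: compare against the strptime-based wrapper on a few dates.
yy <- c(1970, 2000, 2008); mm <- c(1, 3, 4); dd <- c(1, 1, 10)
hh <- c(0, 0, 14); mi <- c(0, 0, 4); ss <- c(0, 0, 44)
x   <- utc_seconds(yy, mm, dd, hh, mi, ss)
ref <- ISOdatetime(yy, mm, dd, hh, mi, ss, tz = "UTC")
all(as.numeric(x) == as.numeric(ref))
```

Because nothing here goes through character strings or a POSIXlt list, the cost should be a handful of vector operations, but no timings are claimed for it; it would need to be benchmarked alongside the three functions above.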
Sklyar, Oleg (MI London)
2008-Apr-10 15:32 UTC
[Rd] ISOdate/ISOdatetime performance suggestions, other date/time questions
Small correction:

# to ensure 0, although it will be overwritten when assigning hour
origin = as.POSIXct("1970-01-01") - as.numeric(as.POSIXct("1970-01-01"))

Dr Oleg Sklyar
Technology Group
Man Investments Ltd
+44 (0)20 7144 3803
osklyar at maninvestments.com

______________________________________________
R-devel at r-project.org mailing list
stat.ethz.ch/mailman/listinfo/r-devel
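For context on why the correction matters: as.POSIXct("1970-01-01") without a tz argument is interpreted in the session's time zone, so its numeric value is not guaranteed to be 0. The sketch below shows the corrected form next to one possible alternative, which assumes that a UTC origin is acceptable; the variable names are illustrative only.

```r
# Parsed in the session's time zone, so the numeric value may be non-zero.
local_origin <- as.POSIXct("1970-01-01")

# The corrected form: subtracting the object's own numeric value always
# lands on 0 seconds, whatever the session's time zone is.
zero_origin <- local_origin - as.numeric(local_origin)

# Assumption: a UTC origin is acceptable. Then tz = "UTC" gives 0 directly.
utc_origin <- as.POSIXct("1970-01-01", tz = "UTC")

as.numeric(zero_origin) == 0
as.numeric(utc_origin) == 0
```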