Sklyar, Oleg (MI London)
2008-Apr-10 13:51 UTC
[Rd] ISOdate/ISOdatetime performance suggestions, other date/time questions
Dear list:
Working with date/times, I have found that ISOdate and ISOdatetime are
too slow on large vectors of data. I was surprised until I looked at the
implementation and the man page: "ISOdatetime and ISOdate are
convenience wrappers for strptime". In other words, they convert the
data to a character representation first in order to create a POSIXlt
object, which is then converted to POSIXct. And POSIXct, i.e. the number
of seconds since 1970, is really what one wants most often.
Obviously this is not a bug, but it is a suboptimal implementation of a
pretty important function, as the example below shows.
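To make the string round trip concrete, here is a minimal sketch of the kind of wrapper the man page describes. The name `iso_via_strings` is made up for illustration, and the body is an assumption about the general shape (paste the fields together, then parse them with strptime), not a copy of the R sources.

```r
# Hypothetical helper, for illustration only: build a date-time by pasting the
# numeric fields into strings and parsing them back with strptime().
iso_via_strings <- function(year, month, day, hour, min, sec, tz = "GMT") {
  x  <- paste(year, month, day, hour, min, sec, sep = "-")  # numeric -> character
  lt <- strptime(x, "%Y-%m-%d-%H-%M-%OS", tz = tz)          # character -> POSIXlt
  as.POSIXct(lt)                                            # POSIXlt -> POSIXct
}

p <- iso_via_strings(2008, 4, 10, 14, 4, 44)
as.numeric(p)  # seconds since 1970-01-01 00:00:00 UTC
```

Every element of a long vector goes through formatting and parsing here, which is where one would expect the time to go.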
Now my questions are:
- Is there any chance that the implementation can be changed in R (along
the lines suggested below, although tz still needs to be added)?
- Is there a better pure-R (no C) way than those shown below to convert
to POSIXct?
- Any idea why, in the example below, fooling R into thinking a list is
a POSIXlt is faster than creating a POSIXlt by rep or seq? It's not a
huge difference, but still. Unfortunately seq on POSIXlt returns POSIXct
anyway, so the class of 'origin' is set correctly.
- Any idea why seq is faster than rep when applied to POSIXct? There is
hardly anything simpler than operating on double values...
Thanks in advance for your comments,
Oleg
It's common in finance to work with time stamps stored in a form like
%Y%m%d.%H%M%OS, e.g. 20080410.140444 for now; this is what 'ts' in the
example below is:
ts = 1e4*trunc(rnorm(50000,2008,2)) + 1e2*trunc(runif(50000,1,12)) +
     trunc(runif(50000,1,28)) + 1e-2*trunc(runif(50000,1,24)) +
     1e-4*trunc(runif(50000,1,60)) + 1e-6*runif(50000,1,60)
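As a worked example of this encoding, the sketch below decodes the single stamp 20080410.140444 into its fields. The variable names are illustrative only. It takes the whole part of the time with floor, so that a fractional second is not carried into the whole seconds; that is a choice made for this sketch.

```r
# Decode one %Y%m%d.%H%M%OS stamp into its fields (illustrative names).
x <- 20080410.140444

date  <- trunc(x)                  # YYYYMMDD part: 20080410
time  <- round(1e6 * (x %% 1), 2)  # HHMMSS.ss part, rounded to absorb floating-point noise
whole <- floor(time)               # HHMMSS without the fractional second

fields <- list(
  year = date %/% 10000,
  mon  = (date %/% 100) %% 100,
  mday = date %% 100,
  hour = whole %/% 1e4,
  min  = (whole %/% 1e2) %% 1e2,
  sec  = whole %% 1e2 + (time - whole)
)
str(fields)
```

The same integer division and modulo steps work unchanged on a whole vector of stamps.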
posix.viaISOdate = function(x) {
  date = trunc(x@.Data)              # YYYYMMDD part
  time = round(1e6*x@.Data%%1,2)     # HHMMSS.ss part
  rtime = round(time)
  z = list(sec=rtime%%1e2 + time%%1,
           min=(rtime%/%1e2)%%1e2,
           hour=rtime%/%1e4,
           mday=date%%100,
           mon=(date%/%100)%%100,
           year=date%/%10000)
  ISOdate(z$year,z$mon,z$mday,z$hour,z$min,z$sec) # to POSIXct
}
## This is just a test of which is the faster way to create a long
## POSIXlt object, before the other implementations are given
origin = as.POSIXct("1970-01-01")
mean(sapply(1:25,function(i) system.time(
as.POSIXlt(rep(origin,600000))
))[1,])
# [1] 0.3972
mean(sapply(1:25,function(i) system.time(
as.POSIXlt(seq(origin, origin, length.out=600000))
))[1,])
# [1] 0.30528
posix.viaPOSIXlt1 = function(x) {
  origin = as.POSIXct("1970-01-01")
  # template POSIXlt of the right length; its fields are overwritten below
  z = as.POSIXlt(seq(origin, origin, length.out=length(x)))
  date = trunc(x@.Data)
  time = round(1e6*x@.Data%%1,2)
  rtime = round(time)
  z$sec=rtime%%1e2 + time%%1
  z$min=(rtime%/%1e2)%%1e2
  z$hour=rtime%/%1e4
  z$mday=date%%100
  z$mon=(date%/%100)%%100-1     # POSIXlt months are 0-based
  z$year=date%/%10000-1900      # POSIXlt years count from 1900
  as.double(z) # to POSIXct
}
posix.vialist = function(x) {
  date = trunc(x@.Data)
  time = round(1e6*x@.Data%%1,2)
  rtime = round(time)
  na = rep(0.0,length(x))       # filler for the fields that are not computed
  z = list(sec=rtime%%1e2 + time%%1,
           min=(rtime%/%1e2)%%1e2,
           hour=rtime%/%1e4,
           mday=date%%100,
           mon=(date%/%100)%%100-1,
           year=date%/%10000-1900,
           wday=na,yday=na,isdst=na)
  class(z) = c("POSIXt","POSIXlt")   # make the plain list look like a POSIXlt
  as.double(z) # to POSIXct
}
v1 = posix.viaISOdate(ts)
v2 = posix.viaPOSIXlt1(ts)
v3 = posix.vialist(ts)
all(v1==v2 & v2==v3)
# [1] TRUE
mean(sapply(1:25,function(i) system.time(
posix.viaISOdate(ts)
))[1,])
# [1] 1.54244
mean(sapply(1:25,function(i) system.time(
posix.viaPOSIXlt1(ts)
))[1,])
# [1] 0.37624
mean(sapply(1:25,function(i) system.time(
posix.vialist(ts)
))[1,])
# [1] 0.35488
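Since the timings above point at the string round trip as the main cost, one option is to skip POSIXlt altogether and compute the seconds arithmetically. The sketch below uses the well-known days-from-civil-date arithmetic for the proleptic Gregorian calendar. The names `days_from_civil` and `posix_arith` are hypothetical, and the sketch assumes all fields are already in UTC, so it does no time zone or DST handling. It has not been timed against the functions above.

```r
# Days between 1970-01-01 and the given civil date (vectorised, UTC only).
days_from_civil <- function(y, m, d) {
  y   <- y - (m <= 2)                  # treat Jan/Feb as months of the previous year
  era <- y %/% 400                     # 400-year Gregorian cycle
  yoe <- y - era * 400                 # year of era, 0..399
  mp  <- (m + 9) %% 12                 # March = 0, ..., February = 11
  doy <- (153 * mp + 2) %/% 5 + d - 1  # day of the March-based year
  doe <- yoe * 365 + yoe %/% 4 - yoe %/% 100 + doy
  era * 146097 + doe - 719468          # shift so that 1970-01-01 is day 0
}

# Hypothetical replacement for the ISOdate call: pure arithmetic on doubles.
posix_arith <- function(year, month, day, hour, min, sec) {
  secs <- 86400 * days_from_civil(year, month, day) +
          3600 * hour + 60 * min + sec
  structure(secs, class = c("POSIXct", "POSIXt"), tzone = "UTC")
}

p <- posix_arith(c(1970, 2008), c(1, 4), c(1, 10), c(0, 14), c(0, 4), c(0, 44))
as.numeric(p)
```

Because everything is plain vector arithmetic, no character vectors are created. The price is that time zones would have to be handled separately, which is the same tz caveat as in the first question.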
sessionInfo()
R version 2.6.2 (2008-02-08)
x86_64-unknown-linux-gnu
locale:
LC_CTYPE=en_GB.UTF-8;LC_NUMERIC=C;LC_TIME=en_GB.UTF-8;LC_COLLATE=C;LC_MONETARY=en_GB.UTF-8;LC_MESSAGES=en_GB.UTF-8;LC_PAPER=en_GB.UTF-8;LC_NAME=C;LC_ADDRESS=C;LC_TELEPHONE=C;LC_MEASUREMENT=en_GB.UTF-8;LC_IDENTIFICATION=C
attached base packages:
[1] stats graphics grDevices utils datasets methods base
loaded via a namespace (and not attached):
[1] rcompgen_0.1-17
Dr Oleg Sklyar
Technology Group
Man Investments Ltd
+44 (0)20 7144 3803
osklyar at maninvestments.com
Sklyar, Oleg (MI London)
2008-Apr-10 15:32 UTC
[Rd] ISOdate/ISOdatetime performance suggestions, other date/time questions
small correction:
# to ensure 0, although it will be overwritten when assigning hour
origin = as.POSIXct("1970-01-01") - as.numeric(as.POSIXct("1970-01-01"))