thr3ads.net - R devel - [Rd] as.Date.character speed improvement suggestion [Aug 2013]

McGehee, Robert

2013-Aug-16 17:54 UTC

[Rd] as.Date.character speed improvement suggestion

R-Devel,
I store and retrieve a large amount of financial data (millions of rows) in a
PostgreSQL database keyed by date (and represented in R by class Date).
Unfortunately, I frequently find that a great deal of processing time is spent
converting dates from character representations to Date class representations in
R, presumably because strptime is not fast for large vectors (>10,000
elements). I'd like to suggest a patch that speeds up the date conversion
considerably for almost all large date vectors (up to 400x in some real-life
cases).

I suspect most everyone with large vectors of class Date will find that most of
their values are duplicated (repeatedly). (There are, after all, only 36,524
days in a century.) Given this, as.Date.character can be sped up substantially
for large vectors by only calling strptime on unique dates and then filling in
the calculated values for the entire vector. Since the time savings can be
several minutes in real-life cases, I think this enhancement should certainly be
considered. Also, in a worst case scenario of a long vector with only one
duplicated value, the suggested change does not slow down the calculation.

Here's a proof of concept:
as.Date.character2 <- function(x, ...) {
    if (anyDuplicated(x)) {
        ux <- unique(x)
        idx <- match(x, ux)
        y <- as.Date.character(ux, ...)
        return(y[idx])
    }
    as.Date.character(x, ...)
}
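
To make the indexing step concrete, here is a small illustrative walk-through of the unique()/match() round trip on a five-element vector. The vector contents are made up for illustration, and as.Date is used as the conversion so that the snippet runs on its own.

```r
# Toy input: two distinct dates, repeated, plus a missing value
x   <- c("2013-08-16", "2013-08-15", "2013-08-16", NA, "2013-08-15")

ux  <- unique(x)      # distinct strings, in order of first appearance
idx <- match(x, ux)   # for each element of x, its position within ux
y   <- as.Date(ux)    # the expensive conversion runs once per distinct value
res <- y[idx]         # expand back to the full length of x

# The expanded result should agree with converting every element directly
stopifnot(identical(res, as.Date(x)))
```

Since match() pairs NA with NA, missing values should survive the round trip unchanged.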

## Example 1: Construct a 1-million length character vector of 1000 unique dates
## By considering only unique values, speed is >250x faster
> dtch <- format(sample(Sys.Date()-1:1000, 1e6, replace=TRUE))
> system.time(dt1 <- as.Date.character(dtch))
   user  system elapsed 
 12.630  23.628  36.262
> system.time(dt2 <- as.Date.character2(dtch))
   user  system elapsed 
  0.117   0.019   0.136 
> identical(dt1, dt2)
[1] TRUE


## Example 2: In a "worst case" scenario of a 1,000,002 length character of 1,000,001 unique dates
## the new function is not any slower (within error).
> dtch <- format(c(Sys.Date(), Sys.Date()+-5e5:5e5))
> system.time(dt1 <- as.Date.character(dtch))
   user  system elapsed 
 20.264  25.584  45.855
> system.time(dt2 <- as.Date.character2(dtch))
   user  system elapsed 
 20.525  24.809  45.335 
> identical(dt1, dt2)
[1] TRUE

Alternatively, this logic could be built into strptime itself.
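
As a rough sketch of what the same idea might look like one level down, the wrapper below applies the unique()/match() trick around strptime. The name strptime_unique is hypothetical and not part of R; this is a user-level illustration, not a proposed change to the C code behind strptime.

```r
# Hypothetical wrapper (not part of base R): parse each distinct string once,
# then expand the POSIXlt result back to the length of the input.
strptime_unique <- function(x, format, tz = "") {
    ux  <- unique(x)
    idx <- match(x, ux)
    strptime(ux, format, tz = tz)[idx]   # POSIXlt supports integer indexing
}

dtch <- c("2013-08-16", "2013-08-15", "2013-08-16")
lt   <- strptime_unique(dtch, "%Y-%m-%d", tz = "UTC")
```

The sketch assumes x is an atomic character vector and a single format string. strptime also accepts a vector of formats, which this wrapper does not handle.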

Robert

Simon Urbanek

2013-Aug-16 20:02 UTC


On Aug 16, 2013, at 1:54 PM, McGehee, Robert wrote:
> R-Devel,
> I store and retrieve a large amount of financial data (millions of rows) in
> a PostgreSQL database keyed by date (and represented in R by class Date).
> Unfortunately, I frequently find that a great deal of processing time is spent
> converting dates from character representations to Date class representations in
> R, presumably because strptime is not fast for large vectors (>10,000
> elements). I'd like to suggest a patch that speeds up the date conversion
> considerably for most every large date vectors (up to 400x in some real life
> cases).
> 
This is more of a comment: if you want speed and have a standard date format,
you can use fastPOSIXct from fasttime. The real bottleneck is the system calls
that do the conversion, and fasttime avoids them by doing fast string parsing
instead:
> system.time(dt1 <- as.Date.character(dtch))
   user  system elapsed 
 31.513   0.046  31.559 
> system.time(dt1 <- as.Date(fasttime::fastPOSIXct(dtch)))
   user  system elapsed 
  0.055   0.018   0.074 
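
To illustrate the general idea of parsing a fixed format directly rather than going through strptime, here is a minimal base-R sketch. It is not how fasttime is implemented (fasttime is written in C); the function name parse_iso_date is hypothetical, and the code assumes every non-missing input is a well-formed "YYYY-MM-DD" string, with no validation.

```r
# Hypothetical helper: convert "YYYY-MM-DD" strings to Date using only
# substring extraction and integer arithmetic (proleptic Gregorian calendar).
parse_iso_date <- function(x) {
    y <- as.integer(substr(x, 1, 4))
    m <- as.integer(substr(x, 6, 7))
    d <- as.integer(substr(x, 9, 10))

    # Shift the year so that it starts in March; leap days then fall at year end
    y   <- y - (m <= 2L)
    era <- y %/% 400L                         # 400-year Gregorian cycle
    yoe <- y - era * 400L                     # year within the cycle, 0..399
    mp  <- (m + 9L) %% 12L                    # month index with March = 0
    doy <- (153L * mp + 2L) %/% 5L + d - 1L   # day within the shifted year
    doe <- yoe * 365L + yoe %/% 4L - yoe %/% 100L + doy

    # 146097 days per cycle; 719468 days from 0000-03-01 to 1970-01-01
    structure(era * 146097 + doe - 719468, class = "Date")
}

x <- c("1970-01-01", "2000-02-29", "2013-08-16")
stopifnot(all(as.numeric(parse_iso_date(x)) == as.numeric(as.Date(x))))
```

Whether this beats strptime in practice depends on the platform, so timings should be measured rather than assumed.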

Cutting back to unique dates may work for some applications (not for any of
ours because we are always dealing with timestamps - but that's why we use
POSIXct and not Date), but I'd argue that you may as well do it right away
in your specialized application instead.

Cheers,
Simon


> I suspect most everyone with large vectors of class Date will find that
> most of their values are duplicated (repeatedly). (There are, after all, only
> 36,524 days in a century.) Given this, as.Date.character can be sped up
> substantially for large vectors by only calling strptime on unique dates and
> then filling in the calculated values for the entire vector. Since the time
> savings can be several minutes in real-life cases, I think this enhancement
> should certainly be considered. Also, in a worst case scenario of a long vector
> with only one duplicated value, the suggested change does not slow down the
> calculation.
> 
> Here's a proof of concept:
> as.Date.character2 <- function(x, ...) {
>    if (anyDuplicated(x)) {
>        ux <- unique(x)
>        idx <- match(x, ux)
>        y <- as.Date.character(ux, ...)
>        return(y[idx])
>    }
>    as.Date.character(x, ...)
> }
> 
> ## Example1: Construct a 1-million length character vector of 1000 unique dates
> ## By considering only unique values, speed is >250x faster
> 
>> dtch <- format(sample(Sys.Date()-1:1000, 1e6, replace=TRUE))
>> system.time(dt1 <- as.Date.character(dtch))
>   user  system elapsed 
> 12.630  23.628  36.262
>> system.time(dt2 <- as.Date.character2(dtch))
>   user  system elapsed 
>  0.117   0.019   0.136 
>> identical(dt1, dt2)
> [1] TRUE
> 
> 
> ## Example2: In a "worst case" scenario of a 1,000,002 length character of 1,000,001 unique dates
> ## the new function is not any slower (within error).
>> dtch <- format(c(Sys.Date(), Sys.Date()+-5e5:5e5))
>> system.time(dt1 <- as.Date.character(dtch))
>   user  system elapsed 
> 20.264  25.584  45.855
>> system.time(dt2 <- as.Date.character2(dtch))
>   user  system elapsed 
> 20.525  24.809  45.335 
>> identical(dt1, dt2)
> [1] TRUE
> 
> Alternatively, this logic should be built in to strptime itself.
> 
> Robert
> 
> ______________________________________________
> R-devel at r-project.org mailing list
> https://stat.ethz.ch/mailman/listinfo/r-devel
> 
>
