I retrieve, a few hundred times, a group of time series (10-15 series
with 10000 values each); on every group I do some calculations, graphs,
etc. I wonder whether there is a faster method than the one presented
below to get an appropriate time series object.

Making a query with RODBC for every group I get a data frame like this:

> X
   ID                DATE     VALUE
14  3 2000-01-01 00:00:03 0.5726334
4   1 2000-01-01 00:00:03 0.8830174
1   1 2000-01-01 00:00:00 0.2875775
15  3 2000-01-01 00:00:04 0.1029247
11  3 2000-01-01 00:00:00 0.9568333
9   2 2000-01-01 00:00:03 0.5514350
7   2 2000-01-01 00:00:01 0.5281055
6   2 2000-01-01 00:00:00 0.0455565
12  3 2000-01-01 00:00:01 0.4533342
8   2 2000-01-01 00:00:02 0.8924190
3   1 2000-01-01 00:00:02 0.4089769
13  3 2000-01-01 00:00:02 0.6775706

And I want to get a timeSeries object or xts object like this:

                            1         2         3
2000-01-01 00:00:00 0.2875775 0.0455565 0.9568333
2000-01-01 00:00:01        NA 0.5281055 0.4533342
2000-01-01 00:00:02 0.4089769 0.8924190 0.6775706
2000-01-01 00:00:03 0.8830174 0.5514350 0.5726334
2000-01-01 00:00:04        NA        NA 0.1029247

Input data can be sorted or unsorted (the most complicated case is the
one in the example: unsorted with missing data); I can sort in the query
if that gives an advantage.

Some considerations:
- xts is generally faster than timeSeries.
- Both accept a matrix, so if I can create a matrix like the one shown
  above, plus a character vector of dates, faster than is possible with
  xts:::cbind, for example, I will have a faster implementation
  (package data.table? See the sketch after this message.).
- Create the time series objects in parallel and then merge (package plyr?).
- Faster merge algorithms?

Below is some code to generate the test case above:

set.seed(123)
N <- 5  # number of observations
K <- 3  # number of time series IDs

X <- data.frame(
  ID = rep(1:K, each = N),
  DATE = as.character(rep(as.POSIXct("2000-01-01", tz = "GMT") + 0:(N-1), K)),
  VALUE = runif(N*K), stringsAsFactors = FALSE)

X <- X[sample(1:(N*K), N*K), ]                       # shuffle rows to get random order (optional)
X <- X[-(sample(1:nrow(X), floor(nrow(X)*0.2))), ]   # drop 20% of rows to create missing data

head(X, 15)

library(timeSeries)

# use environments explicitly to avoid '<<-'
buildTimeSeriesFromDataFrame <- function(x, env)
{
  if (exists("xx", envir = env))   # if variable xx exists in env, cbind the new series to it
    assign("xx",
      cbind(get("xx", env), timeSeries(x$VALUE, x$DATE,
        format = '%Y-%m-%d %H:%M:%S',
        zone = 'GMT', units = as.character(x$ID[1]))),
      envir = env)
  else                             # otherwise create xx in env
    assign("xx",
      timeSeries(x$VALUE, x$DATE, format = '%Y-%m-%d %H:%M:%S',
        zone = 'GMT', units = as.character(x$ID[1])),
      envir = env)

  return(TRUE)
}

# use package plyr, faster than the 'by' function
tsDaply <- function(x, ...)
{
  library(plyr)
  e1 <- new.env(parent = baseenv())  # create a new env
  res <- daply(x, "ID", buildTimeSeriesFromDataFrame, env = e1)
  return(get("xx", e1))              # return xx from the env
}

## replicate 100 times
# Time03 <- replicate(100,
#   system.time(tsDaply(X, X$ID))[[1]])
# median(Time03)

# result
tsDaply(X, X$ID)

Thanks in advance for any input, best regards,
Den
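A minimal sketch of the data.table idea raised in the considerations above
(not the poster's code): cast the long data frame to wide form and wrap the
result in xts. The W and DT names are illustrative; it assumes the data.table
and xts packages are installed and reuses the X data frame generated above.

# Sketch only: data.table::dcast builds the wide matrix, one column per ID;
# missing ID/DATE combinations are filled with NA by default.
library(data.table)
library(xts)

DT <- as.data.table(X)
W  <- as.data.frame(dcast(DT, DATE ~ ID, value.var = "VALUE"))
x  <- xts(as.matrix(W[, -1]), order.by = as.POSIXct(W$DATE, tz = "GMT"))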
Hi Dan,

On Mon, Apr 4, 2011 at 7:49 AM, Den Alpin <den.alpin at gmail.com> wrote:
> [original question snipped]

Here's how I would do it with xts:

x <- xts(X[,c("ID","VALUE")], as.POSIXct(X[,"DATE"]))
do.call(merge, split(x$VALUE,x$ID))

My xts solution compares favorably to your solution:

> Time03 <- replicate(100,
+   system.time(tsDaply(X, X$ID))[[1]])
> median(Time03)
[1] 0.02
> xtsTime <- replicate(100,
+   system.time(do.call(merge, split(x$VALUE,x$ID)))[[1]])
> median(xtsTime)
[1] 0

Best,
--
Joshua Ulrich  |  FOSS Trading: www.fosstrading.com
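At this scale system.time() mostly reports 0, so the comparison above is
coarse. A hedged alternative timing sketch with the rbenchmark package
(an assumption; any benchmarking helper would do), reusing the X, x and
tsDaply objects defined earlier in the thread:

# Aggregate 100 runs of each approach; assumes rbenchmark is installed.
library(rbenchmark)
benchmark(
  plyr_timeSeries = tsDaply(X, X$ID),
  xts_split_merge = do.call(merge, split(x$VALUE, x$ID)),
  replications = 100,
  columns = c("test", "replications", "elapsed", "relative")
)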
Gabor Grothendieck
2011-Apr-04 14:08 UTC
[R] How to speed up grouping time series, help please
On Mon, Apr 4, 2011 at 8:49 AM, Den Alpin <den.alpin at gmail.com> wrote:
> [original question snipped]

Haven't checked how fast it is, but using read.zoo it's just one line of
code to produce the required matrix:

# set up input data frame, DF
Lines <- "ID,DATE,VALUE
3,2000-01-01 00:00:03,0.5726334
1,2000-01-01 00:00:03,0.8830174
1,2000-01-01 00:00:00,0.2875775
3,2000-01-01 00:00:04,0.1029247
3,2000-01-01 00:00:00,0.9568333
2,2000-01-01 00:00:03,0.5514350
2,2000-01-01 00:00:01,0.5281055
2,2000-01-01 00:00:00,0.0455565
3,2000-01-01 00:00:01,0.4533342
2,2000-01-01 00:00:02,0.8924190
1,2000-01-01 00:00:02,0.4089769
3,2000-01-01 00:00:02,0.6775706"

DF <- read.table(textConnection(Lines), header = TRUE, sep = ",")

# create zoo matrix
library(zoo)
z <- read.zoo(DF, split = 1, index = 2, tz = "")

The last line gives:

> z
                            1         2         3
2000-01-01 00:00:00 0.2875775 0.0455565 0.9568333
2000-01-01 00:00:01        NA 0.5281055 0.4533342
2000-01-01 00:00:02 0.4089769 0.8924190 0.6775706
2000-01-01 00:00:03 0.8830174 0.5514350 0.5726334
2000-01-01 00:00:04        NA        NA 0.1029247

--
Statistics & Software Consulting
GKX Group, GKX Associates Inc.
tel: 1-877-GKX-GROUP
email: ggrothendieck at gmail.com
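Since the original question asked for an xts or timeSeries object, one
possible follow-up (a sketch, not part of Gabor's post; assumes the xts
package is installed) is converting the zoo matrix directly:

# as.xts() keeps the POSIXct index and the per-ID columns produced by read.zoo.
library(xts)
x <- as.xts(z)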
I found a faster implementation (by an order of magnitude in my tests)
than the one using xts, split and merge (from Joshua). I report the two
fastest solutions below, with code to generate a test case; some work
remains to be done on column order and naming (a possible cleanup is
sketched after the code below). The test case has grown from my previous
post to give more realistic timings.

Any comment or idea to further speed up multivariate time series creation
with classes xts or timeSeries, starting from a data.frame like the one
reported here, is welcome.

Best regards,
Den

A data.frame example (code below to generate it):

   ID                DATE     VALUE
14  3 2000-01-01 00:00:03 0.5726334
4   1 2000-01-01 00:00:03 0.8830174
1   1 2000-01-01 00:00:00 0.2875775
15  3 2000-01-01 00:00:04 0.1029247
11  3 2000-01-01 00:00:00 0.9568333
9   2 2000-01-01 00:00:03 0.5514350
7   2 2000-01-01 00:00:01 0.5281055
6   2 2000-01-01 00:00:00 0.0455565
12  3 2000-01-01 00:00:01 0.4533342
8   2 2000-01-01 00:00:02 0.8924190
3   1 2000-01-01 00:00:02 0.4089769
13  3 2000-01-01 00:00:02 0.6775706

And I want to get a timeSeries object or xts object like this:

                            1         2         3
2000-01-01 00:00:00 0.2875775 0.0455565 0.9568333
2000-01-01 00:00:01        NA 0.5281055 0.4533342
2000-01-01 00:00:02 0.4089769 0.8924190 0.6775706
2000-01-01 00:00:03 0.8830174 0.5514350 0.5726334
2000-01-01 00:00:04        NA        NA 0.1029247

# CODE:

set.seed(123)
# set N to 5 to reproduce the data.frame above
N <- 1000
# set K to 3 to reproduce the data.frame above
K <- 10

X <- data.frame(
  ID = rep(1:K, each = N),
  DATE = as.character(rep(as.POSIXct("2000-01-01", tz = "GMT") + 0:(N-1), K)),
  VALUE = runif(N*K), stringsAsFactors = FALSE)

X <- X[sample(1:(N*K), N*K), ]
X <- X[-(sample(1:nrow(X), floor(nrow(X)*0.2))), ]

str(X)

# solution 1: xts + split + merge (Joshua's approach)
xtsSplit <- function(x)
{
  library(xts)
  x <- xts(x[, c("ID","VALUE")], as.POSIXct(x[, "DATE"]))
  return(do.call(merge, split(x$VALUE, x$ID)))
}

xtsSplitTime <- replicate(50,
  system.time(xtsSplit(X))[[1]])
median(xtsSplitTime)

# solution 2: reshape to wide format, then wrap in xts
xtsReshape <- function(x)
{
  library(xts)
  x <- reshape(x, idvar = "DATE", timevar = "ID", direction = "wide")
  x <- xts(x[,-1], as.POSIXct(x[,1]))
  return(x)
}

xtsReshapeTime <- replicate(50,
  system.time(xtsReshape(X))[[1]])
median(xtsReshapeTime)
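On the open point about column order and naming: a possible cleanup
(a sketch, not part of the original post) assumes the "VALUE.<ID>" column
names that reshape(..., direction = "wide") produces and puts the IDs in
numeric order.

# Rename "VALUE.3" -> "3" and reorder the columns by ID.
res <- xtsReshape(X)
colnames(res) <- sub("^VALUE\\.", "", colnames(res))
res <- res[, order(as.numeric(colnames(res)))]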