Thanks all. roll::roll_lm was essentially what I wanted. I think maybe I would prefer it to have options to return a few more things, but it is the coefficients, and the remaining statistics you might want can be calculated fast enough from there. On Thu, Jul 21, 2016 at 12:36 PM, Achim Zeileis <Achim.Zeileis at uibk.ac.at> wrote:> Jeremiah, > > for this purpose there are the "roll" and "RcppRoll" packages. Both use > Rcpp and the former also provides rolling lm models. The latter has a > generic interface that let's you define your own function. > > One thing to pay attention to, though, is the numerical reliability. > Especially on large time series with relatively short windows there is a > good chance of encountering numerically challenging situations. The QR > decomposition used by lm is fairly robust while other more straightforward > matrix multiplications may not be. This should be kept in mind when writing > your own Rcpp code for plugging it into RcppRoll. > > But I haven't check what the roll package does and how reliable that is... > > hth, > Z > > > On Thu, 21 Jul 2016, jeremiah rounds wrote: > > Hi, >> >> A not unusual task is performing a multiple regression in a rolling window >> on a time-series. A standard piece of advice for doing in R is >> something >> like the code that follows at the end of the email. I am currently using >> an "embed" variant of that code and that piece of advice is out there too. >> >> But, it occurs to me that for such an easily specified matrix operation >> standard R code is really slow. rollapply constantly returns to R >> interpreter at each window step for a new lm. All lm is at its heart is >> (X^t X)^(-1) * Xy, and if you think about doing that with Rcpp in rolling >> window you are just incrementing a counter and peeling off rows (or >> columns >> of X and y) of a particular window size, and following that up with some >> matrix multiplication in a loop. The psuedo-code for that Rcpp >> practically writes itself and you might want a wrapper of something like: >> rolling_lm (y=y, x=x, width=4). >> >> My question is this: has any of the thousands of R packages out there >> published anything like that. Rolling window multiple regressions that >> stay in C/C++ until the rolling window completes? No sense and writing it >> if it exist. >> >> >> Thanks, >> Jeremiah >> >> Standard (slow) advice for "rolling window regression" follows: >> >> >> set.seed(1) >> z <- zoo(matrix(rnorm(10), ncol = 2)) >> colnames(z) <- c("y", "x") >> >> ## rolling regression of width 4 >> rollapply(z, width = 4, >> function(x) coef(lm(y ~ x, data = as.data.frame(x))), >> by.column = FALSE, align = "right") >> >> ## result is identical to >> coef(lm(y ~ x, data = z[1:4,])) >> coef(lm(y ~ x, data = z[2:5,])) >> >> [[alternative HTML version deleted]] >> >> ______________________________________________ >> R-help at r-project.org mailing list -- To UNSUBSCRIBE and more, see >> https://stat.ethz.ch/mailman/listinfo/r-help >> PLEASE do read the posting guide >> http://www.R-project.org/posting-guide.html >> and provide commented, minimal, self-contained, reproducible code. >> >>[[alternative HTML version deleted]]
I would be careful about making assumptions regarding what is faster. Performance tends to be nonintuitive. When I ran rollapply/lm, rollapply/fastLm and roll_lm on the example you provided rollapply/fastLm was three times faster than roll_lm. Of course this could change with data of different dimensions but it would be worthwhile to do actual benchmarks before making assumptions. I also noticed that roll_lm did not give the same coefficients as the other two. set.seed(1) library(zoo) library(RcppArmadillo) library(roll) z <- zoo(matrix(rnorm(10), ncol = 2)) colnames(z) <- c("y", "x") ## rolling regression of width 4 library(rbenchmark) benchmark(fastLm = rollapplyr(z, width = 4, function(x) coef(fastLm(cbind(1, x[, 2]), x[, 1])), by.column = FALSE), lm = rollapplyr(z, width = 4, function(x) coef(lm(y ~ x, data = as.data.frame(x))), by.column = FALSE), roll_lm = roll_lm(coredata(z[, 1, drop = F]), coredata(z[, 2, drop = F]), 4, center = FALSE))[1:4] test replications elapsed relative 1 fastLm 100 0.22 1.000 2 lm 100 0.72 3.273 3 roll_lm 100 0.64 2.909 On Thu, Jul 21, 2016 at 3:45 PM, jeremiah rounds <roundsjeremiah at gmail.com> wrote:> Thanks all. roll::roll_lm was essentially what I wanted. I think maybe > I would prefer it to have options to return a few more things, but it is > the coefficients, and the remaining statistics you might want can be > calculated fast enough from there. > > > On Thu, Jul 21, 2016 at 12:36 PM, Achim Zeileis <Achim.Zeileis at uibk.ac.at> > wrote: > >> Jeremiah, >> >> for this purpose there are the "roll" and "RcppRoll" packages. Both use >> Rcpp and the former also provides rolling lm models. The latter has a >> generic interface that let's you define your own function. >> >> One thing to pay attention to, though, is the numerical reliability. >> Especially on large time series with relatively short windows there is a >> good chance of encountering numerically challenging situations. The QR >> decomposition used by lm is fairly robust while other more straightforward >> matrix multiplications may not be. This should be kept in mind when writing >> your own Rcpp code for plugging it into RcppRoll. >> >> But I haven't check what the roll package does and how reliable that is... >> >> hth, >> Z >> >> >> On Thu, 21 Jul 2016, jeremiah rounds wrote: >> >> Hi, >>> >>> A not unusual task is performing a multiple regression in a rolling window >>> on a time-series. A standard piece of advice for doing in R is >>> something >>> like the code that follows at the end of the email. I am currently using >>> an "embed" variant of that code and that piece of advice is out there too. >>> >>> But, it occurs to me that for such an easily specified matrix operation >>> standard R code is really slow. rollapply constantly returns to R >>> interpreter at each window step for a new lm. All lm is at its heart is >>> (X^t X)^(-1) * Xy, and if you think about doing that with Rcpp in rolling >>> window you are just incrementing a counter and peeling off rows (or >>> columns >>> of X and y) of a particular window size, and following that up with some >>> matrix multiplication in a loop. The psuedo-code for that Rcpp >>> practically writes itself and you might want a wrapper of something like: >>> rolling_lm (y=y, x=x, width=4). >>> >>> My question is this: has any of the thousands of R packages out there >>> published anything like that. Rolling window multiple regressions that >>> stay in C/C++ until the rolling window completes? No sense and writing it >>> if it exist. >>> >>> >>> Thanks, >>> Jeremiah >>> >>> Standard (slow) advice for "rolling window regression" follows: >>> >>> >>> set.seed(1) >>> z <- zoo(matrix(rnorm(10), ncol = 2)) >>> colnames(z) <- c("y", "x") >>> >>> ## rolling regression of width 4 >>> rollapply(z, width = 4, >>> function(x) coef(lm(y ~ x, data = as.data.frame(x))), >>> by.column = FALSE, align = "right") >>> >>> ## result is identical to >>> coef(lm(y ~ x, data = z[1:4,])) >>> coef(lm(y ~ x, data = z[2:5,])) >>> >>> [[alternative HTML version deleted]] >>> >>> ______________________________________________ >>> R-help at r-project.org mailing list -- To UNSUBSCRIBE and more, see >>> https://stat.ethz.ch/mailman/listinfo/r-help >>> PLEASE do read the posting guide >>> http://www.R-project.org/posting-guide.html >>> and provide commented, minimal, self-contained, reproducible code. >>> >>> > > [[alternative HTML version deleted]] > > ______________________________________________ > R-help at r-project.org mailing list -- To UNSUBSCRIBE and more, see > https://stat.ethz.ch/mailman/listinfo/r-help > PLEASE do read the posting guide http://www.R-project.org/posting-guide.html > and provide commented, minimal, self-contained, reproducible code.-- Statistics & Software Consulting GKX Group, GKX Associates Inc. tel: 1-877-GKX-GROUP email: ggrothendieck at gmail.com
I appreciate the timing, so much so I changed the code to show the issue. It is a problem of scale. roll_lm probably has a heavy start-up cost but otherwise completely out-performs those other versions at scale. I suspect you are timing the nearly constant time start-up cost in small data. I did give code to paint a picture, but it was just cartoon code lifted from stackexchange. If you want to characterize the real problem it is closer to: 30 day rolling windows on 24 daily (by hour) measurements for 5 years with 24+7 -1 dummy predictor variables and finally you need to do this for 300 sets of data. Pseudo-code is closer to what follows and roll_lm can handle that input in a timely manner. You can do it with lm.fit, but you need to spend a lot of time waiting. The issue of accuracy needs a follow-up check. Not sure why it would be different. Worth a check on that. Thanks, Jeremiah library(rbenchmark) N = 30*24*12*5 window = 30*24 npred = 15 #15 chosen arbitrarily... set.seed(1) library(zoo) library(RcppArmadillo) library(roll) x = matrix(rnorm(N*(npred+1)), ncol = npred+1) colnames(x) <- c("y", paste0("x", 1:npred)) z <- zoo(x) benchmark( roll_lm = roll_lm(coredata(z[, 1, drop = F]), coredata(z[, -1, drop F]), window, center = FALSE), replications=3) Which comes out as: test replications elapsed relative user.self sys.self user.child sys.child 1 roll_lm 3 6.273 1 38.312 0.654 0 0 ## You arn't going to get that below... benchmark(fastLm = rollapplyr(z, width = window, function(x) coef(fastLm(cbind(1, x[, -1]), x[, 1])), by.column = FALSE), lm = rollapplyr(z, width = window, function(x) coef(lm(y ~ ., data = as.data.frame(x))), by.column = FALSE), replications=3) On Thu, Jul 21, 2016 at 1:28 PM, Gabor Grothendieck <ggrothendieck at gmail.com> wrote:> I would be careful about making assumptions regarding what is faster. > Performance tends to be nonintuitive. > > When I ran rollapply/lm, rollapply/fastLm and roll_lm on the example > you provided rollapply/fastLm was three times faster than roll_lm. Of > course this could change with data of different dimensions but it > would be worthwhile to do actual benchmarks before making assumptions. > > I also noticed that roll_lm did not give the same coefficients as the > other two. > > set.seed(1) > library(zoo) > library(RcppArmadillo) > library(roll) > z <- zoo(matrix(rnorm(10), ncol = 2)) > colnames(z) <- c("y", "x") > > ## rolling regression of width 4 > library(rbenchmark) > benchmark(fastLm = rollapplyr(z, width = 4, > function(x) coef(fastLm(cbind(1, x[, 2]), x[, 1])), > by.column = FALSE), > lm = rollapplyr(z, width = 4, > function(x) coef(lm(y ~ x, data = as.data.frame(x))), > by.column = FALSE), > roll_lm = roll_lm(coredata(z[, 1, drop = F]), coredata(z[, 2, drop > F]), 4, > center = FALSE))[1:4] > > > test replications elapsed relative > 1 fastLm 100 0.22 1.000 > 2 lm 100 0.72 3.273 > 3 roll_lm 100 0.64 2.909 > > On Thu, Jul 21, 2016 at 3:45 PM, jeremiah rounds > <roundsjeremiah at gmail.com> wrote: > > Thanks all. roll::roll_lm was essentially what I wanted. I think > maybe > > I would prefer it to have options to return a few more things, but it is > > the coefficients, and the remaining statistics you might want can be > > calculated fast enough from there. > > > > > > On Thu, Jul 21, 2016 at 12:36 PM, Achim Zeileis < > Achim.Zeileis at uibk.ac.at> > > wrote: > > > >> Jeremiah, > >> > >> for this purpose there are the "roll" and "RcppRoll" packages. Both use > >> Rcpp and the former also provides rolling lm models. The latter has a > >> generic interface that let's you define your own function. > >> > >> One thing to pay attention to, though, is the numerical reliability. > >> Especially on large time series with relatively short windows there is a > >> good chance of encountering numerically challenging situations. The QR > >> decomposition used by lm is fairly robust while other more > straightforward > >> matrix multiplications may not be. This should be kept in mind when > writing > >> your own Rcpp code for plugging it into RcppRoll. > >> > >> But I haven't check what the roll package does and how reliable that > is... > >> > >> hth, > >> Z > >> > >> > >> On Thu, 21 Jul 2016, jeremiah rounds wrote: > >> > >> Hi, > >>> > >>> A not unusual task is performing a multiple regression in a rolling > window > >>> on a time-series. A standard piece of advice for doing in R is > >>> something > >>> like the code that follows at the end of the email. I am currently > using > >>> an "embed" variant of that code and that piece of advice is out there > too. > >>> > >>> But, it occurs to me that for such an easily specified matrix operation > >>> standard R code is really slow. rollapply constantly returns to R > >>> interpreter at each window step for a new lm. All lm is at its heart > is > >>> (X^t X)^(-1) * Xy, and if you think about doing that with Rcpp in > rolling > >>> window you are just incrementing a counter and peeling off rows (or > >>> columns > >>> of X and y) of a particular window size, and following that up with > some > >>> matrix multiplication in a loop. The psuedo-code for that Rcpp > >>> practically writes itself and you might want a wrapper of something > like: > >>> rolling_lm (y=y, x=x, width=4). > >>> > >>> My question is this: has any of the thousands of R packages out there > >>> published anything like that. Rolling window multiple regressions that > >>> stay in C/C++ until the rolling window completes? No sense and > writing it > >>> if it exist. > >>> > >>> > >>> Thanks, > >>> Jeremiah > >>> > >>> Standard (slow) advice for "rolling window regression" follows: > >>> > >>> > >>> set.seed(1) > >>> z <- zoo(matrix(rnorm(10), ncol = 2)) > >>> colnames(z) <- c("y", "x") > >>> > >>> ## rolling regression of width 4 > >>> rollapply(z, width = 4, > >>> function(x) coef(lm(y ~ x, data = as.data.frame(x))), > >>> by.column = FALSE, align = "right") > >>> > >>> ## result is identical to > >>> coef(lm(y ~ x, data = z[1:4,])) > >>> coef(lm(y ~ x, data = z[2:5,])) > >>> > >>> [[alternative HTML version deleted]] > >>> > >>> ______________________________________________ > >>> R-help at r-project.org mailing list -- To UNSUBSCRIBE and more, see > >>> https://stat.ethz.ch/mailman/listinfo/r-help > >>> PLEASE do read the posting guide > >>> http://www.R-project.org/posting-guide.html > >>> and provide commented, minimal, self-contained, reproducible code. > >>> > >>> > > > > [[alternative HTML version deleted]] > > > > ______________________________________________ > > R-help at r-project.org mailing list -- To UNSUBSCRIBE and more, see > > https://stat.ethz.ch/mailman/listinfo/r-help > > PLEASE do read the posting guide > http://www.R-project.org/posting-guide.html > > and provide commented, minimal, self-contained, reproducible code. > > > > -- > Statistics & Software Consulting > GKX Group, GKX Associates Inc. > tel: 1-877-GKX-GROUP > email: ggrothendieck at gmail.com >[[alternative HTML version deleted]]
Hi Jermiah: another possibly faster way would be to use a kalman filtering framework. I forget the details but duncan and horne have a paper which shows how a regression can be re-computed each time a new data point is added .I forget if they handle taking one off of the back also which is what you need. The paper at the link below isn't the paper I'm talking about but it's reference[1] in that paper. Note that this suggestion might not be a better approach than the various approaches already suggested so I wouldn't go this route unless you're very interested. Mark https://www.le.ac.uk/users/dsgp1/COURSES/MESOMET/ECMETXT/recurse.pdf On Thu, Jul 21, 2016 at 4:28 PM, Gabor Grothendieck <ggrothendieck at gmail.com> wrote:> I would be careful about making assumptions regarding what is faster. > Performance tends to be nonintuitive. > > When I ran rollapply/lm, rollapply/fastLm and roll_lm on the example > you provided rollapply/fastLm was three times faster than roll_lm. Of > course this could change with data of different dimensions but it > would be worthwhile to do actual benchmarks before making assumptions. > > I also noticed that roll_lm did not give the same coefficients as the > other two. > > set.seed(1) > library(zoo) > library(RcppArmadillo) > library(roll) > z <- zoo(matrix(rnorm(10), ncol = 2)) > colnames(z) <- c("y", "x") > > ## rolling regression of width 4 > library(rbenchmark) > benchmark(fastLm = rollapplyr(z, width = 4, > function(x) coef(fastLm(cbind(1, x[, 2]), x[, 1])), > by.column = FALSE), > lm = rollapplyr(z, width = 4, > function(x) coef(lm(y ~ x, data = as.data.frame(x))), > by.column = FALSE), > roll_lm = roll_lm(coredata(z[, 1, drop = F]), coredata(z[, 2, drop > F]), 4, > center = FALSE))[1:4] > > > test replications elapsed relative > 1 fastLm 100 0.22 1.000 > 2 lm 100 0.72 3.273 > 3 roll_lm 100 0.64 2.909 > > On Thu, Jul 21, 2016 at 3:45 PM, jeremiah rounds > <roundsjeremiah at gmail.com> wrote: > > Thanks all. roll::roll_lm was essentially what I wanted. I think > maybe > > I would prefer it to have options to return a few more things, but it is > > the coefficients, and the remaining statistics you might want can be > > calculated fast enough from there. > > > > > > On Thu, Jul 21, 2016 at 12:36 PM, Achim Zeileis < > Achim.Zeileis at uibk.ac.at> > > wrote: > > > >> Jeremiah, > >> > >> for this purpose there are the "roll" and "RcppRoll" packages. Both use > >> Rcpp and the former also provides rolling lm models. The latter has a > >> generic interface that let's you define your own function. > >> > >> One thing to pay attention to, though, is the numerical reliability. > >> Especially on large time series with relatively short windows there is a > >> good chance of encountering numerically challenging situations. The QR > >> decomposition used by lm is fairly robust while other more > straightforward > >> matrix multiplications may not be. This should be kept in mind when > writing > >> your own Rcpp code for plugging it into RcppRoll. > >> > >> But I haven't check what the roll package does and how reliable that > is... > >> > >> hth, > >> Z > >> > >> > >> On Thu, 21 Jul 2016, jeremiah rounds wrote: > >> > >> Hi, > >>> > >>> A not unusual task is performing a multiple regression in a rolling > window > >>> on a time-series. A standard piece of advice for doing in R is > >>> something > >>> like the code that follows at the end of the email. I am currently > using > >>> an "embed" variant of that code and that piece of advice is out there > too. > >>> > >>> But, it occurs to me that for such an easily specified matrix operation > >>> standard R code is really slow. rollapply constantly returns to R > >>> interpreter at each window step for a new lm. All lm is at its heart > is > >>> (X^t X)^(-1) * Xy, and if you think about doing that with Rcpp in > rolling > >>> window you are just incrementing a counter and peeling off rows (or > >>> columns > >>> of X and y) of a particular window size, and following that up with > some > >>> matrix multiplication in a loop. The psuedo-code for that Rcpp > >>> practically writes itself and you might want a wrapper of something > like: > >>> rolling_lm (y=y, x=x, width=4). > >>> > >>> My question is this: has any of the thousands of R packages out there > >>> published anything like that. Rolling window multiple regressions that > >>> stay in C/C++ until the rolling window completes? No sense and > writing it > >>> if it exist. > >>> > >>> > >>> Thanks, > >>> Jeremiah > >>> > >>> Standard (slow) advice for "rolling window regression" follows: > >>> > >>> > >>> set.seed(1) > >>> z <- zoo(matrix(rnorm(10), ncol = 2)) > >>> colnames(z) <- c("y", "x") > >>> > >>> ## rolling regression of width 4 > >>> rollapply(z, width = 4, > >>> function(x) coef(lm(y ~ x, data = as.data.frame(x))), > >>> by.column = FALSE, align = "right") > >>> > >>> ## result is identical to > >>> coef(lm(y ~ x, data = z[1:4,])) > >>> coef(lm(y ~ x, data = z[2:5,])) > >>> > >>> [[alternative HTML version deleted]] > >>> > >>> ______________________________________________ > >>> R-help at r-project.org mailing list -- To UNSUBSCRIBE and more, see > >>> https://stat.ethz.ch/mailman/listinfo/r-help > >>> PLEASE do read the posting guide > >>> http://www.R-project.org/posting-guide.html > >>> and provide commented, minimal, self-contained, reproducible code. > >>> > >>> > > > > [[alternative HTML version deleted]] > > > > ______________________________________________ > > R-help at r-project.org mailing list -- To UNSUBSCRIBE and more, see > > https://stat.ethz.ch/mailman/listinfo/r-help > > PLEASE do read the posting guide > http://www.R-project.org/posting-guide.html > > and provide commented, minimal, self-contained, reproducible code. > > > > -- > Statistics & Software Consulting > GKX Group, GKX Associates Inc. > tel: 1-877-GKX-GROUP > email: ggrothendieck at gmail.com > > ______________________________________________ > R-help at r-project.org mailing list -- To UNSUBSCRIBE and more, see > https://stat.ethz.ch/mailman/listinfo/r-help > PLEASE do read the posting guide > http://www.R-project.org/posting-guide.html > and provide commented, minimal, self-contained, reproducible code. >[[alternative HTML version deleted]]