I have a data set that is comprised of, for simplicity, a vector of numbers that I want to march across in overlapping windows of say 10 values each, computing a couple of values for each window. Is there a vectorized way to do this, or do I truly need to resort to looping--I think so? Any other clever thoughts? Thanks, Sean [[alternative HTML version deleted]]
See ?embed and possibly ?filter Date: Sat, 20 Mar 2004 15:16:38 -0500 From: Sean Davis <sdavis2 at mail.nih.gov> To: R-Help <r-help at stat.math.ethz.ch> Subject: [R] Operating on windows of data I have a data set that is comprised of, for simplicity, a vector of numbers that I want to march across in overlapping windows of say 10 values each, computing a couple of values for each window. Is there a vectorized way to do this, or do I truly need to resort to looping--I think so? Any other clever thoughts? Thanks, Sean [[alternative HTML version deleted]]
On Sat, 2004-03-20 at 14:16, Sean Davis wrote:> I have a data set that is comprised of, for simplicity, a vector of > numbers that I want to march across in overlapping windows of say 10 > values each, computing a couple of values for each window. Is there a > vectorized way to do this, or do I truly need to resort to looping--I > think so? Any other clever thoughts? > > Thanks, > SeanSee the running() function in the 'gregmisc' package on CRAN or the filter() function in package 'ts', which is part of the standard R installation. HTH, Marc Schwartz
> I have a data set that is comprised of, for simplicity, a vector of > numbers that I want to march across+in overlapping windows of say 10 > values each, computing a couple of values for each window. Is there > +a vectorized way to do this, or do I truly need to resort to > looping--I think so? Any other clever thoughts?I'm not sure this is clever, but I use the subset capabilities of R. See http://www.mayin.org/ajayshah/KB/R/ols.html 3 of the examples there are related to your question, though it's still using loops. :-) -- Ajay Shah Consultant ajayshah at mayin.org Department of Economic Affairs http://www.mayin.org/ajayshah Ministry of Finance, New Delhi
You can retain the trick of using subset and still get rid of the loop in: http://www.mayin.org/ajayshah/KB/R/EXAMPLES/rollingreg.R by using sapply like this (untested): dat <- sapply( seq(T-width), function(i) { model <- lm(dlinrchf ~ dlusdchf + dljpychf + dldemchf, A, i:(i+width-1)) details <- summary.lm(model) tmp <- coefficients(model) c( USD = tmp[2], JPY = tmp[3], DEM = tmp[4], R2 = details$r.squared, RMSE = details$sigma ) } ) dat <- as.data.frame(t(dat)) attach(dat) Date: Mon, 22 Mar 2004 10:28:45 +0530 From: Ajay Shah <ajayshah at mayin.org> To: <sdavis2 at mail.nih.gov>, <r-help at stat.math.ethz.ch> Subject: Re: [R] Operating on windows of data> I have a data set that is comprised of, for simplicity, a vector of > numbers that I want to march across+in overlapping windows of say 10 > values each, computing a couple of values for each window. Is there > +a vectorized way to do this, or do I truly need to resort to > looping--I think so? Any other clever thoughts?I'm not sure this is clever, but I use the subset capabilities of R. See http://www.mayin.org/ajayshah/KB/R/ols.html 3 of the examples there are related to your question, though it's still using loops. :-) -- Ajay Shah Consultant ajayshah at mayin.org Department of Economic Affairs http://www.mayin.org/ajayshah Ministry of Finance, New Delhi
> On Mon, Mar 22, 2004 at 01:39:28AM -0500, Gabor Grothendieck wrote: > > You can retain the trick of using subset and still get > > rid of the loop in: > > > > http://www.mayin.org/ajayshah/KB/R/EXAMPLES/rollingreg.R > > > > by using sapply like this (untested): > > > > dat <- sapply( seq(T-width), function(i) { > > model <- lm(dlinrchf ~ dlusdchf + dljpychf + dldemchf, A, > > i:(i+width-1)) > > details <- summary.lm(model) > > tmp <- coefficients(model) > > c( USD = tmp[2], JPY = tmp[3], DEM = tmp[4], > > R2 = details$r.squared, RMSE = details$sigma ) > > } ) > > dat <- as.data.frame(t(dat)) > > attach(dat) > > This brings me to a question I've always had about "the R way" of > avoiding loops. Yes, the sapply() approach above works. My question > is: Why is this much better than writing it using loops? > > Loops tap into the intuition of millions of people who have grown up > around procedural languages. Atleast to a person like me, I can read > code involving loops effortlessly. > > And I don't see how much faster the sapply() will be. Intuitively, we > may think that the sapply() results in C code getting executed (in the > R sources), while the for loop results in interpretation overhead, and > so the sapply() is surely faster. But when the body of the for loop > involves a weighty thing like a QR decomposition (for the OLS), that > would seem to dominate the cost - as far as I can tell.Its true that vectorizing loops can make it faster but in my mind the main advantage is the conceptual one of working with whole objects at a time and consequently reduction of code size. Admittedly, the example above does not get you much although even here it is slightly shorter than the loop version as it puts the arrays together for you rather than making you set them up yourself. Also, not shown, there are subsequent summary() statements in your file and there would be a further opportunity for code reduction since now your data is in a data frame rather than individual vectors so you could combine all the summary statements into one. Its really not the best example of the desired approach since it chickens out and uses the loop indices to sapply() over but I don't think one can expect a complete win for the vectorized approach in every single case.
Martin Maechler <maechler at stat.math.ethz.ch> wrote that [using apply functions is] not [better than using loops], not at all. He would be right in all he says if time efficiency were the only reason to prefer one coding style to another. Loop-free notatation can reduce the number of variables in scope, making the code easier to read and rearrange. By separating "what to do with the elements" from "how to find the elements", it can lead to pieces which are separately reusable. By reducing the volume of code, it can result in code with fewer mistakes. (Note the word "can" in each of those sentences.)
It appears that I owe Martin Maechler <maechler at stat.math.ethz.ch> an apology for not realising the importance of the context for what I quoted. I apologise. but *please* note again the code snippet we where talking about : > dat <- sapply( seq(T-width), function(i) { > model <- lm(dlinrchf ~ dlusdchf + dljpychf + dldemchf, A, > i:(i+width-1)) > details <- summary.lm(model) > tmp <- coefficients(model) > c( USD = tmp[2], JPY = tmp[3], DEM = tmp[4], > R2 = details$r.squared, RMSE = details$sigma ) > } ) > dat <- as.data.frame(t(dat)) > attach(dat) which is really an example where sapply() rather obfuscates than clarifies. It's not clear to me that the choice of sapply() -vs- 'for' really has anything to do with it here. Hmm, maybe it does. Looking at this code, I can see at a glance that - dat will be a matrix - it will have columns 1:T-width - it will have rows USD, JPY, DEM, R2, RMSE - each column reflects one linear model and I don't have to decode a lot of indexed assignment statements to figure this out. The first way to improve clarity would be to use keyword parameters on the call to lm, e.g., lm(..., data = A, subset = i:(i+width-1)). The second way to improve clarity would be to use character indices on tmp rather than integer indices: coef <- coefficients(model) c(USD = coef["dlusdchf"], JPY = coef["djpychf"], DEM = coef["dldemchf"], R2 = details$r.squared, RMSE= details$sigma) Hmm. My "first" and "second" ways are both the same: use names rather than position. There is one more clarity improvement to recommend, and it has nothing to do with using or avoiding sapply(), at least not directly. # dfapply(X, FUN, ...) is like sapply() but # expects FUN to return c(x1=...,xn=...) vectors which it # turns into rows of the data frame that it returns. dfapply <- function (...) as.data.frame(t(sapply(...))) # Make "dat" a data frame with columns USD, JPY, DEM, R2, RMSE # and rows 1:T-width, the ith row extracted from a linear # regression on cases i:(i+width-1). dat <- dfapply(seq(T-width), function (i) { model <- lm(dlinrchf ~ dlusdchf + dljpychf + dldemchf, data = A, subset = i:(i+width-1)) s <- summary.lm(model) v <- coefficients(model) c(USD = v["dlusdchf"], JPY = v["djpychf"], DEM = v["dldemchf"], R2 = s$r.squared, RMSE = s$sigma) }) Now here's where using sapply() instead of 'for' does pay off, even here. We ask the question "where is 'i' used?" Because we're *not* using i in any visible index calculations, there is only one place that 'i' is used, and that's in the subset= argument of the lm() call. That prompts the question "is there any way to exploit the fact that the rest of the linear model is the same? Depending on the relative sizes of A and T-width, there may well be, and Statistical Models in S explains, if memory serves me, how to do this kind of thing. But without the fact that i is only used in one place, it might not be as obvious that it was worth thinking about.
Far from being an example where vectorization seems to have only minor advantages, if any, over a non-vectorized approach, it turns out that this is a shining example of why vectorization produces higher quality code. Consider that the loop approach and all the other solutions so far have an error! The last time point is never included in any window. The genesis of the error is that index arithmetic for producing the loop limits and sapply index is sufficiently complex that no one noticed the error. Vectorizing *sufficiently* eliminates the calculation of these limits entirely and so avoids the possibility of mistake. The problem so far is that we just did not take the vectorized approach far enough (mea culpa). The following solution uses embed (which can be regarded as a vectorized sliding window) to get us out of considering such explicit limits. By eliminating the calculation of these limits it avoids any potential for this sort of error in the first place. (It also avoids the transpose by producing data frames at each step and rbinding them.) do.call( "rbind", apply( embed(T, width), 1, function(idx) { model <- lm(dlinrchf ~ dlusdchf + dljpychf + dldemchf, data = A, subset = idx) s <- summary.lm(model) v <- coefficients(model) data.frame(USD = v["dlusdchf"], JPY = v["djpychf"], DEM = v["dldemchf"], R2 = s$r.squared, RMSE = s$sigma) } ) ) Note that the above example loops over vectors of indices instead of single indexes. There is an opportunity to do away with indices altogether although it is at the considerable expense of space efficiency so, in practice, one would likely be content with the last solution. Nevertheless, it is of interest to display the next solution even if only for sake of completing the thought. Define embed.data.frame to produce a list of data frames that represents the sliding windows. Perform a lapply over those: embed.data.frame <- function(df,width) apply(embed(1:nrow(df),width),1,function(idx)df[idx,]) do.call( "rbind", lapply(embed.data.frame(df,width), function(df) { model <- lm(dlinrchf ~ dlusdchf + dljpychf + dldemchf, data = df) s <- summary.lm(model) v <- coefficients(model) data.frame(USD = v["dlusdchf"], JPY = v["djpychf"], DEM = v["dldemchf"], R2 = s$r.squared, RMSE = s$sigma) } ) ) This gets rid of all reference to idx and nrow(df). --- Date: Wed, 24 Mar 2004 14:56:46 +1200 (NZST) From: Richard A. O'Keefe <ok at cs.otago.ac.nz> To: <r-help at stat.math.ethz.ch> Subject: Re: [R] Operating on windows of data [snip] Hmm. My "first" and "second" ways are both the same: use names rather than position. There is one more clarity improvement to recommend, and it has nothing to do with using or avoiding sapply(), at least not directly. # dfapply(X, FUN, ...) is like sapply() but # expects FUN to return c(x1=...,xn=...) vectors which it # turns into rows of the data frame that it returns. dfapply <- function (...) as.data.frame(t(sapply(...))) # Make "dat" a data frame with columns USD, JPY, DEM, R2, RMSE # and rows 1:T-width, the ith row extracted from a linear # regression on cases i:(i+width-1). dat <- dfapply(seq(T-width), function (i) { model <- lm(dlinrchf ~ dlusdchf + dljpychf + dldemchf, data = A, subset = i:(i+width-1)) s <- summary.lm(model) v <- coefficients(model) c(USD = v["dlusdchf"], JPY = v["djpychf"], DEM = v["dldemchf"], R2 = s$r.squared, RMSE = s$sigma) }) Now here's where using sapply() instead of 'for' does pay off, even here. We ask the question "where is 'i' used?" Because we're *not* using i in any visible index calculations, there is only one place that 'i' is used, and that's in the subset= argument of the lm() call. That prompts the question "is there any way to exploit the fact that the rest of the linear model is the same? Depending on the relative sizes of A and T-width, there may well be, and Statistical Models in S explains, if memory serves me, how to do this kind of thing. But without the fact that i is only used in one place, it might not be as obvious that it was worth thinking about.
[I am having some problems with returned email. Sorry if you get this twice.] Well so much for pontificating. My "shining example" has an error too although this error is less serious since it will give an error message. embed(T,width) should be embed(1:T,width). But perhaps there is more to this than we realize since the subsequent (space inefficient) solution which I thought might be taking it too far does not require calculation of T except in embed.data.frame which is a reusable function that would likely be of higher quality through reuse and in any case the fact is that it did avoid this error. Date: Wed, 24 Mar 2004 00:09:19 -0500 (EST) From: Gabor Grothendieck <ggrothendieck at myway.com> To: <ok at cs.otago.ac.nz>, <r-help at stat.math.ethz.ch> Cc: <ajayshah at mayin.org> Subject: Re: [R] Operating on windows of data Far from being an example where vectorization seems to have only minor advantages, if any, over a non-vectorized approach, it turns out that this is a shining example of why vectorization produces higher quality code. Consider that the loop approach and all the other solutions so far have an error! The last time point is never included in any window. The genesis of the error is that index arithmetic for producing the loop limits and sapply index is sufficiently complex that no one noticed the error. Vectorizing *sufficiently* eliminates the calculation of these limits entirely and so avoids the possibility of mistake. The problem so far is that we just did not take the vectorized approach far enough (mea culpa). The following solution uses embed (which can be regarded as a vectorized sliding window) to get us out of considering such explicit limits. By eliminating the calculation of these limits it avoids any potential for this sort of error in the first place. (It also avoids the transpose by producing data frames at each step and rbinding them.) do.call( "rbind", apply( embed(T, width), 1, function(idx) { model <- lm(dlinrchf ~ dlusdchf + dljpychf + dldemchf, data = A, subset = idx) s <- summary.lm(model) v <- coefficients(model) data.frame(USD = v["dlusdchf"], JPY = v["djpychf"], DEM = v["dldemchf"], R2 = s$r.squared, RMSE = s$sigma) } ) ) Note that the above example loops over vectors of indices instead of single indexes. There is an opportunity to do away with indices altogether although it is at the considerable expense of space efficiency so, in practice, one would likely be content with the last solution. Nevertheless, it is of interest to display the next solution even if only for sake of completing the thought. Define embed.data.frame to produce a list of data frames that represents the sliding windows. Perform a lapply over those: embed.data.frame <- function(df,width) apply(embed(1:nrow(df),width),1,function(idx)df[idx,]) do.call( "rbind", lapply(embed.data.frame(df,width), function(df) { model <- lm(dlinrchf ~ dlusdchf + dljpychf + dldemchf, data = df) s <- summary.lm(model) v <- coefficients(model) data.frame(USD = v["dlusdchf"], JPY = v["djpychf"], DEM = v["dldemchf"], R2 = s$r.squared, RMSE = s$sigma) } ) ) This gets rid of all reference to idx and nrow(df). --- Date: Wed, 24 Mar 2004 14:56:46 +1200 (NZST) From: Richard A. O'Keefe <ok at cs.otago.ac.nz> To: <r-help at stat.math.ethz.ch> Subject: Re: [R] Operating on windows of data [snip] Hmm. My "first" and "second" ways are both the same: use names rather than position. There is one more clarity improvement to recommend, and it has nothing to do with using or avoiding sapply(), at least not directly. # dfapply(X, FUN, ...) is like sapply() but # expects FUN to return c(x1=...,xn=...) vectors which it # turns into rows of the data frame that it returns. dfapply <- function (...) as.data.frame(t(sapply(...))) # Make "dat" a data frame with columns USD, JPY, DEM, R2, RMSE # and rows 1:T-width, the ith row extracted from a linear # regression on cases i:(i+width-1). dat <- dfapply(seq(T-width), function (i) { model <- lm(dlinrchf ~ dlusdchf + dljpychf + dldemchf, data = A, subset = i:(i+width-1)) s <- summary.lm(model) v <- coefficients(model) c(USD = v["dlusdchf"], JPY = v["djpychf"], DEM = v["dldemchf"], R2 = s$r.squared, RMSE = s$sigma) }) Now here's where using sapply() instead of 'for' does pay off, even here. We ask the question "where is 'i' used?" Because we're *not* using i in any visible index calculations, there is only one place that 'i' is used, and that's in the subset= argument of the lm() call. That prompts the question "is there any way to exploit the fact that the rest of the linear model is the same? Depending on the relative sizes of A and T-width, there may well be, and Statistical Models in S explains, if memory serves me, how to do this kind of thing. But without the fact that i is only used in one place, it might not be as obvious that it was worth thinking about.