I have a data set that is comprised of, for simplicity, a vector of numbers that I want to march across in overlapping windows of say 10 values each, computing a couple of values for each window. Is there a vectorized way to do this, or do I truly need to resort to looping--I think so? Any other clever thoughts? Thanks, Sean [[alternative HTML version deleted]]
See
?embed
and possibly
?filter
Date: Sat, 20 Mar 2004 15:16:38 -0500
From: Sean Davis <sdavis2 at mail.nih.gov>
To: R-Help <r-help at stat.math.ethz.ch>
Subject: [R] Operating on windows of data
I have a data set that is comprised of, for simplicity, a vector of numbers that
I want to march across in overlapping windows of say 10 values each, computing a
couple of values for each window. Is there a vectorized way to do this, or do I
truly need to resort to looping--I think so? Any other clever thoughts?
Thanks,
Sean
[[alternative HTML version deleted]]
On Sat, 2004-03-20 at 14:16, Sean Davis wrote:> I have a data set that is comprised of, for simplicity, a vector of > numbers that I want to march across in overlapping windows of say 10 > values each, computing a couple of values for each window. Is there a > vectorized way to do this, or do I truly need to resort to looping--I > think so? Any other clever thoughts? > > Thanks, > SeanSee the running() function in the 'gregmisc' package on CRAN or the filter() function in package 'ts', which is part of the standard R installation. HTH, Marc Schwartz
> I have a data set that is comprised of, for simplicity, a vector of > numbers that I want to march across+in overlapping windows of say 10 > values each, computing a couple of values for each window. Is there > +a vectorized way to do this, or do I truly need to resort to > looping--I think so? Any other clever thoughts?I'm not sure this is clever, but I use the subset capabilities of R. See http://www.mayin.org/ajayshah/KB/R/ols.html 3 of the examples there are related to your question, though it's still using loops. :-) -- Ajay Shah Consultant ajayshah at mayin.org Department of Economic Affairs http://www.mayin.org/ajayshah Ministry of Finance, New Delhi
You can retain the trick of using subset and still get
rid of the loop in:
http://www.mayin.org/ajayshah/KB/R/EXAMPLES/rollingreg.R
by using sapply like this (untested):
dat <- sapply( seq(T-width), function(i) {
model <- lm(dlinrchf ~ dlusdchf + dljpychf + dldemchf, A,
i:(i+width-1))
details <- summary.lm(model)
tmp <- coefficients(model)
c( USD = tmp[2], JPY = tmp[3], DEM = tmp[4],
R2 = details$r.squared, RMSE = details$sigma )
} )
dat <- as.data.frame(t(dat))
attach(dat)
Date: Mon, 22 Mar 2004 10:28:45 +0530
From: Ajay Shah <ajayshah at mayin.org>
To: <sdavis2 at mail.nih.gov>, <r-help at stat.math.ethz.ch>
Subject: Re: [R] Operating on windows of data
> I have a data set that is comprised of, for simplicity, a vector of
> numbers that I want to march across+in overlapping windows of say 10
> values each, computing a couple of values for each window. Is there
> +a vectorized way to do this, or do I truly need to resort to
> looping--I think so? Any other clever thoughts?
I'm not sure this is clever, but I use the subset capabilities of
R. See
http://www.mayin.org/ajayshah/KB/R/ols.html
3 of the examples there are related to your question, though it's
still using loops. :-)
--
Ajay Shah Consultant
ajayshah at mayin.org Department of Economic Affairs
http://www.mayin.org/ajayshah Ministry of Finance, New Delhi
> On Mon, Mar 22, 2004 at 01:39:28AM -0500, Gabor Grothendieck wrote: > > You can retain the trick of using subset and still get > > rid of the loop in: > > > > http://www.mayin.org/ajayshah/KB/R/EXAMPLES/rollingreg.R > > > > by using sapply like this (untested): > > > > dat <- sapply( seq(T-width), function(i) { > > model <- lm(dlinrchf ~ dlusdchf + dljpychf + dldemchf, A, > > i:(i+width-1)) > > details <- summary.lm(model) > > tmp <- coefficients(model) > > c( USD = tmp[2], JPY = tmp[3], DEM = tmp[4], > > R2 = details$r.squared, RMSE = details$sigma ) > > } ) > > dat <- as.data.frame(t(dat)) > > attach(dat) > > This brings me to a question I've always had about "the R way" of > avoiding loops. Yes, the sapply() approach above works. My question > is: Why is this much better than writing it using loops? > > Loops tap into the intuition of millions of people who have grown up > around procedural languages. Atleast to a person like me, I can read > code involving loops effortlessly. > > And I don't see how much faster the sapply() will be. Intuitively, we > may think that the sapply() results in C code getting executed (in the > R sources), while the for loop results in interpretation overhead, and > so the sapply() is surely faster. But when the body of the for loop > involves a weighty thing like a QR decomposition (for the OLS), that > would seem to dominate the cost - as far as I can tell.Its true that vectorizing loops can make it faster but in my mind the main advantage is the conceptual one of working with whole objects at a time and consequently reduction of code size. Admittedly, the example above does not get you much although even here it is slightly shorter than the loop version as it puts the arrays together for you rather than making you set them up yourself. Also, not shown, there are subsequent summary() statements in your file and there would be a further opportunity for code reduction since now your data is in a data frame rather than individual vectors so you could combine all the summary statements into one. Its really not the best example of the desired approach since it chickens out and uses the loop indices to sapply() over but I don't think one can expect a complete win for the vectorized approach in every single case.
Martin Maechler <maechler at stat.math.ethz.ch> wrote that [using apply functions is] not [better than using loops], not at all. He would be right in all he says if time efficiency were the only reason to prefer one coding style to another. Loop-free notatation can reduce the number of variables in scope, making the code easier to read and rearrange. By separating "what to do with the elements" from "how to find the elements", it can lead to pieces which are separately reusable. By reducing the volume of code, it can result in code with fewer mistakes. (Note the word "can" in each of those sentences.)
It appears that I owe Martin Maechler <maechler at stat.math.ethz.ch>
an apology for not realising the importance of the context for what
I quoted. I apologise.
but *please* note again the code snippet we where talking about :
> dat <- sapply( seq(T-width), function(i) {
> model <- lm(dlinrchf ~ dlusdchf + dljpychf + dldemchf, A,
> i:(i+width-1))
> details <- summary.lm(model)
> tmp <- coefficients(model)
> c( USD = tmp[2], JPY = tmp[3], DEM = tmp[4],
> R2 = details$r.squared, RMSE = details$sigma )
> } )
> dat <- as.data.frame(t(dat))
> attach(dat)
which is really an example where sapply() rather obfuscates than
clarifies.
It's not clear to me that the choice of sapply() -vs- 'for' really
has
anything to do with it here. Hmm, maybe it does. Looking at this code,
I can see at a glance that
- dat will be a matrix
- it will have columns 1:T-width
- it will have rows USD, JPY, DEM, R2, RMSE
- each column reflects one linear model
and I don't have to decode a lot of indexed assignment statements to
figure this out.
The first way to improve clarity would be to use keyword parameters on
the call to lm, e.g., lm(..., data = A, subset = i:(i+width-1)).
The second way to improve clarity would be to use character indices on
tmp rather than integer indices:
coef <- coefficients(model)
c(USD = coef["dlusdchf"],
JPY = coef["djpychf"],
DEM = coef["dldemchf"],
R2 = details$r.squared,
RMSE= details$sigma)
Hmm. My "first" and "second" ways are both the same: use
names rather
than position. There is one more clarity improvement to recommend, and
it has nothing to do with using or avoiding sapply(), at least not
directly.
# dfapply(X, FUN, ...) is like sapply() but
# expects FUN to return c(x1=...,xn=...) vectors which it
# turns into rows of the data frame that it returns.
dfapply <- function (...) as.data.frame(t(sapply(...)))
# Make "dat" a data frame with columns USD, JPY, DEM, R2, RMSE
# and rows 1:T-width, the ith row extracted from a linear
# regression on cases i:(i+width-1).
dat <- dfapply(seq(T-width), function (i) {
model <- lm(dlinrchf ~ dlusdchf + dljpychf + dldemchf,
data = A, subset = i:(i+width-1))
s <- summary.lm(model)
v <- coefficients(model)
c(USD = v["dlusdchf"], JPY = v["djpychf"], DEM =
v["dldemchf"],
R2 = s$r.squared, RMSE = s$sigma)
})
Now here's where using sapply() instead of 'for' does pay off, even
here.
We ask the question "where is 'i' used?" Because we're
*not* using i in
any visible index calculations, there is only one place that 'i' is
used,
and that's in the subset= argument of the lm() call.
That prompts the question "is there any way to exploit the fact that the
rest of the linear model is the same? Depending on the relative sizes
of A and T-width, there may well be, and Statistical Models in S explains,
if memory serves me, how to do this kind of thing. But without the fact
that i is only used in one place, it might not be as obvious that it was
worth thinking about.
Far from being an example where vectorization
seems to have only minor advantages, if any, over a
non-vectorized approach, it turns out that this is a shining
example of why vectorization produces higher quality code.
Consider that the loop approach and all the other
solutions so far have an error! The last time point is
never included in any window.
The genesis of the error is that index arithmetic for
producing the loop limits and sapply index is sufficiently
complex that no one noticed the error. Vectorizing
*sufficiently* eliminates the calculation of these limits
entirely and so avoids the possibility of mistake. The
problem so far is that we just did not take the vectorized
approach far enough (mea culpa).
The following solution uses embed (which can be regarded as
a vectorized sliding window) to get us out of considering such
explicit limits. By eliminating the calculation of these limits
it avoids any potential for this sort of error in the first place.
(It also avoids the transpose by producing data frames at
each step and rbinding them.)
do.call( "rbind", apply( embed(T, width), 1,
function(idx) {
model <- lm(dlinrchf ~ dlusdchf + dljpychf + dldemchf,
data = A, subset = idx)
s <- summary.lm(model)
v <- coefficients(model)
data.frame(USD = v["dlusdchf"], JPY = v["djpychf"],
DEM = v["dldemchf"], R2 = s$r.squared, RMSE = s$sigma)
} )
)
Note that the above example loops over vectors of indices
instead of single indexes. There is an opportunity to do away
with indices altogether although it is at the considerable
expense of space efficiency so, in practice, one would likely
be content with the last solution. Nevertheless, it is of
interest to display the next solution even if only for sake of completing the
thought.
Define embed.data.frame to produce a list of data frames that
represents the sliding windows. Perform a lapply over those:
embed.data.frame <- function(df,width)
apply(embed(1:nrow(df),width),1,function(idx)df[idx,])
do.call( "rbind", lapply(embed.data.frame(df,width),
function(df) {
model <- lm(dlinrchf ~ dlusdchf + dljpychf + dldemchf, data = df)
s <- summary.lm(model)
v <- coefficients(model)
data.frame(USD = v["dlusdchf"], JPY = v["djpychf"],
DEM = v["dldemchf"], R2 = s$r.squared, RMSE = s$sigma)
} )
)
This gets rid of all reference to idx and nrow(df).
---
Date: Wed, 24 Mar 2004 14:56:46 +1200 (NZST)
From: Richard A. O'Keefe <ok at cs.otago.ac.nz>
To: <r-help at stat.math.ethz.ch>
Subject: Re: [R] Operating on windows of data
[snip]
Hmm. My "first" and "second" ways are both the same: use
names rather
than position. There is one more clarity improvement to recommend, and
it has nothing to do with using or avoiding sapply(), at least not
directly.
# dfapply(X, FUN, ...) is like sapply() but
# expects FUN to return c(x1=...,xn=...) vectors which it
# turns into rows of the data frame that it returns.
dfapply <- function (...) as.data.frame(t(sapply(...)))
# Make "dat" a data frame with columns USD, JPY, DEM, R2, RMSE
# and rows 1:T-width, the ith row extracted from a linear
# regression on cases i:(i+width-1).
dat <- dfapply(seq(T-width), function (i) {
model <- lm(dlinrchf ~ dlusdchf + dljpychf + dldemchf,
data = A, subset = i:(i+width-1))
s <- summary.lm(model)
v <- coefficients(model)
c(USD = v["dlusdchf"], JPY = v["djpychf"], DEM =
v["dldemchf"],
R2 = s$r.squared, RMSE = s$sigma)
})
Now here's where using sapply() instead of 'for' does pay off, even
here.
We ask the question "where is 'i' used?" Because we're
*not* using i in
any visible index calculations, there is only one place that 'i' is
used,
and that's in the subset= argument of the lm() call.
That prompts the question "is there any way to exploit the fact that the
rest of the linear model is the same? Depending on the relative sizes
of A and T-width, there may well be, and Statistical Models in S explains,
if memory serves me, how to do this kind of thing. But without the fact
that i is only used in one place, it might not be as obvious that it was
worth thinking about.
[I am having some problems with returned email. Sorry if
you get this twice.]
Well so much for pontificating. My "shining example" has an
error too although this error is less serious since it will
give an error message. embed(T,width) should be embed(1:T,width).
But perhaps there is more to this than we realize since the
subsequent (space inefficient) solution which I thought might
be taking it too far does not require calculation of T except
in embed.data.frame which is a reusable function that would
likely be of higher quality through reuse and in any case the
fact is that it did avoid this error.
Date: Wed, 24 Mar 2004 00:09:19 -0500 (EST)
From: Gabor Grothendieck <ggrothendieck at myway.com>
To: <ok at cs.otago.ac.nz>, <r-help at stat.math.ethz.ch>
Cc: <ajayshah at mayin.org>
Subject: Re: [R] Operating on windows of data
Far from being an example where vectorization
seems to have only minor advantages, if any, over a
non-vectorized approach, it turns out that this is a shining
example of why vectorization produces higher quality code.
Consider that the loop approach and all the other
solutions so far have an error! The last time point is
never included in any window.
The genesis of the error is that index arithmetic for
producing the loop limits and sapply index is sufficiently
complex that no one noticed the error. Vectorizing
*sufficiently* eliminates the calculation of these limits
entirely and so avoids the possibility of mistake. The
problem so far is that we just did not take the vectorized
approach far enough (mea culpa).
The following solution uses embed (which can be regarded as
a vectorized sliding window) to get us out of considering such
explicit limits. By eliminating the calculation of these limits
it avoids any potential for this sort of error in the first place.
(It also avoids the transpose by producing data frames at
each step and rbinding them.)
do.call( "rbind", apply( embed(T, width), 1,
function(idx) {
model <- lm(dlinrchf ~ dlusdchf + dljpychf + dldemchf,
data = A, subset = idx)
s <- summary.lm(model)
v <- coefficients(model)
data.frame(USD = v["dlusdchf"], JPY = v["djpychf"],
DEM = v["dldemchf"], R2 = s$r.squared, RMSE = s$sigma)
} )
)
Note that the above example loops over vectors of indices
instead of single indexes. There is an opportunity to do away
with indices altogether although it is at the considerable
expense of space efficiency so, in practice, one would likely
be content with the last solution. Nevertheless, it is of
interest to display the next solution even if only for sake of completing the
thought.
Define embed.data.frame to produce a list of data frames that
represents the sliding windows. Perform a lapply over those:
embed.data.frame <- function(df,width)
apply(embed(1:nrow(df),width),1,function(idx)df[idx,])
do.call( "rbind", lapply(embed.data.frame(df,width),
function(df) {
model <- lm(dlinrchf ~ dlusdchf + dljpychf + dldemchf, data = df)
s <- summary.lm(model)
v <- coefficients(model)
data.frame(USD = v["dlusdchf"], JPY = v["djpychf"],
DEM = v["dldemchf"], R2 = s$r.squared, RMSE = s$sigma)
} )
)
This gets rid of all reference to idx and nrow(df).
---
Date: Wed, 24 Mar 2004 14:56:46 +1200 (NZST)
From: Richard A. O'Keefe <ok at cs.otago.ac.nz>
To: <r-help at stat.math.ethz.ch>
Subject: Re: [R] Operating on windows of data
[snip]
Hmm. My "first" and "second" ways are both the same: use
names rather
than position. There is one more clarity improvement to recommend, and
it has nothing to do with using or avoiding sapply(), at least not
directly.
# dfapply(X, FUN, ...) is like sapply() but
# expects FUN to return c(x1=...,xn=...) vectors which it
# turns into rows of the data frame that it returns.
dfapply <- function (...) as.data.frame(t(sapply(...)))
# Make "dat" a data frame with columns USD, JPY, DEM, R2, RMSE
# and rows 1:T-width, the ith row extracted from a linear
# regression on cases i:(i+width-1).
dat <- dfapply(seq(T-width), function (i) {
model <- lm(dlinrchf ~ dlusdchf + dljpychf + dldemchf,
data = A, subset = i:(i+width-1))
s <- summary.lm(model)
v <- coefficients(model)
c(USD = v["dlusdchf"], JPY = v["djpychf"], DEM =
v["dldemchf"],
R2 = s$r.squared, RMSE = s$sigma)
})
Now here's where using sapply() instead of 'for' does pay off, even
here.
We ask the question "where is 'i' used?" Because we're
*not* using i in
any visible index calculations, there is only one place that 'i' is
used,
and that's in the subset= argument of the lm() call.
That prompts the question "is there any way to exploit the fact that the
rest of the linear model is the same? Depending on the relative sizes
of A and T-width, there may well be, and Statistical Models in S explains,
if memory serves me, how to do this kind of thing. But without the fact
that i is only used in one place, it might not be as obvious that it was
worth thinking about.