thr3ads.net - R help - [R] Operating on windows of data [Mar 2004]

If this information is useful, please help other people find it:
Share via:

Sean Davis

2004-Mar-20 20:16 UTC

[R] Operating on windows of data

I have a data set that is comprised of, for simplicity, a vector of numbers that
I want to march across in overlapping windows of say 10 values each, computing a
couple of values for each window.  Is there a vectorized way to do this, or do I
truly need to resort to looping--I think so?  Any other clever thoughts?

Thanks,
Sean

	[[alternative HTML version deleted]]

Gabor Grothendieck

2004-Mar-20 20:37 UTC

head link

[R] Operating on windows of data

See

   ?embed

and possibly

   ?filter


Date:   Sat, 20 Mar 2004 15:16:38 -0500 
From:   Sean Davis <sdavis2 at mail.nih.gov>
To:   R-Help <r-help at stat.math.ethz.ch> 
Subject:   [R] Operating on windows of data 

 
I have a data set that is comprised of, for simplicity, a vector of numbers that
I want to march across in overlapping windows of say 10 values each, computing a
couple of values for each window. Is there a vectorized way to do this, or do I
truly need to resort to looping--I think so? Any other clever thoughts?

Thanks,
Sean

     [[alternative HTML version deleted]]

Marc Schwartz

2004-Mar-20 20:38 UTC

head link

[R] Operating on windows of data

On Sat, 2004-03-20 at 14:16, Sean Davis wrote:> I have a data set that is comprised of, for simplicity, a vector of
> numbers that I want to march across in overlapping windows of say 10
> values each, computing a couple of values for each window.  Is there a
> vectorized way to do this, or do I truly need to resort to looping--I
> think so?  Any other clever thoughts?
> 
> Thanks,
> Sean

See the running() function in the 'gregmisc' package on CRAN or the
filter() function in package 'ts', which is part of the standard R
installation.

HTH,

Marc Schwartz

Ajay Shah

2004-Mar-22 04:58 UTC

head link

[R] Operating on windows of data

> I have a data set that is comprised of, for simplicity, a vector of
> numbers that I want to march across+in overlapping windows of say 10
> values each, computing a couple of values for each window.  Is there
> +a vectorized way to do this, or do I truly need to resort to
> looping--I think so?  Any other clever thoughts?
I'm not sure this is clever, but I use the subset capabilities of
R. See

http://www.mayin.org/ajayshah/KB/R/ols.html

3 of the examples there are related to your question, though it's
still using loops. :-)

-- 
Ajay Shah                                                   Consultant
ajayshah at mayin.org                      Department of Economic Affairs
http://www.mayin.org/ajayshah           Ministry of Finance, New Delhi

Gabor Grothendieck

2004-Mar-22 06:39 UTC

head link

[R] Operating on windows of data

You can retain the trick of using subset and still get
rid of the loop in:

   http://www.mayin.org/ajayshah/KB/R/EXAMPLES/rollingreg.R

by using sapply like this (untested):

dat <- sapply( seq(T-width), function(i) {
    model <- lm(dlinrchf ~ dlusdchf + dljpychf + dldemchf, A, 
                i:(i+width-1))
    details <- summary.lm(model)
    tmp <- coefficients(model)
    c( USD = tmp[2], JPY = tmp[3], DEM = tmp[4], 
           R2 = details$r.squared, RMSE = details$sigma )
} )
dat <- as.data.frame(t(dat))
attach(dat)


Date:   Mon, 22 Mar 2004 10:28:45 +0530 
From:   Ajay Shah <ajayshah at mayin.org>
To:   <sdavis2 at mail.nih.gov>, <r-help at stat.math.ethz.ch> 
Subject:   Re: [R] Operating on windows of data 

 > I have a data set that is comprised of, for simplicity, a vector of
> numbers that I want to march across+in overlapping windows of say 10
> values each, computing a couple of values for each window. Is there
> +a vectorized way to do this, or do I truly need to resort to
> looping--I think so? Any other clever thoughts?
I'm not sure this is clever, but I use the subset capabilities of
R. See

http://www.mayin.org/ajayshah/KB/R/ols.html

3 of the examples there are related to your question, though it's
still using loops. :-)

-- 
Ajay Shah Consultant
ajayshah at mayin.org Department of Economic Affairs
http://www.mayin.org/ajayshah Ministry of Finance, New Delhi

Gabor Grothendieck

2004-Mar-22 13:33 UTC

head link

[R] Operating on windows of data

> On Mon, Mar 22, 2004 at 01:39:28AM -0500, Gabor Grothendieck wrote:
> > You can retain the trick of using subset and still get
> > rid of the loop in:
> > 
> > http://www.mayin.org/ajayshah/KB/R/EXAMPLES/rollingreg.R
> > 
> > by using sapply like this (untested):
> > 
> > dat <- sapply( seq(T-width), function(i) {
> > model <- lm(dlinrchf ~ dlusdchf + dljpychf + dldemchf, A, 
> > i:(i+width-1))
> > details <- summary.lm(model)
> > tmp <- coefficients(model)
> > c( USD = tmp[2], JPY = tmp[3], DEM = tmp[4], 
> > R2 = details$r.squared, RMSE = details$sigma )
> > } )
> > dat <- as.data.frame(t(dat))
> > attach(dat)
> 
> This brings me to a question I've always had about "the R
way" of
> avoiding loops. Yes, the sapply() approach above works. My question
> is: Why is this much better than writing it using loops?
> 
> Loops tap into the intuition of millions of people who have grown up
> around procedural languages. Atleast to a person like me, I can read
> code involving loops effortlessly.
> 
> And I don't see how much faster the sapply() will be. Intuitively, we
> may think that the sapply() results in C code getting executed (in the
> R sources), while the for loop results in interpretation overhead, and
> so the sapply() is surely faster. But when the body of the for loop
> involves a weighty thing like a QR decomposition (for the OLS), that
> would seem to dominate the cost - as far as I can tell.
Its true that vectorizing loops can make it faster but in my
mind the main advantage is the conceptual one of working with
whole objects at a time and consequently reduction of code size.

Admittedly, the example above does not get you much although
even here it is slightly shorter than the loop version as it 
puts the arrays together for you rather than making you set them 
up yourself.  Also, not shown, there are subsequent 
summary() statements in your file and there would be
a further opportunity for code reduction 
since now your data is in a data frame rather than 
individual vectors so you could combine all the summary statements
into one.

Its really not the best example of the desired approach since 
it chickens out and uses the loop indices to sapply() over but 
I don't think one can expect a complete win for the 
vectorized approach in every single case.

Richard A. O'Keefe

2004-Mar-23 01:16 UTC

head link

[R] Operating on windows of data

Martin Maechler <maechler at stat.math.ethz.ch> wrote that
	[using apply functions is] not [better than using loops],
	not at all.

He would be right in all he says if time efficiency were the only
reason to prefer one coding style to another.

Loop-free notatation can reduce the number of variables in scope,
making the code easier to read and rearrange.  By separating "what to
do with the elements" from "how to find the elements", it can
lead to
pieces which are separately reusable.  By reducing the volume of
code, it can result in code with fewer mistakes.

(Note the word "can" in each of those sentences.)

Richard A. O'Keefe

2004-Mar-24 02:56 UTC

head link

[R] Operating on windows of data

It appears that I owe Martin Maechler <maechler at stat.math.ethz.ch>
an apology for not realising the importance of the context for what
I quoted.  I apologise.

	but *please* note again the code snippet we where talking about :
	
	    > dat <- sapply( seq(T-width), function(i) {
	    >     model <- lm(dlinrchf ~ dlusdchf + dljpychf + dldemchf, A, 
	    >                 i:(i+width-1))
	    >     details <- summary.lm(model)
	    >     tmp <- coefficients(model)
	    >     c( USD = tmp[2], JPY = tmp[3], DEM = tmp[4], 
	    >            R2 = details$r.squared, RMSE = details$sigma )
	    > } )
	    > dat <- as.data.frame(t(dat))
	    > attach(dat)
	
	which is really an example where sapply() rather obfuscates than
	clarifies.

It's not clear to me that the choice of sapply() -vs- 'for' really
has
anything to do with it here.  Hmm, maybe it does.  Looking at this code,
I can see at a glance that
    - dat will be a matrix
    - it will have columns 1:T-width
    - it will have rows USD, JPY, DEM, R2, RMSE
    - each column reflects one linear model
and I don't have to decode a lot of indexed assignment statements to
figure this out.

The first way to improve clarity would be to use keyword parameters on
the call to lm, e.g., lm(..., data = A, subset = i:(i+width-1)).

The second way to improve clarity would be to use character indices on
tmp rather than integer indices:
	
	coef <- coefficients(model)
	c(USD = coef["dlusdchf"],
	  JPY = coef["djpychf"],
	  DEM = coef["dldemchf"],
	  R2  = details$r.squared,
	  RMSE= details$sigma)

Hmm.  My "first" and "second" ways are both the same: use
names rather
than position.  There is one more clarity improvement to recommend, and
it has nothing to do with using or avoiding sapply(), at least not
directly.

    # dfapply(X, FUN, ...) is like sapply() but
    # expects FUN to return c(x1=...,xn=...) vectors which it
    # turns into rows of the data frame that it returns.

    dfapply <- function (...) as.data.frame(t(sapply(...)))
 
    # Make "dat" a data frame with columns USD, JPY, DEM, R2, RMSE
    # and rows 1:T-width, the ith row extracted from a linear
    # regression on cases i:(i+width-1).

    dat <- dfapply(seq(T-width), function (i) {
        model <- lm(dlinrchf ~ dlusdchf + dljpychf + dldemchf,
		    data = A, subset = i:(i+width-1))
	s <- summary.lm(model)
	v <- coefficients(model)
	c(USD = v["dlusdchf"], JPY = v["djpychf"], DEM =
v["dldemchf"],
	  R2 = s$r.squared, RMSE = s$sigma)
    })

Now here's where using sapply() instead of 'for' does pay off, even
here.
We ask the question "where is 'i' used?"  Because we're
*not* using i in
any visible index calculations, there is only one place that 'i' is
used,
and that's in the subset= argument of the lm() call.

That prompts the question "is there any way to exploit the fact that the
rest of the linear model is the same?  Depending on the relative sizes
of A and T-width, there may well be, and Statistical Models in S explains,
if memory serves me, how to do this kind of thing.  But without the fact
that i is only used in one place, it might not be as obvious that it was
worth thinking about.

Gabor Grothendieck

2004-Mar-24 05:09 UTC

head link

[R] Operating on windows of data

Far from being an example where vectorization
seems to have only minor advantages, if any, over a 
non-vectorized approach, it turns out that this is a shining 
example of why vectorization produces higher quality code.

Consider that the loop approach and all the other
solutions so far have an error!  The last time point is
never included in any window.

The genesis of the error is that index arithmetic for
producing the loop limits and sapply index is sufficiently
complex that no one noticed the error.  Vectorizing
*sufficiently* eliminates the calculation of these limits
entirely and so avoids the possibility of mistake.  The
problem so far is that we just did not take the vectorized
approach far enough (mea culpa).

The following solution uses embed (which can be regarded as
a vectorized sliding window) to get us out of considering such
explicit limits. By eliminating the calculation of these limits
it avoids any potential for this sort of error in the first place.
(It also avoids the transpose by producing data frames at
each step and rbinding them.)


do.call( "rbind", apply( embed(T, width), 1, 
   function(idx) {
      model <- lm(dlinrchf ~ dlusdchf + dljpychf + dldemchf,
		   data = A, subset = idx)
      s <- summary.lm(model)
      v <- coefficients(model)
      data.frame(USD = v["dlusdchf"], JPY = v["djpychf"], 
              DEM = v["dldemchf"], R2 = s$r.squared, RMSE = s$sigma)
   } ) 
)

Note that the above example loops over  vectors of indices
instead of single indexes.  There is an opportunity to do away 
with indices altogether although it is at the considerable
expense of space efficiency so, in practice, one would likely 
be content with the last solution.  Nevertheless, it is of 
interest to display the next solution even if only for sake of completing the
thought.

Define embed.data.frame to produce a list of data frames that
represents the sliding windows.  Perform a lapply over those:

embed.data.frame <- function(df,width) 
             apply(embed(1:nrow(df),width),1,function(idx)df[idx,])

do.call( "rbind", lapply(embed.data.frame(df,width), 
  function(df) {
    model <- lm(dlinrchf ~ dlusdchf + dljpychf + dldemchf, data = df)
    s <- summary.lm(model)
    v <- coefficients(model)
    data.frame(USD = v["dlusdchf"], JPY = v["djpychf"], 
          DEM = v["dldemchf"], R2 = s$r.squared, RMSE = s$sigma)
  } )
)

This gets rid of all reference to idx and nrow(df).


---

Date:   Wed, 24 Mar 2004 14:56:46 +1200 (NZST) 
From:   Richard A. O'Keefe <ok at cs.otago.ac.nz>
To:   <r-help at stat.math.ethz.ch> 
Subject:   Re: [R] Operating on windows of data 

[snip]

Hmm. My "first" and "second" ways are both the same: use
names rather
than position. There is one more clarity improvement to recommend, and
it has nothing to do with using or avoiding sapply(), at least not
directly.

# dfapply(X, FUN, ...) is like sapply() but
# expects FUN to return c(x1=...,xn=...) vectors which it
# turns into rows of the data frame that it returns.

dfapply <- function (...) as.data.frame(t(sapply(...)))

# Make "dat" a data frame with columns USD, JPY, DEM, R2, RMSE
# and rows 1:T-width, the ith row extracted from a linear
# regression on cases i:(i+width-1).

dat <- dfapply(seq(T-width), function (i) {
model <- lm(dlinrchf ~ dlusdchf + dljpychf + dldemchf,
           data = A, subset = i:(i+width-1))
     s <- summary.lm(model)
     v <- coefficients(model)
     c(USD = v["dlusdchf"], JPY = v["djpychf"], DEM =
v["dldemchf"],
      R2 = s$r.squared, RMSE = s$sigma)
})

Now here's where using sapply() instead of 'for' does pay off, even
here.
We ask the question "where is 'i' used?" Because we're
*not* using i in
any visible index calculations, there is only one place that 'i' is
used,
and that's in the subset= argument of the lm() call.

That prompts the question "is there any way to exploit the fact that the
rest of the linear model is the same? Depending on the relative sizes
of A and T-width, there may well be, and Statistical Models in S explains,
if memory serves me, how to do this kind of thing. But without the fact
that i is only used in one place, it might not be as obvious that it was
worth thinking about.

Gabor Grothendieck

2004-Mar-24 05:50 UTC

head link

[R] Operating on windows of data

[I am having some problems with returned email.  Sorry if
you get this twice.]

Well so much for pontificating. My "shining example" has an 
error too although this error is less serious since it will
give an error message. embed(T,width) should be embed(1:T,width).

But perhaps there is more to this than we realize since the
subsequent (space inefficient) solution which I thought might 
be taking it too far does not require calculation of T except
in embed.data.frame which is a reusable function that would
likely be of higher quality through reuse and in any case the
fact is that it did avoid this error.


Date:   Wed, 24 Mar 2004 00:09:19 -0500 (EST) 
From:   Gabor Grothendieck <ggrothendieck at myway.com>
To:   <ok at cs.otago.ac.nz>, <r-help at stat.math.ethz.ch> 
Cc:   <ajayshah at mayin.org> 
Subject:   Re: [R] Operating on windows of data 

 


Far from being an example where vectorization
seems to have only minor advantages, if any, over a 
non-vectorized approach, it turns out that this is a shining 
example of why vectorization produces higher quality code.

Consider that the loop approach and all the other
solutions so far have an error! The last time point is
never included in any window.

The genesis of the error is that index arithmetic for
producing the loop limits and sapply index is sufficiently
complex that no one noticed the error. Vectorizing
*sufficiently* eliminates the calculation of these limits
entirely and so avoids the possibility of mistake. The
problem so far is that we just did not take the vectorized
approach far enough (mea culpa).

The following solution uses embed (which can be regarded as
a vectorized sliding window) to get us out of considering such
explicit limits. By eliminating the calculation of these limits
it avoids any potential for this sort of error in the first place.
(It also avoids the transpose by producing data frames at
each step and rbinding them.)


do.call( "rbind", apply( embed(T, width), 1, 
function(idx) {
model <- lm(dlinrchf ~ dlusdchf + dljpychf + dldemchf,
           data = A, subset = idx)
s <- summary.lm(model)
v <- coefficients(model)
data.frame(USD = v["dlusdchf"], JPY = v["djpychf"], 
DEM = v["dldemchf"], R2 = s$r.squared, RMSE = s$sigma)
} ) 
)

Note that the above example loops over vectors of indices
instead of single indexes. There is an opportunity to do away 
with indices altogether although it is at the considerable
expense of space efficiency so, in practice, one would likely 
be content with the last solution. Nevertheless, it is of 
interest to display the next solution even if only for sake of completing the
thought.

Define embed.data.frame to produce a list of data frames that
represents the sliding windows. Perform a lapply over those:

embed.data.frame <- function(df,width) 
apply(embed(1:nrow(df),width),1,function(idx)df[idx,])

do.call( "rbind", lapply(embed.data.frame(df,width), 
function(df) {
model <- lm(dlinrchf ~ dlusdchf + dljpychf + dldemchf, data = df)
s <- summary.lm(model)
v <- coefficients(model)
data.frame(USD = v["dlusdchf"], JPY = v["djpychf"], 
DEM = v["dldemchf"], R2 = s$r.squared, RMSE = s$sigma)
} )
)

This gets rid of all reference to idx and nrow(df).


---

Date: Wed, 24 Mar 2004 14:56:46 +1200 (NZST) 
From: Richard A. O'Keefe <ok at cs.otago.ac.nz>
To: <r-help at stat.math.ethz.ch> 
Subject: Re: [R] Operating on windows of data 

[snip]

Hmm. My "first" and "second" ways are both the same: use
names rather
than position. There is one more clarity improvement to recommend, and
it has nothing to do with using or avoiding sapply(), at least not
directly.

# dfapply(X, FUN, ...) is like sapply() but
# expects FUN to return c(x1=...,xn=...) vectors which it
# turns into rows of the data frame that it returns.

dfapply <- function (...) as.data.frame(t(sapply(...)))

# Make "dat" a data frame with columns USD, JPY, DEM, R2, RMSE
# and rows 1:T-width, the ith row extracted from a linear
# regression on cases i:(i+width-1).

dat <- dfapply(seq(T-width), function (i) {
model <- lm(dlinrchf ~ dlusdchf + dljpychf + dldemchf,
data = A, subset = i:(i+width-1))
s <- summary.lm(model)
v <- coefficients(model)
c(USD = v["dlusdchf"], JPY = v["djpychf"], DEM =
v["dldemchf"],
R2 = s$r.squared, RMSE = s$sigma)
})

Now here's where using sapply() instead of 'for' does pay off, even
here.
We ask the question "where is 'i' used?" Because we're
*not* using i in
any visible index calculations, there is only one place that 'i' is
used,
and that's in the subset= argument of the lm() call.

That prompts the question "is there any way to exploit the fact that the
rest of the linear model is the same? Depending on the relative sizes
of A and T-width, there may well be, and Statistical Models in S explains,
if memory serves me, how to do this kind of thing. But without the fact
that i is only used in one place, it might not be as obvious that it was
worth thinking about.

Seemingly Similar Threads

Search for more maybe matching threads

R help - Mar 2004 - Operating on windows of data

[R] Operating on windows of data

[R] Operating on windows of data

[R] Operating on windows of data

[R] Operating on windows of data

[R] Operating on windows of data

[R] Operating on windows of data

[R] Operating on windows of data

[R] Operating on windows of data

[R] Operating on windows of data

[R] Operating on windows of data

Seemingly Similar Threads