thr3ads.net - R help - [R] Moving window regressions - how can I improve this code? [Apr 2004]

If this information is useful, please help other people find it:
Share via:

Ajay Shah

2004-Apr-24 05:09 UTC

[R] Moving window regressions - how can I improve this code?

I wrote a function which does "moving window" regressions. E.g. if
there are 100 observations and the window width is 50, then I first
run the regression for observations 1..50, then for 2..51, and so on.

I am extremely pleased with R in my experience with writing this,
since I was able to pass the model as an argument into the function
:-) Forgive me if I sound naive, but that's rocket science to me!!

For a regression with K explanatory variables, I make a matrix with
2*K+2 columns, where I capture:
    K coefficients and K standard errors
    the residual sigma
    R^2

My code is:

   # ------------------------------------------------------------ 
   movingWindowRegression <- function(data, T, width, model, K) {
     results = matrix(nrow=T, ncol=2*K+2)
     for (i in width:T) {
       details <- summary.lm(lm(as.formula(model), data[(i-width+1):i,]))
       n=1;
       for (j in 1:K) {
         results[i, n]   = details$coefficients[j, 1]
         results[i, n+1] = details$coefficients[j, 2]
         n = n + 2
       }
       results[i, n] = details$sigma
       results[i, n+1] = details$r.squared
     }
     return(results)
   }
   
   # Simulate some data for a linear regression
   T = 20
   x = runif(T); y = 2 + 3*x + rnorm(T);
   D = data.frame(x, y)
   
   r = movingWindowRegression(D, T=T, width=10, model="y ~ x", K=2)
   print(r)
   # ------------------------------------------------------------ 

I would be very happy if you could look at this and tell me how to do
things better.

I have two specific questions:

  1. I find it silly that I have to manually pass K and T into the
     function. It would be so much nicer to have:

        r = movingWindowRegression(D,      width=10, model="y ~ x")
     instead of the existing
        r = movingWindowRegression(D, T=T, width=10, model="y ~ x",
K=2)

     How can the function inspect the data frame D and learn the
     number of rows?

     How can the function inspect the model specification string and
     learn K, the number of explanatory variables?

  2. "The R way" consists of avoiding loops when the code is
     vectorisable. I am using a loop to copy out from
     details$coefficients into the columns of results[i,]. Is there a
     better way?

-- 
Ajay Shah                                                   Consultant
ajayshah at mayin.org                      Department of Economic Affairs
http://www.mayin.org/ajayshah           Ministry of Finance, New Delhi

Gabor Grothendieck

2004-Apr-24 06:53 UTC

head link

[R] Moving window regressions - how can I improve this code?

1. ?nrow
2. ?all.vars

Getting rid of loop was discussed on Mar 22 and there was some debate
on whether or not it was a good idea although it turned out there was
a bug in the code that only the loop free version brought out.  
See archives.

Ajay Shah <ajayshah <at> mayin.org> writes:

: 
: I wrote a function which does "moving window" regressions. E.g. if
: there are 100 observations and the window width is 50, then I first
: run the regression for observations 1..50, then for 2..51, and so on.
: 
: I am extremely pleased with R in my experience with writing this,
: since I was able to pass the model as an argument into the function
:  Forgive me if I sound naive, but that's rocket science to me!!
: 
: For a regression with K explanatory variables, I make a matrix with
: 2*K+2 columns, where I capture:
:     K coefficients and K standard errors
:     the residual sigma
:     R^2
: 
: My code is:
: 
:    # ------------------------------------------------------------ 
:    movingWindowRegression <- function(data, T, width, model, K) {
:      results = matrix(nrow=T, ncol=2*K+2)
:      for (i in width:T) {
:        details <- summary.lm(lm(as.formula(model), data[(i-width+1):i,]))
:        n=1;
:        for (j in 1:K) {
:          results[i, n]   = details$coefficients[j, 1]
:          results[i, n+1] = details$coefficients[j, 2]
:          n = n + 2
:        }
:        results[i, n] = details$sigma
:        results[i, n+1] = details$r.squared
:      }
:      return(results)
:    }
: 
:    # Simulate some data for a linear regression
:    T = 20
:    x = runif(T); y = 2 + 3*x + rnorm(T);
:    D = data.frame(x, y)
: 
:    r = movingWindowRegression(D, T=T, width=10, model="y ~ x", K=2)
:    print(r)
:    # ------------------------------------------------------------ 
: 
: I would be very happy if you could look at this and tell me how to do
: things better.
: 
: I have two specific questions:
: 
:   1. I find it silly that I have to manually pass K and T into the
:      function. It would be so much nicer to have:
: 
:         r = movingWindowRegression(D,      width=10, model="y ~ x")
:      instead of the existing
:         r = movingWindowRegression(D, T=T, width=10, model="y ~ x",
K=2)
: 
:      How can the function inspect the data frame D and learn the
:      number of rows?
: 
:      How can the function inspect the model specification string and
:      learn K, the number of explanatory variables?
: 
:   2. "The R way" consists of avoiding loops when the code is
:      vectorisable. I am using a loop to copy out from
:      details$coefficients into the columns of results[i,]. Is there a
:      better way?
:

Douglas Bates

2004-Apr-24 13:58 UTC

head link

[R] Moving window regressions - how can I improve this code?

Ajay Shah <ajayshah at mayin.org> writes:
> I wrote a function which does "moving window" regressions. E.g.
if
> there are 100 observations and the window width is 50, then I first
> run the regression for observations 1..50, then for 2..51, and so on.
> 
> I am extremely pleased with R in my experience with writing this,
> since I was able to pass the model as an argument into the function
> :-) Forgive me if I sound naive, but that's rocket science to me!!
> 
> For a regression with K explanatory variables, I make a matrix with
> 2*K+2 columns, where I capture:
>     K coefficients and K standard errors
>     the residual sigma
>     R^2
> 
> My code is:
> 
>    # ------------------------------------------------------------ 
>    movingWindowRegression <- function(data, T, width, model, K) {
>      results = matrix(nrow=T, ncol=2*K+2)
>      for (i in width:T) {
>        details <- summary.lm(lm(as.formula(model),
data[(i-width+1):i,]))
>        n=1;
>        for (j in 1:K) {
>          results[i, n]   = details$coefficients[j, 1]
>          results[i, n+1] = details$coefficients[j, 2]
>          n = n + 2
>        }
>        results[i, n] = details$sigma
>        results[i, n+1] = details$r.squared
>      }
>      return(results)
>    }
>    
>    # Simulate some data for a linear regression
>    T = 20
>    x = runif(T); y = 2 + 3*x + rnorm(T);
>    D = data.frame(x, y)
>    
>    r = movingWindowRegression(D, T=T, width=10, model="y ~ x",
K=2)
>    print(r)
>    # ------------------------------------------------------------ 
> 
> I would be very happy if you could look at this and tell me how to do
> things better.
First, thanks for posting the code.  It takes courage to send your
code to the list like this for public commentary.  However, questions
like this provide instructive examples.

Some comments:

 As Gabor pointed out, there is a generic function nrow that can be
 applied to data frames.

 As a matter of style, it is better to use
    details <- summary(lm(model, data[<row range>,]))
 That is, call the generic function "summary", not the specific name
 of the method "summary.lm".  It is redundant to use
summary.lm(lm(...))
 and this construction is not guaranteed to continue to be valid.

 With regard to the looping, I would suggest using a list of summary
 objects as the basic data structure within your function, then
 extracting the pieces that you want from that list using lapply or sapply.

 That is, I would start with 

movingWindow <- function(formula, data, width, ...) {
    nr = nrow(data)
    width = as.integer(width)[1]
    if (width <= 0 || width >= nr)
        stop(paste("width must be in the range 1,...,", nr,
sep=""))
    nreg = nr - width
    base = 0:(width - 1)
    sumrys <- lapply(1:nreg,
                     function(st) {
                         summary(lm(formula, data[base + st,]))
                     })
    sumrys
}

  I changed the argument name "model" to "formula" and
changed the
  order of the arguments to the standard order used in R modeling
  functions.
 
  You may not want to use this form for very large data sets because
  the raw summaries could be very large.  However this is a place to
  start.
  
  The reason for creating a list is so you can use sapply and lapply to
  extract results from the list.  To get the coefficients and standard
  errors from a summary, use the coef generic and the select columns from
  the result.  For example,
> data(women)
> sumrys = movingWindow(weight ~ height, women, 9)
> unlist(lapply(sumrys, function(sm) sm$sigma))  # sigma values[1] 0.4714045 0.3248931 0.4481000 0.5613193 0.6042180
0.7627228> sapply(sumrys, "[[", "sigma")                  # same
thing[1] 0.4714045 0.3248931 0.4481000 0.5613193 0.6042180
0.7627228> sapply(sumrys, function(sm) coef(sm)[,1:2])             [,1]         [,2]         [,3]         [,4]        [,5]
[1,] -59.77777778 -67.12777778 -73.42222222 -81.97222222 -91.7777778
[2,]   3.00000000   3.11666667   3.21666667   3.35000000   3.5000000
[3,]   3.77647036   2.64466036   3.70537766   4.71400546   5.1522157
[4,]   0.06085806   0.04194352   0.05784947   0.07246601   0.0780042
              [,6]
[1,] -106.12777778
[2,]    3.71666667
[3,]    6.60219186
[4,]    0.09846709> 
  The last part shows how to get the coefficients estimates and their
  standard errors as columns of a matrix.  I think I would return the
  coefficients and standard errors separately, as in

movingWindow <- function(formula, data, width, ...) {
    nr = nrow(data)
    width = as.integer(width)[1]
    if (width <= 0 || width >= nr)
        stop(paste("width must be in the range 1,...,", nr,
sep=""))
    nreg = nr - width
    base = 0:(width - 1)
    sumrys <- lapply(1:nreg,
                     function(st) {
                         summary(lm(formula, data[base + st,]))
                     })
    list(coefficients = sapply(sumrys, function(sm)
coef(sm)[,"Estimate"]),
         Std.Error = sapply(sumrys, function(sm) coef(sm)[,"Std.
Error"]),
         sigma = sapply(sumrys, "[[", "sigma"),
         r.squared = sapply(sumrys, "[[", "r.squared"))
}
> movingWindow(weight ~ height, women, width = 9)$coefficients
                 [,1]       [,2]       [,3]      [,4]      [,5]        [,6]
(Intercept) -59.77778 -67.127778 -73.422222 -81.97222 -91.77778 -106.127778
height        3.00000   3.116667   3.216667   3.35000   3.50000    3.716667

$Std.Error
                  [,1]       [,2]       [,3]       [,4]      [,5]       [,6]
(Intercept) 3.77647036 2.64466036 3.70537766 4.71400546 5.1522157 6.60219186
height      0.06085806 0.04194352 0.05784947 0.07246601 0.0780042 0.09846709

$sigma
[1] 0.4714045 0.3248931 0.4481000 0.5613193 0.6042180 0.7627228

$r.squared
[1] 0.9971276 0.9987338 0.9977411 0.9967352 0.9965351 0.9951107

  The next enhancements come from realizing that you do not need
  to call lm repeatedly.  The lm function involves many steps working
  with the formula and the data frame and optional arguments to
  produce a model matrix and response vector.   You only need to do
  that once.  After that you can call lm.fit on subsets of the rows.

  If you arrange that the call to lm is based on the "matched call" to
  your function then you can get the ability to work with standard
  modeling arguments such as subset and na.action for free.  This may
  seem like an obscure point but there are big gains in having your
  modeling function behave like all the other R modeling functions.
  Thomas Lumley's article on "Standard non-standard evaluation in
R"
  (which I would say was available at developer.r-project.org except
  that the developer.r-project.org site is still down) explains this
  in more detail.

  Furthermore, by doing the initial lm fit you find out how many
  coefficients there are in the model and can do a better job of
  checking for a sensible "width" argument.

  There are subtleties in this version of movingWindow2 involving
  manipulation of the arguments in the original call to lm but these
  are explained in the manual page for lm.  You do need to set the
  class and the terms component in the result of lm.fit before you can
  apply summary to it.

movingWindow2 <- function(formula, data, width, ...) {
    mCall = match.call()
    mCall$width = NULL
    mCall[[1]] = as.name("lm")
    mCall$x = mCall$y = TRUE
    bigfit = eval(mCall, parent.frame())
    ncoef = length(coef(bigfit))
    nr = nrow(data)
    width = as.integer(width)[1]
    if (width <= ncoef || width >= nr)
        stop(paste("width must be in the range ", ncoef + 1,
                   ",...,", nr - 1, sep=""))
    y = bigfit$y
    x = bigfit$x
    terms = bigfit$terms
    base = 0:(width - 1)
    sumrys <- lapply(1:(nr - width),
                     function(st) {
                         inds = base + st
                         fit = lm.fit(x[inds,], y[inds])
                         fit$terms = terms
                         class(fit) = "lm"
                         summary(fit)
                     })
    list(coefficients = sapply(sumrys, function(sm)
coef(sm)[,"Estimate"]),
         Std.Error = sapply(sumrys, function(sm) coef(sm)[,"Std.
Error"]),
         sigma = sapply(sumrys, "[[", "sigma"),
         r.squared = sapply(sumrys, "[[", "r.squared"))
}
> movingWindow2(weight ~ height, women, 9)$coefficients
                 [,1]       [,2]       [,3]      [,4]      [,5]        [,6]
(Intercept) -59.77778 -67.127778 -73.422222 -81.97222 -91.77778 -106.127778
height        3.00000   3.116667   3.216667   3.35000   3.50000    3.716667

$Std.Error
                  [,1]       [,2]       [,3]       [,4]      [,5]       [,6]
(Intercept) 3.77647036 2.64466036 3.70537766 4.71400546 5.1522157 6.60219186
height      0.06085806 0.04194352 0.05784947 0.07246601 0.0780042 0.09846709

$sigma
[1] 0.4714045 0.3248931 0.4481000 0.5613193 0.6042180 0.7627228

$r.squared
[1] 0.9971276 0.9987338 0.9977411 0.9967352 0.9965351 0.9951107

  The use of lapply and sapply on lists is a style of programming
  called "functional programming".  The functional programming
  facilities in the S language are not as widely recognized and
  appreciated as they should be.  Phil Spector's book "An Introduction
  to S and S-PLUS" is one place where these are discussed and
  illustrated.

P.S. If building a list of summaries is taking too much memory then
replace summary(fit) by summary(fit)[c("coefficients",
"sigma", "r.squared")]

Warnes, Gregory R

2004-Apr-26 23:03 UTC

head link

[R] Moving window regressions - how can I improve this code?

movingWindow can be made even simpler when using the running() with recent
versions of the gregmisc library:


movingWindow <- function(formula, data, width, ...)
  {
    index <- 1:nrow(data)
    running(index, fun=function(st) summary(lm(formula, data[st,])),
width=width)
  }

this returns a *matrix* with one row per element of summary.  To get the
coefficients, do

movingWindow( weight ~ height, women, 5)["coefficients",]

-Greg

> -----Original Message-----
> From: r-help-bounces at stat.math.ethz.ch
> [mailto:r-help-bounces at stat.math.ethz.ch]On Behalf Of Gabor 
> Grothendieck
> Sent: Sunday, April 25, 2004 10:07 AM
> To: r-help at stat.math.ethz.ch
> Subject: Re: [R] Moving window regressions - how can I improve this
> code?
> 
> 
> Gabor Grothendieck <ggrothendieck <at> myway.com> writes:
>  
> > movingWindow <- function(formula, data, width, ...) {
> >     nr <- nrow(data)
> >     width <- as.integer(width)[1]
> >     stopifnot( width > 0, width <= nr )
> >     indices <- as.data.frame( t( embed( 1:nr, width ) ) )
> >     lapply(indices, function(st) summary(lm(formula, data[st,])) )
> > }
> > 
> 
> Just one further simplification using apply instead of lapply to
> eliminate having to transform embed:
> 
> movingWindow <- function(formula, data, width, ...) {
>     nr <- nrow(data)
>     width <- as.integer(width)[1]
>     stopifnot( width > 0, width <= nr )
>     apply( embed(1:nr, width), 1, # rows are indices of 
> successive windows
> 		function(st) summary(lm(formula, data[st,])) )
> }
> 
> This could also be used in movingWindow2, as well.
> 
> ______________________________________________
> R-help at stat.math.ethz.ch mailing list
> https://www.stat.math.ethz.ch/mailman/listinfo/r-help
> PLEASE do read the posting guide! 
> http://www.R-project.org/posting-guide.html
> 

LEGAL NOTICE\ Unless expressly stated otherwise, this messag...{{dropped}}

Gabor Grothendieck

2004-May-16 12:28 UTC

head link

[R] Moving window regressions - how can I improve this code?

Others have already addressed the bug.

Your other question seems to be how to get some simplification.
Note that Greg Warnes responded in:
http://article.gmane.org/gmane.comp.lang.r.general/18033
regarding the use of running in gregmisc. 

Another thing you could do
if you want to retain the speed of the lm.fit solution yet
not deal with the argument manipulations in that solution is
that you could pass the lm structure to your routine instead 
of reconstructing the lm structure in the routine, itself.  
That provides some simplification at the expense of a slightly 
more complex calling sequence:

# e.g. movingWindow3( lm(weight ~ height, women, x=TRUE, y=TRUE), 9 )
movingWindow3 <- function(lm., width) {
	x = lm.$x
	y = lm.$y
	nr = NROW(y)
	ncoef = length(coef(lm.))
	stopifnot(width >= ncoef, width <= nr)
	terms = lm.$terms
	inds = embed(1:nr, width)[, width:1]
	sumrys <- apply(inds, 1, function(ix) {
			fit = lm.fit(x[ix,,drop=F], y[ix])
			fit$terms = terms
			class(fit) = "lm"
			summary(fit)
	})
	list(coefficients = sapply(sumrys, function(sm)
coef(sm)[,"Estimate"]),
	  Std.Error = sapply(sumrys, function(sm) coef(sm)[,"Std. Error"]),
	  sigma = sapply(sumrys, "[[", "sigma"),
	  r.squared = sapply(sumrys, "[[", "r.squared"))
}
z3 <- movingWindow3( lm(weight ~ height, women, x=TRUE, y=TRUE), 9 


--- 
Ajay Shah <ajayshah at mayin.org> writes:
> I had written a posting, some weeks ago, where I had written
> fortrannish code for "moving window regressions" (also called
'rolling
> window regression'), and asked how to do the code better. Both of you
> had put much pain on this, and emerged with this great code:
> 
> data(women)
> movingWindow2 <- function(formula, data, width, ...) {
> mCall = match.call()
> mCall$width = NULL
> mCall[[1]] = as.name("lm")
> mCall$x = mCall$y = TRUE
> bigfit = eval(mCall, parent.frame())
> ncoef = length(coef(bigfit))
> nr = nrow(data)
> width = as.integer(width)[1]
> stopifnot(width >= ncoef, width <= nr)
> y = bigfit$y
> x = bigfit$x
> terms = bigfit$terms
> inds = embed(seq(nr), width)[, rev(seq(width))]
> sumrys <- lapply(seq(nrow(inds)),
> function(st) {
> ind = inds[st,]
> fit = lm.fit(x[ind,], y[ind])
> fit$terms = terms
> class(fit) = "lm"
> summary(fit)
> })
> list(coefficients = sapply(sumrys, function(sm)
coef(sm)[,"Estimate"]),
> Std.Error = sapply(sumrys, function(sm) coef(sm)[,"Std. Error"]),
> sigma = sapply(sumrys, "[[", "sigma"),
> r.squared = sapply(sumrys, "[[", "r.squared"))
> }
> 
> I have one piece of information, and one bugreport:
> 
> * I timed this, and compared it against my fortrannish code, and
> this is roughly 2.5x faster :-)
> 
> > junk = data.frame(x=runif(1000), y=runif(1000))
> > system.time(movingWindowRegression(women, 1000, 9, weight ~ height,
2))
> [1] 20.07 0.01 20.80 0.00 0.00
> > system.time(movingWindow2(y ~ x, junk, 10))
> [1] 8.27 0.03 8.43 0.00 0.00
You seem to be comparing timings on different data sets and models.
Did you mean to use junk and y ~ x in your first timing call?
> (My notebook is a Celeron @ 500 Mhz).
> 
> * I find it breaks when I ask him to do a regression on 1:
> 
> > r = movingWindowRegression(junk, 1000, 10, y ~ 1, 1)
> > movingWindow2(y ~ 1, junk, 10) 
> Error in lm.fit(x[ind, ], y[ind]) : `x' must be a matrix

Reasonably Related Threads

Search for more seemingly similar threads

R help - Apr 2004 - Moving window regressions - how can I improve this code?

[R] Moving window regressions - how can I improve this code?

[R] Moving window regressions - how can I improve this code?

[R] Moving window regressions - how can I improve this code?

[R] Moving window regressions - how can I improve this code?

[R] Moving window regressions - how can I improve this code?

Reasonably Related Threads