Philippe Hensel
2012-Jul-27 14:30 UTC
[R] How to run regressions over increasing time series
Hello,
I would like to run a series of regressions on my data (response variable
over time):
1) regression from T1 to T2
2) regression from T1 through T3
3) regression from T1 through T4, etc.
I have been struggling to find a way to do this through commands, as
opposed to cutting up the data manually (my dataset has over 6000
rows/observations).
An illustrative dataset can be created thusly:
dat <- structure(list(Years= c(0, 0, 0, 0.36, 0.36, 0.36, 0.67, 0.67,
0.67, 0.74, 0.74, 0.74),
Obs = c(0, 0, 0, 2.3, 1.9, 2.1, 4.5, 4.5, 4.6, 5.3, 5.5, 5.6)),
.Names = c("Years","Obs"), row.names = c(NA, -12L), class =
"data.frame")
I was trying to use a loop to create subsets of the data corresponding to
the sets of time intervals required (e.g. T1 to T2, T1 through T3, etc.),
but I am having trouble generating a new variable to index time (instead of
the decimal values). I was figuring that indexing time would allow me to
use a loop to generate the required subsets of data.
I can figure out how many time periods I have and assign a sequential
number to them:
Years <- unique(set.data$Yrs)
Yrs_count <- seq(from = 1, to = length(Years), by = 1)
And then I can combine these into a dataframe:
Yrs_combo <- cbind(Years,Yrs_count)
However, how do I combine this data frame with my larger dataset, which has
different numbers of rows?
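One possible way to attach such an index, sketched under the assumption that the data frame is the illustrative `dat` defined above: `match()` looks up each row's time value in the vector of unique time points, so no join is needed. The names `time_points`, `lookup`, and `dat_merged` are illustrative only and do not come from the original post; the `merge()` variant is shown as an alternative, not as the recommended route.

```r
# Illustrative data, same values as the `dat` example above
dat <- data.frame(
  Years = c(0, 0, 0, 0.36, 0.36, 0.36, 0.67, 0.67, 0.67, 0.74, 0.74, 0.74),
  Obs   = c(0, 0, 0, 2.3, 1.9, 2.1, 4.5, 4.5, 4.6, 5.3, 5.5, 5.6)
)

# Sorted unique time points (T1, T2, ...); `time_points` is a hypothetical name
time_points <- sort(unique(dat$Years))

# match() returns, for every row, the position of its Years value
# within time_points, which serves as an integer time index
dat$Yrs_count <- match(dat$Years, time_points)

# Alternative: build a small lookup table and join it on the shared column.
# merge() repeats the lookup row for every matching row of the larger table.
lookup <- data.frame(Years = time_points,
                     Yrs_count = seq_along(time_points))
dat_merged <- merge(dat[, c("Years", "Obs")], lookup, by = "Years")
```

Either form gives each row of the larger dataset its sequential period number, regardless of how many observations each period has.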
But this is just an intermediary step in the process.... Some of you might
suggest an entirely different route.
For now, I can manually create this new time index:
dat2 <- structure(list(Years= c(0, 0, 0, 0.36, 0.36, 0.36, 0.67, 0.67,
0.67, 0.74, 0.74, 0.74),
Obs = c(0, 0, 0, 2.3, 1.9, 2.1, 4.5, 4.5, 4.6, 5.3, 5.5, 5.6),
Yrs_count = c(1, 1, 1, 2, 2, 2, 3, 3, 3, 4, 4, 4)),
.Names = c("Years","Obs","Yrs_count"), row.names =
c(NA, -12L), class = "data.frame")
The next question is how can I index temporary files in a loop that I use
for extracting the needed data? I thought I might need two loops: one to
identify the length of the time series, the other to accumulate the data
from T1 through the identified end point - maybe something like:
for (i in seq_along(Yrs_count)) {
  for (j in 1:i) {
    keyj <- dat2[, 3] == j
    dat2j <- dat2[keyj, ]
    # here is where I want to create a temporary file to accumulate the
    # different dat2j's I create in this inside loop
  }
  # here is where I want to save the file for future use in my regressions
}
I hope this example is clear enough. My apologies if it isn't - and I
thank the R community for any ideas, tips, or directions to information
that might be helpful.
Best,
-Philippe
--
Philippe Hensel, PhD
NOAA National Geodetic Survey
NGS ECO <http://www.ngs.noaa.gov/web/science_edu/ecosystems_climate/>
N/NGS2 SSMC3 #8859
1315 East-West Hwy
Silver Spring MD 20910
(301) 713 3198 x 137
Philippe,
In your example, you have four unique values for Yrs (I had to change your
code a little to get it to run, so I have the modified version with my
code below), and those values are what you are referring to when you say
T1, T2, T3, T4, right? If I follow what you want to do, the code below
should help. I opted to just save the regression results rather than
saving each of the subsetted data sets. Of course, you can modify the
code to save whatever you want. Hope this helps.
# input data
set.data <- structure(list(
    Yrs = c(0, 0, 0, 0.36, 0.36, 0.36, 0.67, 0.67, 0.67, 0.74, 0.74, 0.74),
    Obs = c(0, 0, 0, 2.3, 1.9, 2.1, 4.5, 4.5, 4.6, 5.3, 5.5, 5.6)),
    .Names = c("Yrs", "Obs"), row.names = c(NA, -12L),
    class = "data.frame")

# determine the unique years
Years <- unique(set.data$Yrs)

# create an empty list with a length one less than the number of unique years
regressions <- vector("list", length(Years) - 1)

# for time periods T2, T3, T4, fit a regression to T1:Ti and save the results
# to the regressions list just created
for (i in 2:length(Years)) {
    dati <- set.data[set.data$Yrs <= Years[i], ]
    regressions[[i - 1]] <- lm(Obs ~ Yrs, data = dati)
}
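If the subsetted data sets or a summary of the fits are wanted as well, the same idea can be written with `lapply()`. This is only a sketch under the assumption that the data look like the `set.data` example above; the names `subsets`, `fits`, `coefs`, and `results` are illustrative and not part of the code above.

```r
# Same illustrative data as in the example above
set.data <- data.frame(
  Yrs = c(0, 0, 0, 0.36, 0.36, 0.36, 0.67, 0.67, 0.67, 0.74, 0.74, 0.74),
  Obs = c(0, 0, 0, 2.3, 1.9, 2.1, 4.5, 4.5, 4.6, 5.3, 5.5, 5.6)
)
Years <- sort(unique(set.data$Yrs))

# One cumulative subset per end point T2, T3, ... (T1 alone is skipped)
subsets <- lapply(Years[-1], function(end) set.data[set.data$Yrs <= end, ])

# Fit the same linear model to each cumulative subset
fits <- lapply(subsets, function(d) lm(Obs ~ Yrs, data = d))

# Collect intercept and slope of every fit into one row per end point
coefs <- t(sapply(fits, coef))
results <- data.frame(
  end_year  = Years[-1],
  n_obs     = sapply(subsets, nrow),
  intercept = coefs[, 1],
  slope     = coefs[, 2]
)
```

Keeping the subsets and fits in lists avoids writing temporary files; `results` can be saved with `write.csv()` if a file is needed later.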
Jean