Philippe Hensel
2012-Jul-27 14:30 UTC
[R] How to run regressions over increasing time series
Hello,

I would like to run a series of regressions on my data (response variable over time):

1) regression from T1 to T2
2) regression from T1 through T3
3) regression from T1 through T4, etc.

I have been struggling to find a way to do this through commands, as opposed to cutting up the data manually (my dataset has over 6000 rows/observations).

An illustrative dataset can be created thusly:

    dat <- structure(list(
        Years = c(0, 0, 0, 0.36, 0.36, 0.36, 0.67, 0.67, 0.67, 0.74, 0.74, 0.74),
        Obs = c(0, 0, 0, 2.3, 1.9, 2.1, 4.5, 4.5, 4.6, 5.3, 5.5, 5.6)),
        .Names = c("Years","Obs"), row.names = c(NA, -12L), class = "data.frame")

I was trying to use a loop to create subsets of the data corresponding to the sets of time intervals required (e.g. T1 to T2, T1 through T3, etc.), but I am having trouble generating a new variable to index time (instead of the decimal values). I was figuring that indexing time would allow me to use a loop to generate the required subsets of data.

I can figure out how many time periods I have and assign a sequential number to them:

    Years <- unique(dat$Years)
    Yrs_count <- seq(from = 1, to = length(Years), by = 1)

And then I can combine these into a dataframe:

    Yrs_combo <- cbind(Years,Yrs_count)

However, how do I combine this data frame with my larger dataset, which has a different number of rows?

But this is just an intermediary step in the process.... Some of you might suggest an entirely different route.

For now, I can manually create this new time index:

    dat2 <- structure(list(
        Years = c(0, 0, 0, 0.36, 0.36, 0.36, 0.67, 0.67, 0.67, 0.74, 0.74, 0.74),
        Obs = c(0, 0, 0, 2.3, 1.9, 2.1, 4.5, 4.5, 4.6, 5.3, 5.5, 5.6),
        Yrs_count = c(1, 1, 1, 2, 2, 2, 3, 3, 3, 4, 4, 4)),
        .Names = c("Years","Obs","Yrs_count"), row.names = c(NA, -12L), class = "data.frame")

The next question is how can I index temporary files in a loop that I use for extracting the needed data? I thought I might need two loops: one to identify the length of the time series, the other to accumulate the data from T1 through the identified end point - maybe something like:

    for (i in 1:length(Yrs_count)) {
        for (j in 1:i) {
            keyj <- dat2[,3]==j
            dat2j <- dat2[keyj,]
            # here is where I want to create a temporary file to accumulate the
            # different dat2j's I create in this inside loop
        }
        # here is where I want to save the file for future use in my regressions
    }

I hope this example is clear enough. My apologies if it isn't - and I thank the R community for any ideas, tips, or directions to information that might be helpful.

Best,

-Philippe

--
Philippe Hensel, PhD
NOAA National Geodetic Survey
NGS ECO <http://www.ngs.noaa.gov/web/science_edu/ecosystems_climate/>
N/NGS2 SSMC3 #8859
1315 East-West Hwy
Silver Spring MD 20910
(301) 713 3198 x 137
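On the intermediate question of attaching a sequential time index to every row of the larger dataset, a minimal sketch follows. It assumes the illustrative `dat` from the question, and that the time values in the lookup table are exactly the values stored in the data (both come from the same column here). The name `dat_merged` is introduced only for illustration. `match()` and `merge()` are both base R; either should give the same index.

```r
# Illustrative data from the question
dat <- data.frame(
  Years = c(0, 0, 0, 0.36, 0.36, 0.36, 0.67, 0.67, 0.67, 0.74, 0.74, 0.74),
  Obs   = c(0, 0, 0, 2.3, 1.9, 2.1, 4.5, 4.5, 4.6, 5.3, 5.5, 5.6)
)

# Lookup table: one row per unique time value, with a sequential index
Yrs_combo <- data.frame(Years = sort(unique(dat$Years)))
Yrs_combo$Yrs_count <- seq_along(Yrs_combo$Years)

# Option 1: merge() repeats the index for every row sharing a Years value
# (dat_merged is a hypothetical name; note that merge() sorts by the key)
dat_merged <- merge(dat, Yrs_combo, by = "Years")

# Option 2: match() gives, for each row, the position of its Years value
# in the lookup table, which is the same index without building a new frame
dat$Yrs_count <- match(dat$Years, Yrs_combo$Years)
```

With the index in place, the data from T1 through the i-th time point is simply `dat[dat$Yrs_count <= i, ]`, so no nested loop or temporary file is needed to accumulate the subsets.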
Philippe,

In your example, you have four unique values for Yrs (I had to change your code a little to get it to run, so I have the modified version with my code below), and those values are what you are referring to when you say T1, T2, T3, T4, right?

If I follow what you want to do, the code below should help. I opted to just save the regression results rather than saving each of the subsetted data sets. Of course, you can modify the code to save whatever you want.

Hope this helps.

    # input data
    set.data <- structure(list(
        Yrs = c(0, 0, 0, 0.36, 0.36, 0.36, 0.67, 0.67, 0.67, 0.74, 0.74, 0.74),
        Obs = c(0, 0, 0, 2.3, 1.9, 2.1, 4.5, 4.5, 4.6, 5.3, 5.5, 5.6)),
        .Names = c("Yrs","Obs"), row.names = c(NA, -12L), class = "data.frame")

    # determine the unique years
    Years <- unique(set.data$Yrs)

    # create an empty list with a length one less than the number of unique years
    regressions <- vector("list", length(Years)-1)

    # for time periods T2, T3, T4, fit a regression to T1:Ti and save the results
    # to the regressions list just created
    for(i in 2:length(Years)) {
        dati <- set.data[set.data$Yrs<=Years[i], ]
        regressions[[i-1]] <- lm(Obs ~ Yrs, data=dati)
    }

Jean
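As a possible follow-up to the reply's approach, the stored fits can be summarised in one table. The sketch below assumes the same `set.data` as in the reply and uses `lapply()` in place of the `for` loop. The names `end_points` and `summary_table`, and the `T1_to_T2`-style labels, are made up for illustration. The time values are sorted first, on the assumption that the rows of a real dataset may not arrive in time order.

```r
# Same illustrative data as in the reply
set.data <- data.frame(
  Yrs = c(0, 0, 0, 0.36, 0.36, 0.36, 0.67, 0.67, 0.67, 0.74, 0.74, 0.74),
  Obs = c(0, 0, 0, 2.3, 1.9, 2.1, 4.5, 4.5, 4.6, 5.3, 5.5, 5.6)
)

# Sorted unique time points; every point after the first is an end point
Years <- sort(unique(set.data$Yrs))
end_points <- Years[-1]

# One regression per end point, each fitted to all rows from T1 up to it
regressions <- lapply(end_points, function(t_end) {
  lm(Obs ~ Yrs, data = set.data[set.data$Yrs <= t_end, ])
})
names(regressions) <- paste0("T1_to_T", seq_along(end_points) + 1)

# Collect the end point, sample size, and fitted slope of each regression
summary_table <- data.frame(
  end_year = end_points,
  n_obs    = sapply(regressions, nobs),
  slope    = sapply(regressions, function(fit) coef(fit)[["Yrs"]])
)
```

Individual fits remain available by name, for example `summary(regressions[["T1_to_T3"]])`.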