Josef.Kardos at phila.gov
2010-Dec-16 20:40 UTC
[R] moving average with gaps in time series
I have a time series with an interval of 2.5 minutes, i.e. 24 observations per
hour. I am trying to find a 1-hour moving average, looking backward, so
that the moving average at row n = mean(rows (n-23):n).
The time series has about 1.5 million rows, with occasional gaps due to
poor data quality. I only want to take the 1-hour moving average for those
periods that are complete, i.e. that have 24 observations in the previous hour.
The data is in 3 columns:
Value  DateTime  interval
For example:

Value <- rnorm(100, 50, 3)   # my data has 1.5 million rows; using 100 here for a simple example
DateTime <- seq(from = 915148800, to = 915156150, by = 150)   # time steps 1:50 at 150-second intervals
DateTime[51] <- 915156450    # skip one time step
DateTime[52:100] <- seq(from = 915156600, to = 915163800, by = 150)   # resume time steps of 150 seconds
x <- cbind(Value, DateTime)
x <- as.data.frame(x)
x$DateTime <- as.POSIXct(x$DateTime, origin = "1970-01-01", tz = "GMT")

x1 <- x[-c(1:23), ]   # trim x so DateTimes in x and x1 can be compared directly
x[, 3] <- difftime(x1[, 2], x[, 2], units = "mins")   # ignore the recycling warning (x1 is shorter than x)
colnames(x)[3] <- "interval"
x[24:nrow(x), 3] <- x[1:(nrow(x) - 23), 3]   # interval = number of minutes between rows n-23 and n
# 57.5 (23 regular steps of 2.5 min) indicates no gaps in the previous hour up to and including n;
# > 57.5 indicates a gap somewhere in the previous 23 rows
x[1:23, 3] <- NaN   # NaN for the first 23 rows, so no average is taken over the first (incomplete) hour
# As expected, rows 51:73 show interval > 57.5, i.e. their windows include the gap

index <- which(x[, 3] == 57.5)   # rows with no gaps in the previous hour

# Loop to calculate the 1-hour moving average, only for periods that are complete
for (i in 1:length(index)) {
  x[index[i], 4] <- mean(x[(index[i] - 23):index[i], 1])
}
# This loop works on this simple example, but it takes a VERY long time to
# run on x with 1.5 million rows: over an hour of running and still not
# complete. I also tried increasing memory.limit to 4095, but it is still very slow.

Any suggestions to make this run faster? I thought about using the lag
function but could not get it to work.
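Roughly, the kind of vectorized approach I have in mind (untested, and movavg is just a placeholder column name) would use stats::filter with 24 equal weights for the backward-looking mean, and the interval column built above to mark which windows are complete:

ma <- stats::filter(x[, 1], rep(1/24, 24), sides = 1)   # mean of rows (n-23):n; NA for the first 23 rows
ma[is.na(x$interval) | x$interval != 57.5] <- NA        # blank out windows that span a gap
x$movavg <- as.numeric(ma)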
I'm not completely sure what your test setup does, but wouldn't it be
simpler to take a moving average at every point, setting the window
dynamically to cover the previous hour?
In pseudocode, for the j-th sample:
meanval[j] <- mean(Value[(DateTime[j] - DateTime) >= 0 & (DateTime[j] - DateTime) < 3600])
where I'm just letting DateTime values be in seconds.
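Something along these lines, say (untested; DateTime taken as numeric seconds, Value as the observations):

meanval <- vapply(seq_along(DateTime), function(j) {
  back <- DateTime[j] - DateTime            # seconds looking backward from sample j
  mean(Value[back >= 0 & back < 3600])      # every sample in the previous hour, including j itself
}, numeric(1))

For 1.5 million rows you would want to limit the search to nearby indices rather than scanning the whole vector for every j, but the idea is the same.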
If you really want to reject all averages that would have 'looked back'
over a missing sample, I'd play with rle() to find the bad points and
use that to reduce the meanval vector I described above.
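For example (this sketch uses a cumulative sum of gap flags rather than rle() itself, but the effect is the same): flag every sample whose preceding step is not the regular 150 s, then drop any hourly mean whose 24-sample window contains a flagged step.

gap <- c(TRUE, diff(DateTime) != 150)        # TRUE where the step into this sample is irregular (first entry is a placeholder)
cg  <- cumsum(gap)
complete <- c(rep(FALSE, 23),
              cg[24:length(cg)] == cg[1:(length(cg) - 23)])   # no flagged step inside rows (n-23):n
meanval[!complete] <- NA                     # reject averages that look back over a missing sample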
Carl
Hi Josef,

is there any particular reason why you want to use your own function? Have a look at the stats functions or the special time-series functions for R; I am sure you will find something that calculates an ordinary moving average (and a bunch of fancier stuff). 'filter' (stats), for example, can be used for moving averages. You could also use 'rollmean' from package zoo. I am not sure how to handle the missing values here, though (both most probably work with NAs). You could try 'rollapply' from package zoo with a function that sums the number of valid values in the window and returns TRUE if all are valid and FALSE if they are not. I am not sure whether this is faster than your solution, though.

Are you sure you want to compute a moving average with your missing-value specification? Or do you just want to aggregate your values to hourly means, dropping those hours where not all values are valid? (In that case 'aggregate' would be your function.)

HTH
Jannis
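P.S. A minimal sketch of the zoo route, using the Value and DateTime vectors from your example (untested, just to show the shape of it): regularize the series onto the 150 s grid first, so that missing time steps become explicit NAs; a right-aligned window of 24 observations then always spans exactly one hour, and any window touching a gap comes out as NA.

library(zoo)

tms  <- as.POSIXct(DateTime, origin = "1970-01-01", tz = "GMT")
z    <- zoo(Value, order.by = tms)                            # irregular series
grid <- seq(start(z), end(z), by = 150)                       # full 150 s grid
reg  <- merge(z, zoo(, grid))                                 # missing time steps become NA
ma   <- rollapplyr(reg, width = 24, FUN = mean, fill = NA)    # backward-looking hourly mean;
                                                              # any window containing an NA gives NA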