The acf and ccf functions assume that time series are stationary, but yours are
not.
I think that your alternative function is not well founded. You take a separate
mean for each sub-series, which implicitly allows the mean of the series to vary
arbitrarily with time. However, you only have one instance of each time series
so you can't have a non-parametric model for the means.
A parametric approach is to remove a linear trend from the series and then apply
ccf:
e1 <- residuals(lm(y1 ~ x))
e2 <- residuals(lm(y1 ~ x))
ccf(e1, e2)
which does identify a lag of -8 as the best match, although the correlation is
somewhat lower than what you found (0.87).
Martyn
________________________________________
From: r-devel-bounces at r-project.org [r-devel-bounces at r-project.org] on
behalf of Sami Toppinen [sami.toppinen at kolumbus.fi]
Sent: 04 November 2014 09:44
To: r-devel at r-project.org
Subject: [Rd] [R] Calculation of cross-correlation in ccf
Dear All,
I am studying some process measurement time series in R and trying to identify
time delays using cross-correlation function ccf. The results have however been
bit confusing. I found a couple of years old message about this issue but
unfortunately wasn't able to find it again for a reference.
For example, an obvious time shift is observed between the measurements y1 and
y2 when the following test data is plotted:
x <- 1:121
y1 <- c(328.27, 328.27, 328.27, 328.27, 328.21, 328.14, 328.14, 328.01,
328.07, 328.01, 327.87, 328.01, 328.07, 328.27, 328.27, 328.54, 328.61,
328.74, 328.88, 329.01, 329.01, 329.21, 329.28, 329.35, 329.35, 329.42,
329.35, 329.28, 329.28, 329.15, 329.08, 329.08, 328.95, 328.95, 328.95,
328.95, 329.01, 329.15, 329.21, 329.28, 329.55, 329.62, 329.75, 329.82,
329.89, 330.09, 330.09, 330.29, 330.29, 330.36, 330.42, 330.29, 330.15,
330.22, 330.09, 329.95, 329.82, 329.75, 329.62, 329.55, 329.48, 329.55,
329.68, 329.75, 329.82, 329.89, 330.09, 330.09, 330.15, 330.22, 330.42,
330.42, 330.42, 330.36, 330.42, 330.22, 330.15, 330.09, 329.89, 329.75,
329.55, 329.35, 329.35, 329.42, 329.48, 329.55, 329.75, 329.75, 329.82,
330.09, 330.15, 330.42, 330.42, 330.62, 330.69, 330.69, 330.83, 330.83,
330.76, 330.62, 330.62, 330.56, 330.42, 330.42, 330.29, 330.29, 330.29,
330.29, 330.22, 330.49, 330.56, 330.62, 330.76, 331.03, 330.96, 331.16,
331.23, 331.50, 331.63, 332.03, 332.03)
y2 <- c(329.68, 329.75, 329.82, 329.95, 330.02, 330.15, 330.22, 330.36,
330.22, 330.29, 330.29, 330.29, 330.29, 330.15, 330.22, 330.22, 330.15,
330.15, 330.15, 330.15, 330.15, 330.29, 330.49, 330.49, 330.62, 330.89,
331.03, 331.09, 331.16, 331.30, 331.30, 331.36, 331.43, 331.43, 331.43,
331.36, 331.36, 331.36, 331.36, 331.23, 331.23, 331.16, 331.16, 331.23,
331.30, 331.23, 331.36, 331.56, 331.70, 331.83, 331.97, 331.97, 332.10,
332.30, 332.44, 332.44, 332.51, 332.51, 332.57, 332.57, 332.51, 332.37,
332.24, 332.24, 332.10, 331.97, 331.97, 331.90, 331.83, 331.97, 331.97,
331.97, 332.03, 332.24, 332.30, 332.30, 332.37, 332.57, 332.57, 332.57,
332.57, 332.57, 332.51, 332.37, 332.30, 332.17, 331.97, 331.83, 331.70,
331.70, 331.63, 331.63, 331.70, 331.83, 331.90, 332.10, 332.24, 332.30,
332.44, 332.57, 332.71, 332.84, 332.77, 332.91, 332.84, 332.84, 332.91,
332.84, 332.77, 332.77, 332.64, 332.64, 332.57, 332.57, 332.57, 332.57,
332.57, 332.71, 332.98, 333.24, 333.58)
matplot(cbind(y1, y2))
However, the cross-correlation function surprisingly gives the maximum
correlation 0.83 at zero lag:
ccf(y1, y2)
Plotting of variables against each other with zero lag
plot(y1, y2)
shows that the correlation is not that good. Instead, a very nice correlation is
obtained with a plot with shifted variables:
plot(y1[1:113], y2[1:113 + 8])
As a comparison I defined my own cross correlation function:
cross.cor <- function(x, y, k)
{
n <- length(x) - abs(k)
if (k >= 0)
cor(x[1:n + k], y[1:n])
else
cor(x[1:n], y[1:n - k])
}
The new function cross.cor gives maximum correlation of 0.99 at lag -8, and the
shape of the correlation function is very different from the one obtained with
ccf (both methods give same value at zero lag):
plot(-15:15, sapply(-15:15, function(lag) cross.cor(y1, y2, lag)),
ylim = c(0.3, 1))
points(-15:15, ccf(y1, y2, 15, plot = FALSE)$acf, col = "red")
In order to understand the reason for the differences, I looked at the program
codes of ccf function. When the variables are compared with a nonzero lag, some
the data points must be left out from the tails of the time series. It appears
to me that ccf calculates (partly in R code, partly in C code) first the means
and the standard deviations using the whole data, and then uses those values as
constants in further calculations. The time series are truncated only in the
summations for covariances.
That approach naturally speeds up the computations, but is it really correct? Is
the approach used in ccf somehow statistically more correct? I suppose the
strong increasing trend in my test data emphasizes this issue (leaving data
points out from one end has big impact on the average).
Best regards,
Sami Toppinen
sami.toppinen at kolumbus.fi
______________________________________________
R-devel at r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-devel-----------------------------------------------------------------------
This message and its attachments are strictly confidenti...{{dropped:8}}