Tim Dorscheidt
2012-May-11 11:06 UTC
[R] Possible artifacts in cross-correlation function ("ccf")?
Dear R-users,

I have been using R and its core-packages with great satisfaction now for many years, and have recently started using the "ccf" function (part of the "stats" package version 2.16.0), about which I have a question.

The "ccf"-algorithm for calculating the cross-correlation between two time series always calculates the mean and standard deviation per time series beforehand, thereby using a constant value for these irrespective of any time-lag. Another piece of statistical software that I'm using, a toolbox in Matlab, does this in a fundamentally different way. It first "chops off" the parts of the time-series that do not overlap when a time-lag has been introduced, and then calculates a new mean and standard deviation to be used for further calculations. This latter method has the advantage of always theoretically still being able to obtain a cross-correlation of 1 (or -1), whereas the "ccf"-method of R seems to introduce zeros at the non-overlapping parts of the time-series, thereby preventing this possibility and producing very different results. Take for instance the two time series: a = {1,3,2} and b = {3,2,1}. The query "ccf(a,b)" produces the output {-0.5, -0.5, 0.5}, but I would think that a time-lag of -1 should produce a cross-correlation here of 1, since the two time series will overlap with identical parts {3,2}.

I have attached clean implementations (removing all dependencies) of how the R algorithm seems to calculate cross-correlations with time-lag (it produces identical results to "ccf"), and how this other method (in Matlab) calculates it (with newly calculated means and standard deviation for each time-lag).

Could someone be so kind as to explain to me why the "ccf"-algorithm has this specific implementation that seems to, at least for specific situations, produce results with artifacts? It is very likely that the R-implementation, as opposed to the alternative algorithm described above and in the attachment, has a very good statistical explanation, but one that unfortunately is not dawning on me.

Sincerely,
Tim Dorscheidt
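For reference, the two calculations described in the post can be sketched as follows. This is an independent illustration in Python, not the attachment mentioned above (which is not part of this archive). The function names are hypothetical, and the sign convention is an assumption: lag k pairs a[t+k] with b[t], which is the convention documented for R's ccf and which reproduces the quoted output {-0.5, -0.5, 0.5} for lags -1, 0, 1.

```python
from math import sqrt

def ccf_fixed_moments(x, y, lag):
    """Correlation of x[t+lag] with y[t], using the means and standard
    deviations of the FULL series (divisor n) at every lag, as the post
    describes for R's ccf. Hypothetical helper, not R's source code."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    dx = [v - mx for v in x]
    dy = [v - my for v in y]
    # Only overlapping pairs contribute; missing pairs effectively add zero.
    cross = sum(dx[t + lag] * dy[t] for t in range(n) if 0 <= t + lag < n)
    # The common divisor n cancels between numerator and denominator.
    return cross / sqrt(sum(d * d for d in dx) * sum(d * d for d in dy))

def ccf_overlap_only(x, y, lag):
    """Pearson correlation computed only on the overlapping segments, with
    mean and standard deviation recomputed per lag (the 'chop off' method).
    Raises ZeroDivisionError if an overlapping segment is constant."""
    n = len(x)
    pairs = [(x[t + lag], y[t]) for t in range(n) if 0 <= t + lag < n]
    m = len(pairs)
    mx = sum(p[0] for p in pairs) / m
    my = sum(p[1] for p in pairs) / m
    dx = [p[0] - mx for p in pairs]
    dy = [p[1] - my for p in pairs]
    cross = sum(u * v for u, v in zip(dx, dy))
    return cross / sqrt(sum(u * u for u in dx) * sum(v * v for v in dy))

a = [1.0, 3.0, 2.0]
b = [3.0, 2.0, 1.0]
lags = [-1, 0, 1]

fixed = [ccf_fixed_moments(a, b, k) for k in lags]
overlap = [ccf_overlap_only(a, b, k) for k in lags]
print("fixed moments:", fixed)    # [-0.5, -0.5, 0.5]
print("overlap only: ", overlap)  # [-1.0, -0.5, 1.0]
```

Under this convention the identical overlap {3,2} is found at lag +1 rather than -1, where the fixed-moment version gives 0.5 and the overlap-only version gives 1. With only two overlapping points, the overlap-only correlation is always exactly 1 or -1 (or undefined), regardless of the data.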
Duncan Murdoch
2012-May-11 13:02 UTC
[R] Possible artifacts in cross-correlation function ("ccf")?
On 11/05/2012 7:06 AM, Tim Dorscheidt wrote:
> Dear R-users,
>
> I have been using R and its core-packages with great satisfaction now for many years, and have recently started using the "ccf" function (part of the "stats" package version 2.16.0), about which I have a question.
>
> The "ccf"-algorithm for calculating the cross-correlation between two time series always calculates the mean and standard deviation per time series beforehand, thereby using a constant value for these irrespective of any time-lag. Another piece of statistical software that I'm using, a toolbox in Matlab, does this in a fundamentally different way. It first "chops off" the parts of the time-series that do not overlap when a time-lag has been introduced, and then calculates a new mean and standard deviation to be used for further calculations. This latter method has the advantage of always theoretically still being able to obtain a cross-correlation of 1 (or -1), whereas the "ccf"-method of R seems to introduce zeros at the non-overlapping parts of the time-series, thereby preventing this possibility and producing very different results. Take for instance the two time series: a = {1,3,2} and b = {3,2,1}. The query "ccf(a,b)" produces the output {-0.5, -0.5, 0.5}, but I would think that a time-lag of -1 should produce a cross-correlation here of 1, since the two time series will overlap with identical parts {3,2}.
>
> I have attached clean implementations (removing all dependencies) of how the R algorithm seems to calculate cross-correlations with time-lag (it produces identical results to "ccf"), and how this other method (in Matlab) calculates it (with newly calculated means and standard deviation for each time-lag).
>
> Could someone be so kind as to explain to me why the "ccf"-algorithm has this specific implementation that seems to, at least for specific situations, produce results with artifacts? It is very likely that the R-implementation, as opposed to the alternative algorithm described above and in the attachment, has a very good statistical explanation, but one that unfortunately is not dawning on me.

I haven't looked at the ccf code (and your attachment didn't make it through), but I would guess from your description that ccf produces positive semi-definite covariances, and the Matlab routine does not. That means that if you use the estimated covariances to compute variances of linear combinations of terms, you may be able to get negative answers from the Matlab routine. Sometimes this matters.

Duncan Murdoch
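The point about positive semi-definiteness can be illustrated with a small sketch. It simplifies the setting to the autocorrelation of a single series, which is an assumption made here for brevity; the series x, the weights w, and the function names are all hypothetical and chosen only for illustration. The quadratic form w'Rw is the variance (in correlation units) that the estimated correlations imply for the linear combination x[t] + x[t+1] + x[t+2], so a negative value is the kind of "negative answer" described above. Whether any particular Matlab toolbox behaves this way is not confirmed by this thread.

```python
from math import sqrt

def acf_fixed_moments(x, lag):
    """Autocorrelation at `lag` using the full-series mean and divisor n
    (the fixed-moment estimator; known to give a positive semi-definite
    correlation sequence)."""
    n = len(x)
    m = sum(x) / n
    d = [v - m for v in x]
    return sum(d[t + lag] * d[t] for t in range(n - lag)) / sum(v * v for v in d)

def acf_overlap_only(x, lag):
    """Pearson correlation between x[lag:] and x[:n-lag], with the mean and
    standard deviation recomputed on each overlapping segment."""
    n = len(x)
    xs, ys = x[lag:], x[:n - lag]
    mx, my = sum(xs) / len(xs), sum(ys) / len(ys)
    dx = [v - mx for v in xs]
    dy = [v - my for v in ys]
    cross = sum(u * v for u, v in zip(dx, dy))
    return cross / sqrt(sum(u * u for u in dx) * sum(v * v for v in dy))

def toeplitz_quadratic_form(r, w):
    """w' R w for the Toeplitz matrix with entries R[i][j] = r[|i - j|]."""
    k = len(w)
    return sum(w[i] * w[j] * r[abs(i - j)] for i in range(k) for j in range(k))

x = [1.0, 3.0, 4.0, 2.0]   # hypothetical data, chosen to expose the effect
w = [1.0, 1.0, 1.0]        # weights of the linear combination

r_fixed = [acf_fixed_moments(x, k) for k in range(3)]
r_overlap = [acf_overlap_only(x, k) for k in range(3)]

q_fixed = toeplitz_quadratic_form(r_fixed, w)
q_overlap = toeplitz_quadratic_form(r_overlap, w)
print("implied variance, fixed moments:", q_fixed)
print("implied variance, overlap only: ", q_overlap)
```

For this series the fixed-moment estimator gives an implied variance of roughly 1.4, while the overlap-only estimator gives a negative value of roughly -0.31, which no real variance can be.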