Bliese, Paul D LTC USAMH
2006-Aug-24 15:06 UTC
[R] Why are lagged correlations typically negative?
Recently, I was working with some lagged designs in which a vector of observations at one time was used to predict a vector of observations at another time using a lag-1 design. In the work, I noticed a lot of negative correlations, so I ran a simple simulation with 2 matched points. The crude simulation example below shows that the correlation can be -1 or +1, but interestingly, if you run this basic simulation thousands of times, you get negative correlations 66 to 67% of the time. If you simulate three matched observations instead of two, you get negative correlations about 74% of the time, and as you simulate 4 and more observations the proportion of negative correlations asymptotically approaches an even 50/50 split between negative and positive (though even with 100 observations one still has about 54% negative correlations). Creating T1 and T2 so they are related (and not correlated 1 as in the crude simulation) attenuates the effect. A more advanced simulation is provided below for those interested.

Can anyone explain why this occurs in a way a non-mathematician is likely to understand?
Thanks,
Paul

#############
# Crude simulation
#############

> (T1 <- rnorm(3))
[1] -0.1594703 -1.3340677  0.2924988
> (T2 <- c(T1[2:3], NA))
[1] -1.3340677  0.2924988         NA
> cor(T1, T2, use="complete")
[1] -1

> (T1 <- rnorm(3))
[1] -0.84258593 -0.49161602  0.03805543
> (T2 <- c(T1[2:3], NA))
[1] -0.49161602  0.03805543         NA
> cor(T1, T2, use="complete")
[1] 1

###########
# More advanced simulation example
###########

> lags <- function(nobs, nreps, rho=1){
    OUT <- data.frame(NEG=rep(NA, nreps), COR=rep(NA, nreps))
    nran <- nobs + 1  # need to generate 1 more random number than there are observations
    for(i in 1:nreps){
      V1 <- rnorm(nran)
      V2 <- sqrt(1 - rho^2)*rnorm(nran) + rho*V1
      #print(cor(V1,V2))
      V1 <- V1[1:(nran - 1)]
      V2 <- V2[2:nran]
      OUT[i, 1] <- ifelse(cor(V1, V2) <= 0, 1, 0)
      OUT[i, 2] <- cor(V1, V2)
    }
    return(OUT)  # NEG is 1 if the correlation is negative or 0; 0 if positive
  }
> LAGS.2 <- lags(2, 10000)  # Number of observations matched = 2
> mean(LAGS.2)
   NEG    COR
0.6682 -0.3364
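[A minimal sketch, not part of the original post: with 2 matched points the sign of cor(T1, T2) is the sign of (Y - X)*(Z - Y) for the three draws X, Y, Z, and that product is positive only when Y lies strictly between X and Z (X < Y < Z or X > Y > Z), i.e. 2 of the 6 equally likely orderings, which predicts exactly the 66-67% negative rate seen above.]

```r
# Sketch (not from the original post): for the 2-matched-point case,
# cor(T1, T2) is +1 iff the middle draw lies between the other two
# (2 of the 6 equally likely orderings of X, Y, Z),
# so P(negative correlation) = 4/6 = 2/3.
set.seed(1)
x <- matrix(rnorm(3 * 100000), ncol = 3)          # columns are X, Y, Z
neg <- (x[, 2] - x[, 1]) * (x[, 3] - x[, 2]) < 0  # sign of the 2-point correlation
mean(neg)                                         # close to 2/3
```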
On Thu, 24 Aug 2006, Bliese, Paul D LTC USAMH wrote:

> Can anyone explain why this occurs in a way a non-mathematician is
> likely to understand?

Consider the two-points-out-of-three case from the viewpoint of the middle point. The correlation is positive if the previous point is lower and the following point is higher, or vice versa. It is negative if the previous and following points are both higher or both lower. Now, if the middle point is higher than the first point, it is probably higher than average, and so it has a more than 50% chance of also being higher than the third point. Similarly, if it is lower than the first point, it is likely to be lower than the third point. So a negative correlation is more likely than a positive one.

Working out the covariance may be useful even for non-mathematicians. Call the three points X, Y, Z:

  cov(X - Y, Y - Z) = cov(X,Y) - cov(Y,Y) - cov(X,Z) + cov(Y,Z)
                    = 0 - var(Y) - 0 + 0
                    = -var(Y)

     -thomas

Thomas Lumley
Assoc. Professor, Biostatistics
tlumley at u.washington.edu
University of Washington, Seattle
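[A quick empirical check of the identity above, added as a sketch: for iid standard normal X, Y, Z, the covariance of the two successive differences should come out close to -var(Y) = -1.]

```r
# Sketch: empirical check that cov(X - Y, Y - Z) = -var(Y) for iid draws.
set.seed(2)
n <- 200000
X <- rnorm(n); Y <- rnorm(n); Z <- rnorm(n)
cov(X - Y, Y - Z)   # close to -1, i.e. -var(Y)
```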
Gabor Grothendieck
2006-Aug-24 16:02 UTC
[R] Why are lagged correlations typically negative?
The covariance has the same sign as the correlation, so let's calculate the sample covariance of the vector T1 = (X, Y) with T2 = (Y, Z), where the third component of each vector is dropped because of use="complete". Up to the positive factor 1/(n - 1), the sample covariance is

  cov(T1, T2) = XY + YZ - 2 * ((X + Y)/2) * ((Y + Z)/2)

X, Y and Z are random variables, so we take the expectation to get the overall average over many runs. Expectation is linear, the variables have mean zero, and all distinct pairs are uncorrelated, so:

  E[XY] + E[YZ] - E[(X + Y)(Y + Z)]/2
    = E[XY] + E[YZ] - E[XY]/2 - E[XZ]/2 - E[Y^2]/2 - E[YZ]/2
    = -E[Y^2]/2 < 0

where the last line follows because every term in the line above except -E[Y^2]/2 is zero.

On 8/24/06, Bliese, Paul D LTC USAMH <paul.bliese at us.army.mil> wrote:

> Can anyone explain why this occurs in a way a non-mathematician is
> likely to understand?

______________________________________________
R-help at stat.math.ethz.ch mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.
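[A Monte Carlo check of the expectation argument, added as a sketch rather than part of the original thread: averaging the sample covariance of T1 = (X, Y) with T2 = (Y, Z) over many iid standard normal draws should give a negative number; with R's n - 1 divisor the expected value works out to -E[Y^2]/2 = -0.5.]

```r
# Sketch: Monte Carlo check that E[cov(T1, T2)] < 0 when T2 is T1 lagged by 1.
# For iid standard normal X, Y, Z and 2 matched points, the expected sample
# covariance (n - 1 divisor) is -E[Y^2]/2 = -0.5.
set.seed(3)
covs <- replicate(50000, {
  v <- rnorm(3)
  cov(v[1:2], v[2:3])   # T1 = (X, Y), T2 = (Y, Z)
})
mean(covs)              # close to -0.5
```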