I tried to bootstrap the correlation between two variables x1 and x2. The resulting distribution has two distinct peaks; how should I interpret it?

The original code is attached.

Y. C. Tao

----------------

library(boot)

# Statistic for boot(): correlation of the rows selected by the index vector i
my.correl <- function(d, i) cor(d[i, 1], d[i, 2])

x1 <- c(-2.612, -0.7859, -0.5229, -1.246, 1.647, 1.647, 0.1811,
        -0.07097, 0.8711, 0.4323, 0.1721, 2.143, 4.33, 0.5002,
        0.4015, -0.5225, 2.538, 0.07959, -0.6645, 4.521, -1.371,
        0.3327, 25.24, -0.5417, 2.094, 0.6064, -0.4476, -0.5891,
        -0.08879, -0.9487, -2.459e-05, -0.03887, 0.2116, -0.0625, 1.555,
        0.2069, -0.2142, -0.807, -0.6499, 2.384, -0.02063, 1.179,
        -0.0003586, -1.408, 0.6928, 0.689, 0.1854, 0.4351, 0.5663,
        0.07171, -0.07004)

x2 <- c(0.08742, 0.2555, -0.00337, 0.03995, -1.208, -1.208, -0.001374,
        -1.282, 1.341, -0.9069, -0.2011, 1.557, 0.4517, -0.4376,
        0.4747, 0.04965, -0.1668, -0.6811, -0.7011, -1.457, 0.04652,
        -1.117, 6.744, -1.332, 0.1327, -0.1479, -2.303, 0.1235,
        0.5916, 0.05018, -0.7811, 0.5869, -0.02608, 0.9594, -0.1392,
        0.4089, 0.1468, -1.507, -0.6882, -0.1781, 0.5434, -0.4957,
        0.02557, -1.406, -0.5053, -0.7345, -1.314, 0.3178, -0.2108,
        0.4186, -0.03347)

# 2000 bootstrap replicates of the correlation, then their histogram
b <- boot(cbind(x1, x2), my.correl, 2000)
hist(b$t, breaks = 50)

Have you actually looked at plot(x1, x2)? That ought to be quite enlightening.

You have one data point:

      x1     x2
  25.240  6.744

that's way out in the upper right. Every bootstrap sample that includes that point will give a high correlation, and every bootstrap sample that does not include it will give a low (near zero) correlation.

Now, the probability that a given point is included in a bootstrap sample of the 51 observations is 1 - (1 - 1/51)^51, or roughly 63.6%. You can easily see that:

  > mean(b$t>.5)
  [1] 0.6385

Andy

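For anyone who wants to check the inclusion probability quoted above, here is a minimal sketch. It assumes only that n = 51 (the number of pairs in the posted data) and that the outlier sits at index 23; the seed and the number of simulated resamples are arbitrary choices of mine, not part of the original post.

```r
# Probability that one particular observation (e.g. the outlier) appears
# at least once in a bootstrap resample of size n drawn with replacement.
n <- 51                        # number of (x1, x2) pairs in the posted data
p_exact <- 1 - (1 - 1/n)^n     # exact value for this n
p_limit <- 1 - exp(-1)         # limit as n grows large

# Monte Carlo check: how often does index 23 show up in a resample?
set.seed(1)                    # arbitrary seed, for reproducibility only
R <- 20000                     # arbitrary number of simulated resamples
hits <- replicate(R, 23 %in% sample.int(n, n, replace = TRUE))
p_sim <- mean(hits)

# Compare the exact value, the large-n limit, and the simulated frequency
print(round(c(exact = p_exact, limit = p_limit, simulated = p_sim), 4))
```

The exact value is about 0.636 and the limit about 0.632, so the observed fraction of replicates above 0.5 (0.6385) is in line with the fraction of resamples that contain the outlier.
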
Hi!

Simply plot(x1,x2): you will see that there is one point (number 23) at (x1,x2) = (25.24,6.744) which is a very long way from all the other points (which, among themselves, form a somewhat diffuse cluster with some suggestion of further structure).

When you bootstrap, the correlation you obtain in any sample will depend on whether or not this outlying point is included in the sample. If it is included, this single point will generate a relatively high value of the correlation coefficient simply because it is such a long way from all the others (i.e. it is highly influential). If it is not included, then the diffuse character of the other points will generate a very low value of the correlation coefficient.

  > cor(x1,x2)
  [1] 0.7471931
  > cor(x1[-23],x2[-23])
  [1] 0.03914653

Therefore your bootstrap distribution will have two peaks: one peak, around 0.75, corresponding to the bootstrap samples which include this outlying point, and the other, around 0, corresponding to the bootstrap samples which do not include it.

This is the explanation and, at the same time, the interpretation.

Best wishes,
Ted.

On 11-Jul-04 Y C Tao wrote:
> I tried to bootstrap the correlation between two
> variables x1 and x2. The resulting distribution has
> two distinct peaks, how should I interpret it?
>
> The original code is attached.
>
> Y. C. Tao
>
> ----------------
>
> library(boot);
>
> my.correl<-function(d, i) cor(d[i,1], d[i,2])
>
> x1<-c(-2.612,-0.7859,-0.5229,-1.246,1.647,1.647,0.1811,
>       -0.07097,0.8711,0.4323,0.1721,2.143,4.33,0.5002,
>       0.4015,-0.5225,2.538,0.07959,-0.6645,4.521,-1.371,
>       0.3327,25.24,-0.5417,2.094,0.6064,-0.4476,-0.5891,
>       -0.08879,-0.9487,-2.459e-05,-0.03887,0.2116,-0.0625,1.555,
>       0.2069,-0.2142,-0.807,-0.6499,2.384,-0.02063,1.179,
>       -0.0003586,-1.408,0.6928,0.689,0.1854,0.4351,0.5663,
>       0.07171,-0.07004);
>
> x2<-c(0.08742,0.2555,-0.00337,0.03995,-1.208,-1.208,-0.001374,
>       -1.282,1.341,-0.9069,-0.2011,1.557,0.4517,-0.4376,
>       0.4747,0.04965,-0.1668,-0.6811,-0.7011,-1.457,0.04652,
>       -1.117,6.744,-1.332,0.1327,-0.1479,-2.303,0.1235,
>       0.5916,0.05018,-0.7811,0.5869,-0.02608,0.9594,-0.1392,
>       0.4089,0.1468,-1.507,-0.6882,-0.1781,0.5434,-0.4957,
>       0.02557,-1.406,-0.5053,-0.7345,-1.314,0.3178,-0.2108,
>       0.4186,-0.03347);
>
> b<-boot(cbind(x1, x2), my.correl, 2000)
> hist(b$t, breaks=50)

[The above rearranged to have 7 values in each complete line]

--------------------------------------------------------------------
E-Mail: (Ted Harding) <Ted.Harding at nessie.mcc.ac.uk>
Fax-to-email: +44 (0)870 167 1972
Date: 11-Jul-04                                       Time: 10:40:34
------------------------------ XFMail ------------------------------
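
The two-peak explanation can be checked directly by recording, for each bootstrap replicate, whether the outlying point was drawn. The sketch below is only an illustration under stated assumptions: it uses made-up stand-in data (50 unrelated normal values plus one extreme point, stored at index 51) rather than the posted x1 and x2, and a hand-rolled resampling loop in base R rather than boot(). The names x, y, out, r_boot and has_out are hypothetical.

```r
# Hypothetical stand-in data: 50 unrelated points plus one extreme point.
set.seed(42)                   # arbitrary seed, for reproducibility only
x <- c(rnorm(50), 25)
y <- c(rnorm(50), 7)
out <- 51                      # index of the extreme point in this toy data
n <- length(x)

R <- 2000                      # number of bootstrap replicates
r_boot  <- numeric(R)          # bootstrap correlations
has_out <- logical(R)          # TRUE if the extreme point was resampled

for (k in seq_len(R)) {
  i <- sample.int(n, n, replace = TRUE)   # resample row indices
  r_boot[k]  <- cor(x[i], y[i])
  has_out[k] <- out %in% i
}

# Share of replicates containing the extreme point, and the mean
# correlation within each of the two groups of replicates
print(mean(has_out))
print(tapply(r_boot, has_out, mean))
```

Drawing hist(r_boot[has_out]) and hist(r_boot[!has_out]) separately should show one peak each, which is the bimodal histogram taken apart. With the real data the same split could be made by replacing x, y and out with x1, x2 and 23.
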