Dear All,

I'm looking for examples showing that correlation does not imply
causality. The target audience consists of undergraduate students
(in their first year at the university, but in the BioMathStat track).
All practicals are under R.

I was able to extract this from the R datasets:

### begin
data(sunspots)
data(lynx)
# sunspots is monthly; frequency = 1 keeps one value per year
spots <- window(sunspots, frequency = 1, start = 1880, end = 1900)
lnx <- window(lynx, start = 1880, end = 1900)
# scale factor used to draw both series on the same plot
ratio <- max(lnx)/max(spots)
par(mai = rep(1, 4))
plot(lnx, main = "Sun spots intensity\nand lynx population density",
     type = "b", ylab = "")
lines(ratio*spots, col = "red", type = "b")
# right-hand axis labelled in sunspot units
axis(side = 4, col = "red", col.axis = "red",
     at = ratio*pretty(spots), labels = pretty(spots))
legend(1887, 4500, col = c("red", "black"),
       legend = c("spots", "lynx"), pch = 21)
### end

Shouldn't I try to publish this in Nature or Science with a title
that goes "Solar activity increases libido in lynx populations"?
Note the nice shift between the two curves corresponding to the
gestation time!

A brief look at the whole dataset demonstrates that this is
definitely wrong:

### begin
spots <- window(sunspots, frequency = 1, start = 1821, end = 1934)
ratio <- max(lynx)/max(spots)
plot(lynx, main = "Sun spots intensity\nand lynx population density",
     ylab = "")
lines(ratio*spots, col = "red")
axis(side = 4, col = "red", col.axis = "red",
     at = ratio*pretty(spots), labels = pretty(spots))
legend(1870, 6000, col = c("red", "black"),
       legend = c("spots", "lynx"), lty = 1)
### end

So, I'm looking for similar examples; any hint would be greatly
appreciated. Basically, I'm looking for correlations between
completely unrelated phenomena, so that the overinterpreted causal
relationships look stupid at first glance (and the funnier the link,
the better).

BTW, does someone know where to find the data for this example
of a high correlation between beer drinking in the US and child
mortality in Japan (or something like that; I'm unsure, and I have
googled around with these keywords but found nothing)?

All the best, and many thanks to the whole R team for providing us
with such a nice tool.

Jean
--
Jean R. Lobry (lobry at biomserv.univ-lyon1.fr)
Laboratoire BBE-CNRS-UMR-5558, Univ. C. Bernard - LYON I,
43 Bd 11/11/1918, F-69622 VILLEURBANNE CEDEX, FRANCE
allo : +33 472 43 12 87 fax : +33 472 43 13 88
http://pbil.univ-lyon1.fr/members/lobry/
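
The visual impression from the two plots above can also be put into numbers. The following is only a minimal sketch: it computes the plain lag-0 Pearson correlation between yearly sunspot values and lynx counts, once on the hand-picked 1880-1900 window and once on the whole lynx record. The helper name yearly_cor is hypothetical and not part of the original post. If the apparent link is an artifact of the chosen window, one would expect the full-record coefficient to be noticeably weaker, but the numbers should be checked by running the code.

```r
# Both series ship with base R (package 'datasets').
# sunspots is monthly, so frequency = 1 keeps one value per year,
# as in the plotting code of the original message.
yearly_cor <- function(from, to) {   # hypothetical helper name
  spots <- window(sunspots, frequency = 1, start = from, end = to)
  lnx   <- window(lynx, start = from, end = to)
  cor(as.numeric(spots), as.numeric(lnx))   # lag-0 Pearson correlation
}

r_short <- yearly_cor(1880, 1900)   # the hand-picked window
r_full  <- yearly_cor(1821, 1934)   # the whole lynx record

cat("r, 1880-1900:", round(r_short, 2), "\n")
cat("r, 1821-1934:", round(r_full, 2), "\n")
```

Because the two curves are shifted against each other, a lagged version (for instance with ccf(..., plot = FALSE)) may be more telling than the lag-0 value.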

On Sat, Nov 15, 2003 at 03:49:29PM +0100, Jean lobry wrote:
> BTW, does someone know where to find the data for this example
> of a high correlation between beer drinking in the US and child
> mortality in Japan (or something like that; I'm unsure, and I have
> googled around with these keywords but found nothing)?

IIRC someone did a paper a few years ago correlating "everything" in an
OECD or World Bank database of macroeconomic variables... and found
something like milk production in Indonesia to be a perfect predictor
of subsequent SP500 returns. The point was, of course, to show the
fallacy in such 'data mining'.

I wish I had the original reference.

Dirk

--
Those are my principles, and if you don't like them... well, I have others.
                                                -- Groucho Marx
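
The 'data mining' fallacy described above is easy to reproduce with pure noise. The sketch below is not the study Dirk recalls (its reference is unknown); it only assumes a short target series and a large pool of unrelated candidate predictors, all simulated. The sizes n_obs and n_vars and the seed are arbitrary choices.

```r
set.seed(1)      # arbitrary seed, for reproducibility only
n_obs  <- 20     # assumed: 20 observations, e.g. yearly values
n_vars <- 1000   # assumed: 1000 unrelated candidate predictors

target     <- rnorm(n_obs)                            # stands in for "index returns"
candidates <- matrix(rnorm(n_obs * n_vars), nrow = n_obs)

# One correlation per candidate column; every true correlation is zero.
r    <- as.vector(cor(candidates, target))
best <- which.max(abs(r))

cat("best candidate: column", best, "with r =", round(r[best], 2), "\n")
cat("candidates with |r| > 0.5:", sum(abs(r) > 0.5), "\n")
```

With so few observations and so many candidates, some column is bound to look like an excellent predictor purely by chance, which is the point of the anecdote.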

On Sat, 15 Nov 2003, Jean lobry wrote:
> Dear All,
>
> I'm looking for examples showing that correlation does not imply
> causality, the targeted audience consists of undergraduate students
> (their first year at the university but in the BioMathStat track).
> All practicals are under R.
>

There's a nice example we use:
  data at          http://courses.washington.edu/b517/datasets/fev.txt
  documentation at http://courses.washington.edu/b517/datasets/fevdoc.txt

These are lung function (FEV1) data on children, taken at routine
checkups, and we tell people to do a t.test comparing smokers and
non-smokers. There is a large and statistically significant
difference -- the smokers have *higher* FEV1, because they are older.

In a statistics course for graduate students in public health there are
always a few people who see the difference and forget to check the
direction...

        -thomas
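
The mechanism Thomas describes (age drives both smoking and lung function) can be sketched with simulated data. The following does NOT use the real FEV file, whose layout is not shown in this thread; every number in it (age range, effect sizes, noise level) is made up for illustration, and the true smoking effect is set to be negative by construction.

```r
set.seed(42)                             # arbitrary seed
n   <- 600
age <- runif(n, min = 3, max = 19)       # made-up age range, in years

# Older children are much more likely to smoke (assumed relationship).
smoker <- rbinom(n, size = 1, prob = plogis(-7 + 0.5 * age))

# FEV grows with age; smoking LOWERS it by 0.3 (assumed effect sizes).
fev <- 0.5 + 0.2 * age - 0.3 * smoker + rnorm(n, sd = 0.3)

kids <- data.frame(age,
                   smoker = factor(smoker, labels = c("no", "yes")),
                   fev)

raw <- t.test(fev ~ smoker, data = kids)      # ignores age
adj <- lm(fev ~ age + smoker, data = kids)    # adjusts for age

print(raw$estimate)             # group means: smokers come out higher
print(coef(adj)["smokeryes"])   # age-adjusted smoking effect: negative
```

The unadjusted comparison points in the "wrong" direction only because the smokers are older; once age is in the model, the sign of the simulated smoking effect is recovered.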

On Sat, Nov 15, 2003 at 03:49:29PM +0100, Jean lobry wrote:
> Dear All,
>
> I'm looking for examples showing that correlation does not imply
> causality, the targeted audience consists of undergraduate students
> (their first year at the university but in the BioMathStat track).
> All practicals are under R.
>

The dataset below contains data by state, including population in
thousands, area in square miles (the values appear to be in thousands),
percent urban population, percent below the poverty line, whether there
are gun registration laws or not, and the number of homicides. The
socioeconomic data are from 1990/91, from the Census Bureau as I recall.
The gun registration indicator is taken from a USA Today article
(Tuesday, January 7, 1992, PAGE 5A). The article reported that gun
registration laws lead to increased numbers of murders (homicides), a
conclusion reached by comparing the mean number of homicides in states
with gun registration laws to states without registration laws.

"Guns" <-
structure(.Data = list(
"pop" = c(4089, 2372, 30380, 3291, 598, 13277, 1135, 2795, 11543,
  5996, 4860, 9368, 4432, 5158, 6737, 635, 7760, 18058, 10939, 11961,
  1004, 3560, 4953, 17349, 1770, 5018, 570, 3750, 3377, 680, 6623,
  1039, 5610, 2495, 3713, 4252, 1235, 2592, 808, 1593, 1105, 1548,
  1284, 3175, 2922, 703, 6286, 567, 4955, 1801, 460.),
"area" = c(52.4, 53.2, 163.7, 5.5, 0.1, 65.8, 10.9, 56.3, 57.9, 10.6,
  12.4, 96.8, 86.9, 69.7, 53.8, 70.7, 8.7, 54.5, 44.8, 46.1, 1.5, 32,
  42.1, 268.6, 84.9, 71.3, 656.4, 114, 104.1, 2.5, 59.4, 83.6, 36.4,
  82.3, 40.4, 51.8, 35.4, 48.4, 147, 77.4, 9.4, 121.6, 110.6, 69.9,
  98.4, 77.1, 42.8, 9.6, 65.5, 24.2, 97.8),
"urban" = c(60, 54, 93, 79, 100, 85, 89, 61, 85, 84, 81, 70, 71, 53,
  50, 53, 89, 84, 74, 69, 86, 55, 61, 80, 87, 76, 68, 88, 82, 73, 63,
  57, 65, 69, 52, 68, 45, 47, 53, 66, 51, 73, 88, 68, 71, 50, 69, 32,
  66, 36, 65.),
"poverty" = c(19, 18.4, 14.2, 5.8, 19.2, 14.1, 10, 10.1, 13.3, 10.2,
  9.3, 13.9, 12, 13.6, 13.2, 13.5, 9, 14.1, 11.8, 10.8, 8.2, 16.5,
  16.9, 16.8, 9.8, 26.2, 11.2, 14.2, 12.1, 8.1, 16, 13.7, 14.1, 11.1,
  17.4, 22, 12.5, 23.8, 15.8, 10.9, 7.1, 20.9, 10.7, 15.8, 11.3, 13.5,
  10.6, 7.1, 9.2, 17.2, 10.6),
"gunreg" = c(1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
  1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
  0, 0, 0, 0, 0, 0, 0, 0, 0.),
"homicides" = c(410, 240, 3710, 170, 489, 1300, 44, 62, 1270, 200,
  540, 1020, 100, 550, 730, 11, 350, 2550, 760, 740, 38, 350, 470,
  2660, 43, 220, 56, 290, 155, 32, 720, 21, 380, 150, 260, 760, 23,
  370, 29, 43, 32, 160, 135, 220, 120, 9, 550, 24, 240, 135, 20.)),
names = c("pop", "area", "urban", "poverty", "gunreg", "homicides"),
row.names = c("AL", "AR", "CA", "CT", "DC", "FL", "HI", "IA", "IL",
  "MA", "MD", "MI", "MN", "MO", "NC", "ND", "NJ", "NY", "OH", "PA",
  "RI", "SC", "TN", "TX", "UT", "WA", "AK", "AZ", "CO", "DE", "GA",
  "ID", "IN", "KS", "KY", "LA", "ME", "MS", "MT", "NE", "NH", "NM",
  "NV", "OK", "OR", "SD", "VA", "VT", "WI", "WV", "WY"),
class = "data.frame")

=======================================================================
"I would rather be exposed to the inconveniences attending too much
liberty than to those attending too small a degree of it."
                                                -Thomas Jefferson
=======================================================================
http://www.reed.edu/~jones
Albyn Jones                      jones at reed.edu
Reed College, Portland OR 97202  (503)-771-1112 x7418
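
One way to probe the article's comparison is to note that homicide counts scale with population, so a difference in mean counts may say more about state size than about the law. The sketch below contrasts mean counts with mean rates per 100,000 inhabitants. To stay self-contained it uses only six rows copied from the data frame above (CA, NY, ND, AK, GA, WY), chosen purely to keep the block short; the names guns_subset and compare_groups are hypothetical. The conclusion should be checked on the full data with compare_groups(Guns).

```r
# Six rows copied from the posted Guns data frame (pop is in thousands).
guns_subset <- data.frame(
  pop       = c(30380, 18058, 635, 570, 6623, 460),
  gunreg    = c(1, 1, 1, 0, 0, 0),
  homicides = c(3710, 2550, 11, 56, 720, 20),
  row.names = c("CA", "NY", "ND", "AK", "GA", "WY")
)

# Mean homicide count and mean homicide rate, by registration status.
compare_groups <- function(d) {          # hypothetical helper name
  # pop is in thousands, so count / pop * 100 is a rate per 100,000.
  d$rate <- d$homicides / d$pop * 100
  rbind(
    count = tapply(d$homicides, d$gunreg, mean),
    rate  = tapply(d$rate, d$gunreg, mean)
  )
}

res <- compare_groups(guns_subset)
print(round(res, 2))   # columns "0" and "1" are the gunreg groups
```

In this small subset the gap between the groups is far smaller for rates than for raw counts. Percent urban and percent below the poverty line are further candidates for adjustment, for instance in a regression on the full data.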

My favorite example on correlation !-> causation is data from
Yule & Kendall on the relationship between the number of licenses for
radios and the number of people classified as 'mental defectives' in
England and Wales from 1924-37.

Of course, both were increasing over time, the latter due to an
increase in diagnosis using this term, the former due to increased
availability, which accounts for the very high correlation.

I've attached the data as a SAS file.

>
> Subject: [R] correlation and causality examples
> From: Jean lobry <lobry at biomserv.univ-lyon1.fr>
> Date: Sat, 15 Nov 2003 15:49:29 +0100
> To: r-help at stat.math.ethz.ch
>
> Dear All,
>
> I'm looking for examples showing that correlation does not imply
> causality, the targeted audience consists of undergraduate students
> (their first year at the university but in the BioMathStat track).
> All practicals are under R.

--
Michael Friendly     Email: friendly at yorku.ca
Professor, Psychology Dept.
York University      Voice: 416 736-5115 x66249 Fax: 416 736-5814
4700 Keele Street    http://www.math.yorku.ca/SCS/friendly.html
Toronto, ONT  M3J 1P3 CANADA
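
The Yule & Kendall figures themselves are in the attached SAS file and are not reproduced in this thread, so the sketch below uses two simulated series instead. It assumes only what the message states: both quantities trend upward over 1924-37 for unrelated reasons. All values and units are made up, and the variable names are hypothetical.

```r
set.seed(7)                    # arbitrary seed
year  <- 1924:1937
t_idx <- seq_along(year)

# Two independent series that merely share an upward trend (made-up units).
radio     <- 10 + 1.0 * t_idx + rnorm(length(year), sd = 0.5)
diagnoses <-  5 + 1.0 * t_idx + rnorm(length(year), sd = 0.5)

# Correlation of the raw series: dominated by the common trend.
r_levels <- cor(radio, diagnoses)

# Correlation after removing a linear time trend from each series.
r_detrended <- cor(resid(lm(radio ~ year)), resid(lm(diagnoses ~ year)))

cat("levels:   ", round(r_levels, 2), "\n")
cat("detrended:", round(r_detrended, 2), "\n")
```

The raw correlation is very high even though the two series were generated independently. Detrending, or correlating year-to-year differences, is one simple way to show students that time is the lurking variable.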

Dear R-users,

many thanks to all who replied to my request, approx. one month ago,
about examples illustrating that correlation does not imply causality.

I have tried to compile your suggestions in a web site, whose URL is
given in the screenshot in png format there:

http://pbil.univ-lyon1.fr/members/lobry/z.png

and in jpeg format there:

http://pbil.univ-lyon1.fr/members/lobry/z.jpg

It's far from being perfect because of a period of teaching overload,
but I hope to improve it in the future, so your suggestions and
comments are always welcome.

All the best,

Jean
--
Jean R. Lobry
Laboratoire BBE-CNRS-UMR-5558, Univ. C. Bernard - LYON I,
43 Bd 11/11/1918, F-69622 VILLEURBANNE CEDEX, FRANCE
allo : +33 472 43 12 87 fax : +33 472 43 13 88
http://pbil.univ-lyon1.fr/members/lobry/