Liviu Andronic
2009-May-26 19:10 UTC
[R] (OT) Does pearson correlation assume bivariate normality of the data?
Dear all,

The other day I was reading this post [1], which slightly surprised me: "To reject the null of no correlation, an hypothsis test based on the normal distribution. If normality is not the base assumption your working from then p-values, significance tests and conf. intervals dont mean much (the value of the coefficient is not reliable)" (BOB SAMOHYL).

To me this implied that in practice Pearson's product-moment correlation (and its associated significance test) is often used incorrectly. I then went wrestling with the literature, and with my friends, over what the Pearson correlation actually requires, and after about a week I'm still head-banging against divergent opinions. From what I understand there are two aspects to this classical parametric procedure:

1. Estimating the magnitude of the correlation:
- the sample data should come from a bivariate normal distribution (?cor, ?cor.test, Dalgaard 2003, somewhat implied in many examples such as ?rrcov::maryo or Wilcox 2005)
- the sample data should be (I presume univariate) normal (Crawley 2007)
- the sample data can be of any distribution (if I understand correctly the 'distribution-free' definition of correlation in Huber 1981, 2004)
- the sample data could come from just about any bivariate distribution (Wikipedia [2][3] and associated reference)
- the coefficient is not at all robust to univariate outliers (e.g., Huber 1981) or to multivariate outliers (?rrcov::maryo with data from Maronna and Yohai 1998)

2. Assessing whether the correlation is significantly different from zero (using a statistic following the t distribution):
- the data should come from independent normal distributions (?cor.test)
- at least one of the marginal distributions is normal (Wilcox 2005)

Surprisingly (to me), many sources seem quite evasive about clearly defining the Pearson correlation. Reading the literature, I was pretty much convinced that the correlation coefficient is not robust to outliers. The literature is also convincing on the impact of contaminated normal, heavy-tailed distributions on parametric tests (invalidating their results). However, I'm not clear on the distributional assumptions on the data:
- does the data have to be bivariate normal in order to correctly estimate the linear correlation?
- does the data have to be univariate normal in order to correctly estimate the significance of the correlation?

If the above is true, what are the preferable alternatives for non-Gaussian data (including heavy-tailed normal)?
- non-parametric tests (Spearman, Kendall)?
- the robust MASS::cov.mcd, rrcov::CovOgk, robust::covRob()?
- hypothesis testing via Permutation Tests [4]?
- is there a robust cor.test?
- other robust tests of independence?

Thank you,
Liviu

[1] nabble.com/Correlation-on-Tick-Data-tp18589474p18595197.html
[2] en.wikipedia.org/wiki/Correlation#Sensitivity_to_the_data_distribution
[3] en.wikipedia.org/wiki/Pearson_product-moment_correlation_coefficient#Sensitivity_to_the_data_distribution
[4] burns-stat.com/pages/Tutor/bootstrap_resampling.html#permtest

--
Do you know how to read? alienetworks.com/srtest.cfm
Do you know how to write? garbl.home.comcast.net/~garbl/stylemanual/e.htm#e-mail
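To make the alternatives listed in the question concrete, here is a rough sketch (not from the thread) that runs Pearson, Spearman, Kendall and an MCD-based robust correlation on the same made-up data. The sample size, the true slope and the contamination scheme (five identical gross errors at (10, -10)) are all assumptions chosen only for illustration; `cor.test` and `MASS::cov.mcd` are the functions named in the question.

```r
library(MASS)  # provides cov.mcd(); MASS ships with standard R installations

set.seed(1)
n <- 200
x <- rnorm(n)
y <- 0.5 * x + rnorm(n)        # positively correlated, bivariate normal

# Hypothetical contamination: five identical gross errors far from the bulk.
x_bad <- c(x, rep(10, 5))
y_bad <- c(y, rep(-10, 5))

# exact = FALSE avoids warnings about exact p-values in the presence of ties.
pearson  <- cor.test(x_bad, y_bad, method = "pearson")
spearman <- cor.test(x_bad, y_bad, method = "spearman", exact = FALSE)
kendall  <- cor.test(x_bad, y_bad, method = "kendall", exact = FALSE)

# Minimum covariance determinant estimate; cor = TRUE also returns
# the robust correlation matrix.
mcd <- cov.mcd(cbind(x_bad, y_bad), cor = TRUE)

round(c(pearson_clean = cor(x, y),
        pearson       = unname(pearson$estimate),
        spearman      = unname(spearman$estimate),
        kendall       = unname(kendall$estimate),
        mcd           = mcd$cor[1, 2]), 3)
```

The four estimates are not interchangeable: each targets a different population quantity, so the comparison only shows how each reacts to this particular kind of contamination, not which one is "correct".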
Thomas Lumley
2009-May-26 21:01 UTC
[R] (OT) Does pearson correlation assume bivariate normality of the data?
This is the sort of problem (another related one is the assumptions of the t-test) that attracts a lot of relatively inefficient argument.

Some basic points:

1. If random variables X and Y are uncorrelated (and have finite moments, but that's a purely technical issue), the distribution of the Pearson correlation coefficient in samples from X and Y will be Normal with mean zero in large samples. No further assumption about distribution is needed. So, the test is valid in sufficiently large samples.

2. Similarly, the sample correlation coefficient between two random variables X and Y is a consistent estimator of the correlation between X and Y. Here the distribution [needed for confidence intervals] does depend on the distributions of X and Y, but by less than you might expect. For example, I found that Fisher's z-transformation and a t-distribution with n-3 df is a pretty good approximation to the distribution of correlation between lognormal random variables (a model for air pollution data) with a sample size of 10.

3. If X and Y are bivariate Normal and uncorrelated, they must be independent, so the null hypothesis of zero correlation is especially interesting for Normal data.

4. Zero correlation may still be an interesting null hypothesis without bivariate Normality -- if you don't know much about X and Y it may be an advance to be able to establish that Y tends to be higher when X is higher.

5. The correlation coefficient is sensitive to outlying observations. This is not necessarily a bad thing, but it means that if X and Y both have long-tailed distributions the test for zero correlation will be sensitive primarily to the tails.

6. If the tails of the distribution are mostly gross-error contamination, the sensitivity to the tails is bad.

7. The various robust or rank-based correlations don't estimate the same thing, any more than the mean and median estimate the same thing. They don't necessarily even have to have the same sign.
Some of them are intended for bivariate Normal data with gross-error contamination, which is fine if that is what you have. Kendall's tau at least has a sensible interpretation that doesn't depend on distributions, whereas it's not clear to me why the hypothesis of zero Spearman correlation would be interesting without distributional assumptions.

8. Permutation tests will give you an exact small-sample test of *independence*, not of zero correlation. The test is not exact (it may be conservative or anticonservative) if X and Y are dependent but uncorrelated. The test has power only against alternatives where the correlation is non-zero.

Some of the issues behind the confusion are the same as for the t-test:
- a confusion of necessary vs sufficient assumptions
- a confusion of long-tailed distributions and gross-error contamination
- worrying about the meaning of the null hypothesis only for 'parametric' tests and not for 'non-parametric' tests
- not understanding that permutation tests have assumptions.

There is also some genuine and informed disagreement about the relative importance of potential problems. Some of this disagreement is about philosophical issues, and some is about the likely practical impact, which depends a lot on the setting.

-thomas

Thomas Lumley
Assoc. Professor, Biostatistics
tlumley at u.washington.edu
University of Washington, Seattle
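Point 1 of the reply above can be checked by simulation. The following is only a sketch under stated assumptions (independent standard lognormal X and Y, 2000 replications, a 5% level, and the helper name `pearson_rejection_rate`, none of which come from the thread): it estimates how often `cor.test` rejects zero correlation when the data are clearly non-Normal but X and Y are unrelated.

```r
set.seed(42)

# Proportion of simulated data sets in which cor.test() rejects
# H0: correlation = 0 at level alpha, with X and Y independent lognormals.
pearson_rejection_rate <- function(n, nsim = 2000, alpha = 0.05) {
  p <- replicate(nsim, {
    x <- rlnorm(n)
    y <- rlnorm(n)
    cor.test(x, y)$p.value
  })
  mean(p < alpha)
}

# A small and a moderately large sample size.
rates <- c(n10 = pearson_rejection_rate(10),
           n200 = pearson_rejection_rate(200))
rates
```

Rejection rates close to the nominal 0.05 would be consistent with the claim that the test does not need Normal data to be valid. The sketch says nothing about confidence intervals for a non-zero correlation (point 2), which would need a separate simulation with dependent data.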
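Point 8 of the reply above, on what a permutation test actually tests, can also be illustrated. This is a minimal sketch, not anything from the thread: the helper names `perm_cor_pvalue` and `perm_rejection_rate`, the choice of 199 permutations, and the dependent-but-uncorrelated construction Y = X * E (with E standard Normal and independent of X) are all assumptions made for the example.

```r
set.seed(7)

# Two-sided permutation p-value using the Pearson correlation as statistic.
# Permuting y breaks any link between x and y, so the reference
# distribution is the one implied by independence.
perm_cor_pvalue <- function(x, y, nperm = 199) {
  observed <- abs(cor(x, y))
  permuted <- replicate(nperm, abs(cor(x, sample(y))))
  (1 + sum(permuted >= observed)) / (nperm + 1)
}

# Rejection rate at level alpha over repeated simulated data sets;
# make_y is a function that generates y given x.
perm_rejection_rate <- function(make_y, n = 100, nsim = 300, alpha = 0.05) {
  p <- replicate(nsim, {
    x <- rnorm(n)
    perm_cor_pvalue(x, make_y(x))
  })
  mean(p <= alpha)
}

# Independent: y is generated without reference to x.
rate_independent <- perm_rejection_rate(function(x) rnorm(length(x)))

# Dependent but uncorrelated: Cov(X, X * E) = E[X^2] * E[E] = 0,
# yet the spread of Y grows with |X|.
rate_dependent <- perm_rejection_rate(function(x) x * rnorm(length(x)))

c(independent = rate_independent, dependent_uncorrelated = rate_dependent)
```

In this particular setup the second rate tends to come out well above the nominal 5% even though the true correlation is zero, which is the anticonservative case the reply mentions. A different dependence structure could just as well make the test conservative.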
Liviu Andronic
2009-May-29 07:38 UTC
[R] (OT) Does pearson correlation assume bivariate normality of the data?
Thanks all for the on- and off-list responses. For a relevant discussion see the "normality tests" thread [1], and specifically another excellent overview by Thomas on the robustness of the t-test [2].

Best,
Liviu

[1] thread.gmane.org/gmane.comp.lang.r.general/86180
[2] article.gmane.org/gmane.comp.lang.r.general/86353