Hi all,

I have two very large samples of data (10000+ data points) and would like to perform normality tests on them. I know that p < .05 means that a data set is considered not normal with either of the two tests. I am also aware that large samples tend to lead more likely to normal results (Andy Field, 2005).

I have a few questions to ensure that I am using them right.

1) The Shapiro-Wilk test requires providing a mean and sd. Is it correct to supply the mean and sd of the data itself (since I am comparing to a normal distribution with the same parameters)?

    mySD <- sd(mydata$myfield)
    myMean <- mean(mydata$myfield)
    shapiro.test(rnorm(100, mean = myMean, sd = mySD))

2) If I just want to test each distribution individually, I assume that I am doing a one-sample Kolmogorov-Smirnov test. Is that correct?

3) If I simply want to know whether normality holds or not, what should I put for the parameter 'alternative'? Does it actually matter?

    alternative = c("two.sided", "less", "greater")

Thank you,
Ralf

On 2010-06-23 12:05, Ralf B wrote:
> Hi all,
>
> I have two very large samples of data (10000+ data points) and would
> like to perform normality tests on it. I know that p < .05 means that
> a data set is considered as not normal with any of the two tests. I am
> also aware that large samples tend to lead more likely to normal
> results (Andy Field, 2005).

I think that depends on what you mean by 'tend to lead ...'.

> I have a few questions to ensure that I am using them right.
>
> 1) The Shapiro-Wilk test requires to provide mean and sd. Is it
> correct to add here the mean and sd of the data itself (since I am
> comparing to a normal distribution with the same parameters) ?
>
> mySD <- sd(mydata$myfield)
> myMean <- mean(mydata$myfield)
> shapiro.test(rnorm(100, mean = myMean, sd = mySD))

I don't think that your understanding of the S-W test is correct. The call above tests 100 freshly simulated Normal values, not your data. To test your data for Normality, you would just do:

    shapiro.test(mydata$myfield)

However, shapiro.test() won't accept sample sizes greater than 5000. So use ks.test. Or use a graphical method: I like qq.plot in the 'car' package.

> 2) If I just want to test each distribution individually, I assume
> that I am doing a one-sample Kolmogorov-Smirnov test. Is that correct?

I don't understand this. What do you mean by 'test ... individually'?

> 3) If I simply want to know if normality exists or not, what should I
> put for the parameter 'alternative' ? Does it actually matter?
>
> alternative = c("two.sided", "less", "greater")

Leave it at the default 'two.sided' unless you have good reason to suspect that the cdf lies above or below the Normal cdf.

-Peter Ehlers

> Thank you,
> Ralf
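
For readers following along, here is a minimal sketch of the calls discussed in this reply. It is only an illustration under stated assumptions: the vector `x` is simulated and stands in for mydata$myfield (the real data are not shown in the thread), and drawing a random subsample of 5000 to get under the shapiro.test() size limit is one possible workaround, not something the posters recommended. The base-R qqnorm()/qqline() pair is used here in place of the 'car' plotting function mentioned above.

```r
# Simulated stand-in for mydata$myfield (assumption: real data not available)
set.seed(1)
x <- rnorm(10000, mean = 50, sd = 10)

# Shapiro-Wilk takes the data themselves; no mean or sd is supplied.
# shapiro.test() accepts between 3 and 5000 values, so a random
# subsample is tested here (an illustrative workaround only).
sw <- shapiro.test(sample(x, 5000))

# One-sample Kolmogorov-Smirnov test against a Normal distribution.
# Caveat: the mean and sd are estimated from the same data, so the
# reported p-value is only approximate (the test becomes conservative).
ks <- ks.test(x, "pnorm", mean = mean(x), sd = sd(x),
              alternative = "two.sided")

# Graphical check: Normal quantile-quantile plot with a reference line.
qqnorm(x)
qqline(x)

print(sw$p.value)
print(ks$p.value)
```

Note that 'alternative' is an argument of ks.test() only; shapiro.test() has no such argument.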

Before doing normality tests, look at fortune(117) and fortune(234).

If you still feel the need to have the computer print out a p-value for a test of exact normality, then try SnowsPenultimateNormalityTest in the TeachingDemos package.

If you want a test that is more meaningful, then look at vis.test (also in the TeachingDemos package).

-- 
Gregory (Greg) L. Snow Ph.D.
Statistical Data Center
Intermountain Healthcare
greg.snow at imail.org
801.408.8111

> -----Original Message-----
> From: r-help-bounces at r-project.org [mailto:r-help-bounces at r-project.org] On Behalf Of Ralf B
> Sent: Wednesday, June 23, 2010 12:05 PM
> To: r-help at r-project.org
> Subject: [R] About normality tests...
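
The visual approach suggested in this reply can be roughed out in base R. The sketch below is not vis.test itself and does not use the TeachingDemos package; it only illustrates the general idea, as I understand it, of hiding the real data's plot among plots of simulated Normal data and asking whether a viewer can pick it out. The data `x`, the 3-by-3 panel layout, and the names `n_panels` and `true_pos` are all hypothetical choices made for this illustration.

```r
# Hypothetical data with somewhat heavier tails than a Normal distribution
set.seed(2)
x <- rt(500, df = 5)

n_panels <- 9
true_pos <- sample(n_panels, 1)   # panel in which the real data are hidden

# Draw 9 Normal Q-Q plots: 8 from simulated Normal samples with the same
# mean and sd as x, and 1 (at position true_pos) from the real data.
op <- par(mfrow = c(3, 3), mar = c(2, 2, 2, 1))
for (i in seq_len(n_panels)) {
  y <- if (i == true_pos) x else rnorm(length(x), mean(x), sd(x))
  qqnorm(y, main = paste("Panel", i))
  qqline(y)
}
par(op)

# Reveal the answer after the viewer has made a guess.
cat("Real data were in panel", true_pos, "\n")
```

If a viewer who has not seen `true_pos` cannot tell the real panel from the simulated ones, the departure from Normality is probably too small to matter for most purposes, whatever a formal test on 10000+ points reports.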