Cecile De Cat
2012-Apr-18 15:21 UTC
[R] normal distribution assumption for multi-level modelling
Hello,

I'm analysing reaction time data from a linguistic experiment (a variant of a lexical decision task). To ascertain that the data was normally distributed, I used shapiro.test for each participant (see commands below), but only one out of 21 returns a p value above 0.05.

> f = function(dfr) return(shapiro.test(dfr$Target.RTinv)$p.value)
> p = as.vector(by(newdat, newdat$Subject, f))
> names(p) = levels(newdat$Subject)
> names(p[p < 0.05])

Removing a few outliers per subject doesn't make a difference, and "aggressive" removal of outliers (done by subject, for each of the 6 conditions) still results in non-normally distributed data by subject.

Does this invalidate any attempt at multi-level modelling?

Many thanks in advance for your help.

Cecile
Ben Bolker
2012-Apr-18 18:01 UTC
[R] normal distribution assumption for multi-level modelling
Cecile De Cat <c.decat <at> leeds.ac.uk> writes:

> I'm analysing reaction time data from a linguistic experiment (a variant of
> a lexical decision task). To ascertain that the data was normally
> distributed, I used shapiro.test for each participant (see commands
> below), but only one out of 21 returns a p value above 0.05.
>
> > f = function(dfr) return(shapiro.test(dfr$Target.RTinv)$p.value)
> > p = as.vector(by(newdat, newdat$Subject, f))
> > names(p) = levels(newdat$Subject)
> > names(p[p < 0.05])
>
> Removing a few outliers per subject doesn't make a difference, and
> "aggressive" removal of outliers (done by subject, for each of the 6
> conditions) still results in non-normally distributed data by subject.
>
> Does this invalidate any attempt at multi-level modelling?

I don't think so.

1. You should be concerned about the normality of the *residuals* of your response variable, i.e. the conditional distribution of your data, not the marginal distribution of the data. (If you only have categorical predictors, you could equivalently look *within* the smallest sampling unit where you expect a constant mean.)

2. Many statisticians would say you shouldn't be doing hypothesis tests of normality for this purpose in any case. If you have little data, the tests have low power (so you won't detect non-normal data), while if you have a great deal, the tests can be *too* powerful (i.e. you detect significant deviations from normality which do not actually compromise the inferences you would be making from your analysis). I don't have a great citation for this handy, but one is listed below (Cherry 1998).

3. You're not applying any multiple-comparisons correction, so getting 1 out of 20 (let alone 1 out of 21) p values < 0.05 is exactly as expected if the null hypothesis were true.

Follow-ups to r-sig-mixed-models <at> r-project.org, although this issue (hypothesis testing as a way to validate the statistical assumptions of a model) is not specific to mixed models.
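Point 1 above can be sketched in base R. This is only an illustration under stated assumptions: the data frame `d`, its columns `rt_inv`, `subject`, and `condition`, and all simulated values are hypothetical stand-ins for the poster's `newdat`, which is not available. With purely categorical predictors, subtracting each subject-by-condition cell mean gives residuals within the smallest unit where a constant mean is expected, and these can then be inspected graphically.

```r
# Hypothetical stand-in for the poster's data: 21 subjects, 6 conditions,
# 20 trials per cell. All names and values are illustrative only.
set.seed(1)
d <- expand.grid(trial = 1:20,
                 condition = factor(1:6),
                 subject = factor(1:21))
subj_shift <- rnorm(21, sd = 0.3)[as.integer(d$subject)]            # subject offsets
cond_shift <- c(0, 0.1, 0.2, 0.3, 0.4, 0.5)[as.integer(d$condition)] # condition effects
d$rt_inv <- 2 + subj_shift + cond_shift + rnorm(nrow(d), sd = 0.2)

# Residuals within each subject-by-condition cell:
# subtract the cell mean from every observation in that cell.
d$resid <- d$rt_inv - ave(d$rt_inv, d$subject, d$condition)

# Look at the conditional distribution rather than the marginal one.
qqnorm(d$resid, main = "Within-cell residuals")
qqline(d$resid)
```

For a fitted mixed model, the same plot would be drawn from the model's own residuals instead of cell-mean residuals.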
@article{cherry_statistical_1998,
  title = {Statistical Tests in Publications of The Wildlife Society},
  volume = {26},
  issn = {0091-7648},
  url = {http://www.jstor.org/stable/3783574},
  number = {4},
  journal = {Wildlife Society Bulletin},
  author = {Cherry, Steve},
  month = dec,
  year = {1998},
  pages = {947--953}
}
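As a hedged illustration of the multiple-comparisons correction mentioned in point 3 of the reply above: if per-subject p values are computed at all, they can be adjusted with `p.adjust` from base R's stats package. The vector `p` below is made up (uniform random values for 21 hypothetical subjects) and only stands in for the vector computed in the original post; Holm's method is one reasonable choice, not one named in the thread.

```r
# Hypothetical per-subject p values standing in for the vector `p`
# from the original post; the values are random, not real results.
set.seed(2)
p <- runif(21)
names(p) <- paste0("S", 1:21)

# Holm's step-down adjustment controls the family-wise error rate
# across the 21 tests.
p_holm <- p.adjust(p, method = "holm")

# Subjects flagged before and after the adjustment.
names(p)[p < 0.05]
names(p)[p_holm < 0.05]
```

Adjusted p values are never smaller than the raw ones, so the second list can only be as long as the first or shorter.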
Bert Gunter
2012-Apr-18 18:55 UTC
[R] normal distribution assumption for multi-level modelling
Cecile:

On Wed, Apr 18, 2012 at 8:21 AM, Cecile De Cat <c.decat at leeds.ac.uk> wrote:
> Hello,
>
> I'm analysing reaction time data from a linguistic experiment (a variant of
> a lexical decision task). To ascertain that the data was normally
> distributed, I used shapiro.test for each participant (see commands
> below), but only one out of 21 returns a p value above 0.05.
>
>> f = function(dfr) return(shapiro.test(dfr$Target.RTinv)$p.value)
>> p = as.vector(by(newdat, newdat$Subject, f))
>> names(p) = levels(newdat$Subject)
>> names(p[p < 0.05])
>
> Removing a few outliers

!! Yikes!! I won't say "Don't do this." But I will say that this can be a very dangerous and unscientific thing to do, leading to biased, misleading results.

> per subject doesn't make a difference, and
> "aggressive" removal of outliers (done by subject, for each of the 6
> conditions) still results in non-normally distributed data by subject.
>
> Does this invalidate any attempt at multi-level modelling?

How can we possibly know without knowing in detail the objectives of the investigation, the nature of the data, and the details of the analysis you did??!

On general principles, normality is rarely of any real importance; lack of independence (or, in general, non-adherence to the covariance structures specified) usually is. So "any attempt" seems too general a claim to support. Indeed, a good graphical analysis -- often the most scientifically informative thing to do anyway -- is almost always a good thing to do.

As this has little to do with R, you should follow up on a statistical list, like stats.stackexchange.com .

-- Bert

> Many thanks in advance for your help.
>
> Cecile

--
Bert Gunter
Genentech Nonclinical Biostatistics
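One way to do the kind of graphical analysis recommended above is a panel of per-subject normal Q-Q plots. The sketch below uses base R graphics only and simulated data; the data frame `d` and its columns `subject` and `rt_inv` are hypothetical stand-ins for the poster's `newdat`, and the 3-by-7 layout is simply chosen to fit 21 subjects.

```r
# Hypothetical data: 21 subjects with 60 observations each.
# Names and values are illustrative only.
set.seed(3)
d <- data.frame(subject = factor(rep(1:21, each = 60)),
                rt_inv  = rnorm(21 * 60, mean = 2, sd = 0.3))

# One normal Q-Q plot per subject, arranged in a 3 x 7 grid.
op <- par(mfrow = c(3, 7), mar = c(2, 2, 2, 1))
for (s in levels(d$subject)) {
  x <- d$rt_inv[d$subject == s]
  qqnorm(x, main = paste("Subject", s), cex = 0.5)
  qqline(x)  # reference line through the quartiles
}
par(op)      # restore the previous graphics settings
```

Systematic curvature or heavy tails show up directly in such panels, which is often more informative than a list of p values.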
Cecile De Cat
2012-Apr-19 13:49 UTC
[R] normal distribution assumption for multi-level modelling
Thanks. I appreciate this isn't strictly an R question and will pursue on another list.

The procedure I followed was inspired by

@article{
  Author = {Baayen, R. Harald and Milin, Petar},
  Title = {Analysing Reaction Times},
  Journal = {International Journal of Psychological Research},
  Volume = {3},
  Number = {2},
  Pages = {12--28},
  Year = {2010}
}

Best,

Cecile