Michael Rennie
2004-Mar-04 18:30 UTC
[R] Testing significance in a design with unequal but proportional sample sizes
Hi, all

I have a rather un-ideal dataset that I am trying to work with, and would appreciate any advice you have on the matter.

I have 4 years' worth of data taken at 3 depth zones, from which samples have been taken at random. I am looking at the abundance of organism A between depth zones and across years, and am interested in the possible interaction, that is, whether the distribution of organism A shifts between depth zones over time. Unfortunately, the sample sizes (n) differ between depth zones, as follows:

                      Year
                 1    2    3    4
Depth Zone  1   15   15   15   15
            2   10   10   10   10
            3    5    5    5    5

As such, I have a 2-way anova with unequal but proportional subclass numbers. Sokal and Rohlf (3rd Ed., 1995) have a nifty method of working out sums of squares in this type of scenario (pages 357-358, box 11.6). However, they don't tell you how to calculate the probabilities, but refer the reader on to Snedecor and Cochran (1967), which I am on my way to consult shortly.

I'm curious as to whether there is a more straightforward method of coding this into R, rather than having to more or less customize my own statistical test. I found some discussions in the archives from 2001 revolving around type III sums of squares, but the lack of consensus in that discussion did little to assure me that I should try this approach.

Anyone with advice, code or suggestions, I'd love to hear any of it.

Cheers,

Mike

--
Michael Rennie
Ph.D. Candidate
University of Toronto at Mississauga
3359 Mississauga Rd. N.
Mississauga ON L5L 1C6
Ph: 905-828-5452 Fax: 905-828-3792
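One point that may help here: when subclass numbers are proportional, as in the 15/10/5 layout above, the two main effects are orthogonal, so the sequential sums of squares that R reports for depth and year should not depend on the order of the terms. The sketch below checks this on simulated data. The data frame `d`, the response `abundance` and all of its values are hypothetical stand-ins, not the real samples.

```r
# Hypothetical data with the proportional layout from the question:
# 3 depth zones (n = 15, 10, 5 per year) crossed with 4 years.
set.seed(1)
cells <- expand.grid(depth = factor(1:3), year = factor(1:4))
cells$n <- c(15, 10, 5)[as.integer(cells$depth)]

# One row per sample; abundance values are simulated, not real.
d <- cells[rep(seq_len(nrow(cells)), cells$n), c("depth", "year")]
d$abundance <- rnorm(nrow(d), mean = 10 + as.integer(d$depth))

# Confirm the cell counts match the table in the question.
print(table(d$depth, d$year))

# Sequential (type I) ANOVA tables with the main effects in both orders.
tab_dy <- anova(lm(abundance ~ depth * year, data = d))
tab_yd <- anova(lm(abundance ~ year * depth, data = d))

# With proportional cell sizes the main-effect sums of squares agree.
print(tab_dy)
print(tab_yd)
```

If this holds for the real data, an ordinary `anova()` table may already be adequate, under the usual assumption of roughly normal residuals with equal variance.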
Tom Blackwell
2004-Mar-04 20:06 UTC
[R] Testing significance in a design with unequal but proportional sample sizes
Michael -

Since your email says that the data are "the abundance of organism A", I am moved to ask whether the abundances are integer counts, sometimes zero, and whether the "samples" are perhaps dips of a net, or the contents of a filter after pumping a certain amount of water through it, or something akin to 'quadrats' in forest sampling.

If the abundances are integer counts, then it would be natural to analyze the data with a log-linear model using R's glm() rather than with anova. Snedecor and Cochran is an excellent book, but for this purpose Venables and Ripley's MASS (Modern Applied Statistics with S and S-plus) might be better.

- tom blackwell - u michigan medical school - ann arbor -
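A minimal sketch of the log-linear approach suggested above, assuming the abundances really are counts. The data are simulated: the data frame `d`, the response `count` and the Poisson means are hypothetical, chosen only to mimic the 15/10/5 layout.

```r
# Hypothetical count data with the same design as in the question.
set.seed(42)
cells <- expand.grid(depth = factor(1:3), year = factor(1:4))
cells$n <- c(15, 10, 5)[as.integer(cells$depth)]
d <- cells[rep(seq_len(nrow(cells)), cells$n), c("depth", "year")]

# Simulated counts whose mean depends on depth only (an assumption).
d$count <- rpois(nrow(d), lambda = c(2, 4, 6)[as.integer(d$depth)])

# Poisson log-linear model with a depth-by-year interaction.
fit <- glm(count ~ depth * year, family = poisson, data = d)

# Sequential analysis of deviance; the depth:year row tests whether
# the depth distribution shifts between years.
dev_table <- anova(fit, test = "Chisq")
print(dev_table)
```

If the counts turn out to be more variable than a Poisson model allows, `family = quasipoisson` is one alternative worth considering.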
pallier
2004-Mar-04 23:08 UTC
[R] Testing significance in a design with unequal but proportional sample sizes
Hello,

This is a follow-up on the question about the analysis of unbalanced data, based on my (limited) understanding of what goes on in such cases.

When the data are unbalanced in a factorial design, the main effect of a given factor can be defined in several ways. Which type of main effect is relevant depends on the scientific question. Some textbooks distinguish between weighted and unweighted mean effects.

If you use the 'aov' function with an unbalanced design, it will report (for the first factor in the formula) the F-ratio associated with the "weighted means" solution. That is, the computation of the main effect ignores the imbalance: the effect size of a factor 'a' is computed regardless of the distribution of the units among the other factors. Consider:

> x<-scan()
1: 1 2 3
4: 4 5 6 7 8
9: 1 2 3 4 5
14: 6 7 8
17:
Read 16 items
> a<-factor(rep(c(1,2),c(8,8)))
> b<-factor(rep(c(1,2,1,2),c(3,5,5,3)))
>
> tapply(x,list(a=a,b=b),mean)
   b
a   1 2
  1 2 6
  2 3 7
> tapply(x,a,mean)
  1   2
4.5 4.5

If all units are given the same weight (that is, we ignore the factor 'b'), then the main effect of a is 0. This is confirmed by:

> summary(aov(x~a*b))
            Df    Sum Sq   Mean Sq   F value    Pr(>F)
a            1 2.417e-32 2.417e-32 1.209e-32 1.0000000
b            1        60        60        30 0.0001413 ***
a:b          1 5.621e-31 5.621e-31 2.810e-31 1.0000000
Residuals   12        24         2

This is called the weighted means approach because the subgroups defined by the crossing of a and b are given weights proportional to their size.

Now, another approach is to forget about the individual units and just consider the table of means:

> tapply(x,list(a=a,b=b),mean)
   b
a   1 2
  1 2 6
  2 3 7

Forgetting about the sample sizes, one way to define the main effect of 'a' is as the mean of 2 and 6 versus the mean of 3 and 7:

> t=tapply(x,list(a=a,b=b),mean)
> diff(apply(t,1,mean))
2
1

That is '1'.

One can compute a "fake" Mean Square associated with 'a' as (n-1)*effect-size=15*1=15, and compare it to the MSE from the previous ANOVA (2 with 12 d.f.). The F-ratio=15/2=7.5 reaches significance:

> pf(7.5,1,12)
[1] 0.9820225
>

(pf gives the lower tail, so the p-value is 1 - 0.982 = 0.018.)

If I am correct, this is what textbooks call the "unweighted means" approach. In many cases, it is this type of main effect which is relevant (especially when the imbalance is due to random missing observations).

I do not know if there is a solution with R for easily computing the unweighted main effects and assessing their significance. (Anyone?)

Actually, the different types of main effects defined above just correspond to different contrasts on the cell means. So if there is an easy solution to compute arbitrary contrasts on the cell means in a factorial design, this could be an approach to this question. (Anyone?)

Christophe
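As a rough sketch of the "contrast on the cell means" idea, the unweighted main effect of 'a' can be tested with the standard single-degree-of-freedom formula SS = L^2 / sum(w^2 / n), where L is the estimated contrast, w the contrast weights and n the cell sizes. The names `m`, `n`, `w`, `L`, `ss_a`, `f_a` and `p_a` are introduced here for illustration only. For these data the formula gives a sum of squares of 3.75 and F = 1.875 on 1 and 12 d.f., not the 15 and 7.5 of the shortcut above, so the conclusion about significance may differ.

```r
# Data from the example above.
x <- c(1, 2, 3,  4, 5, 6, 7, 8,  1, 2, 3, 4, 5,  6, 7, 8)
a <- factor(rep(c(1, 2), c(8, 8)))
b <- factor(rep(c(1, 2, 1, 2), c(3, 5, 5, 3)))

m <- tapply(x, list(a = a, b = b), mean)    # 2 x 2 table of cell means
n <- tapply(x, list(a = a, b = b), length)  # 2 x 2 table of cell sizes

# Contrast weights: unweighted mean of row a=2 minus that of row a=1.
w <- rbind(c(-0.5, -0.5),
           c( 0.5,  0.5))

# Residual mean square from the full two-way model.
fit <- aov(x ~ a * b)
mse <- deviance(fit) / df.residual(fit)

L    <- sum(w * m)              # estimated contrast (the effect size)
ss_a <- L^2 / sum(w^2 / n)      # sum of squares on 1 d.f.
f_a  <- ss_a / mse              # F-ratio against the residual mean square
p_a  <- pf(f_a, 1, df.residual(fit), lower.tail = FALSE)  # upper-tail p-value

print(c(contrast = L, SS = ss_a, F = f_a, p = p_a))
```

Other main-effect definitions correspond to other choices of `w`; for instance, weights proportional to the cell sizes within each level of 'a' reproduce the weighted-means comparison.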
Prof Brian Ripley
2004-Mar-05 07:59 UTC
[R] Testing significance in a design with unequal but proportional sample sizes
On Fri, 5 Mar 2004, pallier wrote:

...
> Actually, the different types of main effects defined above just
> correspond to different contrasts on the cell means. So if there is
> an easy solution to compute arbitrary contrasts on the cell means in
> a factorial design, this could an approach to this question. (Anyone?)

There are at least three such ways. ?contrasts (for the assignment function contrasts<-) and ?C, as well as the contrasts= argument to aov (the function you were discussing ...).

--
Brian D. Ripley,                  ripley at stats.ox.ac.uk
Professor of Applied Statistics,  http://www.stats.ox.ac.uk/~ripley/
University of Oxford,             Tel:  +44 1865 272861 (self)
1 South Parks Road,                     +44 1865 272866 (PA)
Oxford OX1 3TG, UK                Fax:  +44 1865 272595
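The three routes mentioned above can be sketched as follows, using the small data set from earlier in the thread and sum-to-zero contrasts (`contr.sum`). The choice of `contr.sum` and the object names `a_sum`, `b_sum`, `a_c`, `b_c`, `fit1`, `fit2` and `fit3` are assumptions made for illustration. Under this coding, the coefficient for 'a' in the full model is the deviation of the unweighted row mean from the unweighted grand mean.

```r
# Data from the earlier example in this thread.
x <- c(1, 2, 3,  4, 5, 6, 7, 8,  1, 2, 3, 4, 5,  6, 7, 8)
a <- factor(rep(c(1, 2), c(8, 8)))
b <- factor(rep(c(1, 2, 1, 2), c(3, 5, 5, 3)))

# 1. The assignment function contrasts<- (applied to copies of a and b).
a_sum <- a; contrasts(a_sum) <- contr.sum(nlevels(a))
b_sum <- b; contrasts(b_sum) <- contr.sum(nlevels(b))
fit1 <- lm(x ~ a_sum * b_sum)

# 2. C(), which returns a factor with the contrasts attribute set.
a_c <- C(a, contr.sum)
b_c <- C(b, contr.sum)
fit2 <- lm(x ~ a_c * b_c)

# 3. The contrasts= argument of aov (lm accepts it as well).
fit3 <- aov(x ~ a * b, contrasts = list(a = "contr.sum", b = "contr.sum"))

# All three fits carry the same coefficients.
print(rbind(unname(coef(fit1)), unname(coef(fit2)), unname(coef(fit3))))

# The squared t statistic of the 'a' coefficient is the F-ratio for the
# unweighted main effect of 'a' (1 d.f.).
t_a <- summary(fit1)$coefficients[2, "t value"]
print(t_a^2)
```

Which coding is appropriate depends on the main-effect definition one actually wants to test; sum-to-zero coding is only one option.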