Dear all, I'm analysing a split-plot experiment, where there are sometimes one or two values missing. I realized that if the data is slightly unbalanced, the effect of the subplot-treatment will also appear and be tested against the mainplot-error term. I replicated this with the Oats dataset from Yates (1935), contained in the nlme package, where Variety is on mainplot, and nitro on subplot. > # Oats dataset (Yates 1935) from the nlme package > require(nlme); data <- get(data(Oats)) > data$nitro <- factor(data$nitro);data$Block <- as.factor(as.character(data$Block)) > nrow(data) # 6 Blocks * 4 N-levels * 3 Varieties = 72 obs -> orthogonal and balanced [1] 72 > # split-plot anova > summary(aov(yield ~ Block+Variety*nitro + Error(Block/Variety),data)) Error: Block ????? Df Sum Sq Mean Sq Block? 5? 15875??? 3175 Error: Block:Variety ????????? Df Sum Sq Mean Sq F value Pr(>F) Variety??? 2?? 1786?? 893.2?? 1.485? 0.272 Residuals 10?? 6013?? 601.3 Error: Within ????????????? Df Sum Sq Mean Sq F value?? Pr(>F) nitro????????? 3? 20020??? 6674? 37.686 2.46e-12 *** Variety:nitro? 6??? 322????? 54?? 0.303??? 0.932 Residuals???? 45?? 7969???? 177 --- Signif. codes:? 0 ?***? 0.001 ?**? 0.01 ?*? 0.05 ?.? 0.1 ? ? 1 > reduceddata <- data[-5,] # delete one observation > summary(aov(yield ~ Block+Variety*nitro + Error(Block/Variety),reduceddata)) Error: Block ????? Df Sum Sq Mean Sq Block? 5? 16070??? 3214 Error: Block:Variety ????????? Df Sum Sq Mean Sq F value Pr(>F) Variety??? 2?? 1820?? 910.0?? 1.373? 0.302 nitro????? 1????? 1???? 1.0?? 0.001? 0.970 Residuals? 9?? 5964?? 662.7 Error: Within ????????????? Df Sum Sq Mean Sq F value?? Pr(>F) nitro????????? 3? 19766??? 6589?? 36.88 4.49e-12 *** Variety:nitro? 6??? 333????? 55??? 0.31??? 0.928 Residuals???? 44?? 7860???? 179 Although I fully understand, that mixed-models should be preferred for non-orthogonal/unbalanced data, I think it's still valid to use the aov() approach, when only 1 or 2 datapoints are missing. Also, I don't really understand, why the structure of the ANOVA changes suddenly. Is there any argument, I could supply to change this behaviour? When I do the same with lm() and subsequent anova(), and calculate F-value for Variety by hand, the estimates are still quite robust. Best regards, Sam -- Samuel Knapp Lehrstuhl f?r Pflanzenern?hrung Technische Universit?t M?nchen (Chair of Plant Nutrition Technical University of Munich) Emil-Ramann-Strasse 2 D-85354 Freising Tel. +49 8161 71-3578 samuel.knapp at tum.de www.researchgate.net/profile/Samuel_Knapp -- Samuel Knapp Lehrstuhl f?r Pflanzenern?hrung Technische Universit?t M?nchen (Chair of Plant Nutrition Technical University of Munich) Emil-Ramann-Strasse 2 D-85354 Freising Tel. +49 8161 71-3578 samuel.knapp at tum.de www.researchgate.net/profile/Samuel_Knapp [[alternative HTML version deleted]]
peter dalgaard
2017-Oct-11 09:55 UTC
[R] Unbalanced data in split-plot analysis with aov()
> On 11 Oct 2017, at 00:07 , Samuel Knapp <samuel.knapp at tum.de> wrote: > > Dear all, > > I'm analysing a split-plot experiment, where there are sometimes one or > two values missing. I realized that if the data is slightly unbalanced, > the effect of the subplot-treatment will also appear and be tested > against the mainplot-error term. > > I replicated this with the Oats dataset from Yates (1935), contained in > the nlme package, where Variety is on mainplot, and nitro on subplot. > >> # Oats dataset (Yates 1935) from the nlme package >> require(nlme); data <- get(data(Oats)) >> data$nitro <- factor(data$nitro);data$Block <- > as.factor(as.character(data$Block)) >> nrow(data) # 6 Blocks * 4 N-levels * 3 Varieties = 72 obs -> > orthogonal and balanced > [1] 72 >> # split-plot anova >> summary(aov(yield ~ Block+Variety*nitro + Error(Block/Variety),data)) > > Error: Block > Df Sum Sq Mean Sq > Block 5 15875 3175 > > Error: Block:Variety > Df Sum Sq Mean Sq F value Pr(>F) > Variety 2 1786 893.2 1.485 0.272 > Residuals 10 6013 601.3 > > Error: Within > Df Sum Sq Mean Sq F value Pr(>F) > nitro 3 20020 6674 37.686 2.46e-12 *** > Variety:nitro 6 322 54 0.303 0.932 > Residuals 45 7969 177 > --- > Signif. codes: 0 ?***? 0.001 ?**? 0.01 ?*? 0.05 ?.? 0.1 ? ? 1 >> reduceddata <- data[-5,] # delete one observation >> summary(aov(yield ~ Block+Variety*nitro + > Error(Block/Variety),reduceddata)) > > Error: Block > Df Sum Sq Mean Sq > Block 5 16070 3214 > > Error: Block:Variety > Df Sum Sq Mean Sq F value Pr(>F) > Variety 2 1820 910.0 1.373 0.302 > nitro 1 1 1.0 0.001 0.970 > Residuals 9 5964 662.7 > > Error: Within > Df Sum Sq Mean Sq F value Pr(>F) > nitro 3 19766 6589 36.88 4.49e-12 *** > Variety:nitro 6 333 55 0.31 0.928 > Residuals 44 7860 179 > > Although I fully understand, that mixed-models should be preferred for > non-orthogonal/unbalanced data, I think it's still valid to use the > aov() approach, when only 1 or 2 datapoints are missing. Also, I don't > really understand, why the structure of the ANOVA changes suddenly. Is > there any argument, I could supply to change this behaviour?The answer to this is in the math, and not really possible to cover here. If things are non-orthogonal, you don't have the pretty separation of effects into error strata and one effect of that is that you can see the same effect occurring multiple time with different error terms. (In a plain block design, if not all treatments occur equally often in each block, the block means contain information about the treatment effects if block effects are considered random). Worse, if the balanced structure of the design is broken, the orthogonal decomposition of random effects does not correspond to the covariance structure that you think you are assuming. The consequences of this require considerable expertise to figure out. (I forget the details, but the nature is like having to assume variance of random effects to depend on block size.) One known-valid way of dealing with a small number of missing values, is to retain the design, set the value of the missing values to something arbitrary, and include a dummy variable for each missing observation which is 1 for the position of that value and zero elsewhere. You then still get the effect of estimating the dummy coefficient in multiple strata corresponding to an extended model where the effect of the dummy is different for block means than for residuals. However, at least you then know fairly well what is going on. So, something like this (removing block from the fixed effects because you also had is as a random effect):> i5 <- rep(0,72); i5[5] <- 1 > summary(aov(yield ~ i5 + nitro*Variety + Error(Block/Variety),data))Error: Block Df Sum Sq Mean Sq F value Pr(>F) i5 1 14163 14163 33.08 0.00453 ** Residuals 4 1713 428 --- Signif. codes: 0 ?***? 0.001 ?**? 0.01 ?*? 0.05 ?.? 0.1 ? ? 1 Error: Block:Variety Df Sum Sq Mean Sq F value Pr(>F) i5 1 26 26.0 0.039 0.847 Variety 2 1809 904.7 1.365 0.304 Residuals 9 5964 662.7 Error: Within Df Sum Sq Mean Sq F value Pr(>F) i5 1 352 352 1.971 0.167 nitro 3 19766 6589 36.885 4.49e-12 *** nitro:Variety 6 333 55 0.310 0.928 Residuals 44 7860 179 --- Signif. codes: 0 ?***? 0.001 ?**? 0.01 ?*? 0.05 ?.? 0.1 ? ? 1 You will notice that in this analysis, you can modify the value for yield[5] at will, and still get the same output, except of course for the "i5" entries. Also, quite interestingly, this reproduces the analysis with the 5th obs removed, _provided_ that you put nitro before Variety in the model spec:> summary(aov(yield ~ nitro*Variety + Error(Block/Variety),data[-5,]))Error: Block Df Sum Sq Mean Sq F value Pr(>F) nitro 1 14357 14357 33.53 0.00442 ** Residuals 4 1713 428 --- Signif. codes: 0 ?***? 0.001 ?**? 0.01 ?*? 0.05 ?.? 0.1 ? ? 1 Error: Block:Variety Df Sum Sq Mean Sq F value Pr(>F) nitro 1 11 11.5 0.017 0.898 Variety 2 1809 904.7 1.365 0.304 Residuals 9 5964 662.7 Error: Within Df Sum Sq Mean Sq F value Pr(>F) nitro 3 19766 6589 36.88 4.49e-12 *** nitro:Variety 6 333 55 0.31 0.928 Residuals 44 7860 179 --- Signif. codes: 0 ?***? 0.001 ?**? 0.01 ?*? 0.05 ?.? 0.1 ? ? 1 I suppose this means that the nitro term adjusts the analyses in the lower error strata in the same way as the missing value indicator, but I really can't explain the effect in detail. -pd> > When I do the same with lm() and subsequent anova(), and calculate > F-value for Variety by hand, the estimates are still quite robust. > > Best regards, > > Sam > > > -- > Samuel Knapp > > Lehrstuhl f?r Pflanzenern?hrung > Technische Universit?t M?nchen > (Chair of Plant Nutrition > Technical University of Munich) > > Emil-Ramann-Strasse 2 > D-85354 Freising > > Tel. +49 8161 71-3578 > samuel.knapp at tum.de > www.researchgate.net/profile/Samuel_Knapp > > -- > Samuel Knapp > > Lehrstuhl f?r Pflanzenern?hrung > Technische Universit?t M?nchen > (Chair of Plant Nutrition > Technical University of Munich) > > Emil-Ramann-Strasse 2 > D-85354 Freising > > Tel. +49 8161 71-3578 > samuel.knapp at tum.de > www.researchgate.net/profile/Samuel_Knapp > > > [[alternative HTML version deleted]] > > ______________________________________________ > R-help at r-project.org mailing list -- To UNSUBSCRIBE and more, see > https://stat.ethz.ch/mailman/listinfo/r-help > PLEASE do read the posting guide http://www.R-project.org/posting-guide.html > and provide commented, minimal, self-contained, reproducible code.-- Peter Dalgaard, Professor, Center for Statistics, Copenhagen Business School Solbjerg Plads 3, 2000 Frederiksberg, Denmark Phone: (+45)38153501 Office: A 4.23 Email: pd.mes at cbs.dk Priv: PDalgd at gmail.com