john.maindonald@anu.edu.au
2005-Nov-02 06:44 UTC
[Rd] Bugs/issues with model.tables() (PR#8275)
Based on what follows, the most favourable construction is that the documentation, by failing to say that the function should be used only for completely balanced designs, is deficient. Even for such designs, there is a bug that needs correction. The discussion is lengthy because I think it important to document, at least wrt effects and means, exactly what model.tables() does. My preference is to pension model.tables() off, and to replace it with the function that I describe (but do not, at this point, reproduce) below. Basically, for getting the means or effects, my function replaces the call to proj() with a call to predict.lm(.., type="terms"). The SEs for effects can also be obtained using predict.lm(). For getting SEs of differences of treatment means, the SEs that appear from summary.lm() can be used. Such a function will apply quite generally to models that include only main effects. This is surely a matter for discussion and comment, especially as model.tables() has been around since nearly the beginning of R time. Summary ~~~~~~~ Under "Description:" there is the claim: Computes summary tables for model fits, especially complex 'aov' fits. The warning that appears some way further down is perhaps intended to weaken the force of this: The implementation is incomplete, and only the simpler cases have been tested thoroughly. The implementation is useful only for balanced complete designs. The documentation should say this. In (I), I summarize issues with model.tables() In (II), I give examples, which can be extended to verify points that the output below does not specifically illustrate. I (a) Even for balanced designs, the SEs for the effects are wrong. I give an example below. I (b) model.tables() works from a sequential breakdown of the predicted values. In a model with factors A and B, with A fitted first, model.tables() calculates effects for A, unadjusted for B, then effects for B after adjusting for A. If every combination of A and B occurs equally often (a completely balanced design), the order makes no difference. If the design is not completely balanced, then the order does matter, and the "effects" will not be unbiased estimates of the treatment effects. Nor will their differences be unbiased estimates of the treatment differences. This is true even for balanced incomplete block designs. Means are obtained by adding "effects", calculated as just described, to the overall mean. The only hint on the help page that so-called "effects" and "means" relate to such a sequential breakdown of predicted values is the reference, under "See also", to proj. Under help(proj), it is written: Projection matrices from the default method have orthogonal columns representing the projection of the response onto the column space of the Q matrix from the QR decomposition. For those who know about QR, this immediatly flags a sequential breakdown of the fitted values. I (c) Standard errors for effects seeem to have in mind a formula that (with a correction that will be noted below) applies only in the completely balanced case. I (d) Standard errors for differences of means (SEDs) seem to be correct for completely balanced designs. For BIB designs they are independent of the order of terms, even though the "effects" and "means" are not. They are in general wrong, even for the "means" as given. A correct order independent breakdown of the fitted values can be obtained by replacing, in model.tables(), the call to proj() with a call to predict.lm(.., type="terms"). [This requires a bit more than a simple replacement.] The means and effects are, quite generally, the means and expects that most users will want. Once I have worked over the code with some care, and sorted out the caculation of SEs, I will offer this function for inclusion in R's stats package. Or I am happy to provide the current draft of my function to anyone who wants to look at it. II: Further Details, and Examples: (a) For calculations of SEs for effects, model.tables.aov() calls se.aov(), which uses the formula: s/sqrt(n) [the mean is over n subunits] The formula required, in the completely balanced case, is sqrt((m-1)/m)*s/sqrt(n) formula [an effect is a mean over n subunits, minus the grand mean; the grand mean is a mean over mn units.] Often, m is large enough that the difference is of no consequence. Also, the SEs of effects are commonly of little more than academic interest. But, if given at all, they should be correct. Example "bal1" <- structure(list(y = c(3.1, 2.4, 2.2, 2.3, 1.9, 0.9, 1.8, 1.1), trt = structure(as.integer(c(1, 1, 2, 2, 3, 3, 4, 4)), .Label = c("a", "b", "c", "d"), class = "factor")), .Names = c("y", "trt"), row.names = c("1", "2", "3", "4", "5", "6", "7", "8"), class = "data.frame") > model.tables(bal1.aov, type="effects", se=TRUE)$se trt 0.3526684 > summary.lm(u)$sigma/sqrt(2)*sqrt(3/4) # m=4 [1] 0.3054198 " > unique(predict.lm(bal1.aov, type="terms", se=TRUE)$se) trt 1 0.3054198 II (d) In the interests of brevity (sic!), I will limit attention to means: "bdes" <- structure(list(trt = structure(as.integer(c(1, 2, 1, 3, 1, 4, 2, 3, 2, 4, 3, 4)), .Label = c("a", "b", "c", "d"), class = "factor"), blk = structure(as.integer(c(1, 1, 2, 2, 3, 3, 4, 4, 5, 5, 6, 6)), .Label = c("A", "B", "C", "D", "E", "F"), class = "factor"), y = c(0.8, -1.1, 4.5, 3.3, 4.3, 4.9, 0.6, 3.9, 4.6, 9.4, 3.7, 5.7)), .Names = c("trt", "blk", "y"), row.names = c("1","2", "3", "4", "5", "6", "7", "8", "9", "10", "11", "12"), class = "data.frame") > # Crude block means; these are meaningless > tapply(bdes$y, bdes$blk, mean) A B C D E F -0.15 3.90 4.60 2.25 7.00 4.70 > # Crude treatment means > tapply(bdes$y, bdes$trt, mean) a b c d 3.200000 1.366667 3.633333 6.666667 > ## aov fits > bdes.aov <- aov(y~blk+trt, data=bdes) # Blocks first > bdes_trtFirst.aov <- aov(y~trt+blk, data=bdes) # trt first > model.tables(bdes.aov, type="means")[["table"]][["blk"]] blk A B C D E F -0.15 3.90 4.60 2.25 7.00 4.70 > # Observe that these agree with the above crude block means > model.tables(bdes_trtFirst.aov, type="means")[["table"]][["trt"]] trt a b c d 3.200000 1.366667 3.633333 6.666667 > > ## Treatment means, when blocks are taken first > model.tables(bdes.aov, type="means")[["table"]][["trt"]] trt a b c d 4.133333 2.050000 3.733333 4.950000 > ## Also, note their differences > diff(model.tables(bdes.aov, type="means")[["table"]][["trt"]]) trt b c d -2.083333 1.683333 1.216667 > ## Treatment means, from the "usual" least squares analysis > dummy.coef(bdes.aov)[["(Intercept)"]] (Intercept) 1.4125 > dummy.coef(bdes.aov)[["(Intercept)"]]+mean(dummy.coef(bdes.aov)$blk)+ + dummy.coef(bdes.aov)$trt a b c d 4.341667 1.216667 3.741667 5.566667 > diff(dummy.coef(bdes.aov)$trt) b c d -3.125 2.525 1.825 > diff(model.tables(bdes.aov, type="means")[["table"]][["trt"]])/ + diff(dummy.coef(bdes.aov)$trt) trt b c d 0.6666667 0.6666667 0.6666667 > # This factor is some simple function of the BIB design parameters, > # which I am too lazy or too busy to work out. --please do not edit the information below-- Version: platform = powerpc-apple-darwin7.9.0 arch = powerpc os = darwin7.9.0 system = powerpc, darwin7.9.0 status major = 2 minor = 2.0 year = 2005 month = 10 day = 06 svn rev = 35749 language = R Locale: C Search Path: .GlobalEnv, cuckoohosts, file:~/r/ch2/.RData, file:../.RData, package:methods, package:stats, package:graphics, package:grDevices, package:utils, package:datasets, Autoloads, package:base John Maindonald email: john.maindonald at anu.edu.au phone : +61 2 (6125)3473 fax : +61 2(6125)5549 Centre for Bioinformation Science, Room 1194, John Dedman Mathematical Sciences Building (Building 27) Australian National University, Canberra ACT 0200.