Hi,

I have been looking for a reference that explains, at a deeper level, how R implements the aov command -- more specifically, how R reads the formula and computes the sums of squares. I am not interested in the difference between Type 1, 2, and 3 sums of squares. I am more interested in how R computes ~x1:x2:x3 or ~A:x1, with emphasis on the sequential nature of the computation, and in models even more complicated than this.

Yours sincerely,
Justin
I am not aware of detailed documentation of this beyond the actual source code. However, the principles are fairly straightforward, except that the rules for constructing the design matrix from the model formula can be a bit arcane at times.

The two main tools are the design matrix constructor (model.matrix) and a Gram-Schmidt type orthogonalization of its columns (the latter is called a QR decomposition in R, which it is, but there are several algorithms for QR, and the linear model code depends on the QR algorithm being based on orthogonalization -- so LINPACK works and LAPACK doesn't).

An orthogonalized model matrix generates a decomposition of the model space into orthogonal subspaces corresponding to the terms of the model. Projections onto each of the subspaces are easily worked out. E.g., for a two-way analysis (Y ~ row + col + row:col) you can decompose the model effect into a row effect, a column effect, and an interaction effect. This allows straightforward calculation of the sums of squares of the ANOVA table. As you are probably well aware, if the design is unbalanced, the results will depend on the order of terms -- col + row + col:row gives a different result.

What aov() does is that it first decomposes the observations according to the Error() term, forming the error strata, and then fits the systematic part of the model to each stratum in turn. In the nice cases, each term of the model will be estimable in exactly one stratum, and part of the aov() logic is to detect and remove unestimable terms. E.g., if you have a balanced two-way layout, say individual x treatment, and the variable gender is a subfactor of individual, then with Y ~ gender * treatment + Error(individual/treatment) the gender effect is estimated in the individual stratum, whereas treatment and gender:treatment are estimated in the individual:treatment stratum.

It should be noted that it is very hard to interpret the results of aov() unless the Error() part of the model corresponds to a balanced experimental design. Or, put more sharply: the model implied by the decomposition into error strata becomes nonsensical otherwise. If you do have a balanced design, the error strata reduce to simple combinations of means and observations, so the aov() algorithm is quite inefficient, but to my knowledge nobody has bothered to try to do better.
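A minimal sketch of both points, using the built-in warpbreaks and npk data sets and assuming the default treatment contrasts:

  ## Sequential (Type I) SS from the orthogonalized design matrix
  X <- model.matrix(breaks ~ wool + tension, data = warpbreaks)
  y <- warpbreaks$breaks
  qx <- qr(X)                   # orthogonalize the columns of X
  eff <- qr.qty(qx, y)          # t(Q) %*% y, the "effects" vector
  asgn <- attr(X, "assign")     # maps design matrix columns to model terms
  tapply(eff[seq_along(asgn)]^2, asgn, sum)   # SS per term (0 = intercept)
  anova(lm(breaks ~ wool + tension, data = warpbreaks))   # same SS

  ## Error strata: in the classical npk experiment the N:P:K interaction
  ## is confounded with blocks, so aov() estimates it in the block stratum
  ## and the remaining terms in the Within stratum
  summary(aov(yield ~ N * P * K + Error(block), data = npk))

The effects vector is the same thing lm() stores in fit$effects, and anova() builds its sequential table by summing its squares term by term.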
-pd

--
Peter Dalgaard, Professor,
Center for Statistics, Copenhagen Business School
Hi Peter,

Thank you for your good answer. I am sorry for the late reply. You wrote:

> An orthogonalized model matrix generates a decomposition of the model space into orthogonal subspaces corresponding to the terms of the model. Projections onto each of the subspaces are easily worked out.

It may be a stupid question, but how are the projections onto each subspace worked out, and how do the sums of squares follow? Does it matter that certain parameters of the model are not estimable? R appears to just give sums of squares despite some of the parameters being non-estimable.

Thank you

Yours sincerely,
Justin
In general, some assembly is required; this stuff doesn't quite fit in a one-page email. You may want to start playing around with model.matrix() and qr() to see the light. One tricky bit is how much you know already -- there is some risk of my having to rewrite a linear algebra textbook...

I'll assume that you know the betahat = (X'X)^{-1} X'y formula and the corresponding projection yhat = X (X'X)^{-1} X'y. Now assume that X is orthogonal; then X'X is the identity matrix, so the whole thing becomes X X'y = sum_i x_i (x_i'y), and the (uncorrected) sum of squares for regression is sum_i (x_i'y)^2.

If you have an arbitrary design matrix X and orthogonalize it successively, then you have, say, Xtilde with the property that Xtilde is orthogonal and that, for any k, the first k columns of Xtilde span the same subspace as the first k columns of X. This means that the projections onto the first k columns of X, and the corresponding regression SS, might as well be based on Xtilde and the simpler formulas for the orthogonal case. This also goes for differences between projections onto the first k1 and k2 columns.

In connection with stratified models, it happens that some terms of the systematic part are not estimable in all strata; e.g., in a paired design on individual subjects, you cannot estimate a gender effect from the within-subject stratum. That stratum looks at differences within the same subject, so a gender main effect cancels out. Technically, what happens is that the design matrix for the systematic model is projected (columnwise) onto each stratum, and one or more of the resulting matrices may be singular; this is detected numerically, and the corresponding terms are removed from the model. The difficulty here is to detect the aliasing reliably and get the DF right.
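Again a minimal sketch, continuing with warpbreaks (under treatment contrasts the model below has six columns: intercept, one for wool, two for tension, two for the interaction):

  ## The projection onto the first k columns of X equals the projection
  ## onto the first k columns of its orthogonalization Q
  X <- model.matrix(breaks ~ wool * tension, data = warpbreaks)
  y <- warpbreaks$breaks
  Q <- qr.Q(qr(X))              # orthonormal; successive spans match X's
  k <- 4                        # intercept + wool + tension
  Xk <- X[, 1:k]; Qk <- Q[, 1:k]
  h1 <- Xk %*% solve(crossprod(Xk), crossprod(Xk, y))  # X(X'X)^{-1}X'y
  h2 <- Qk %*% crossprod(Qk, y)                        # Q Q'y, since Q'Q = I
  all.equal(c(h1), c(h2))       # TRUE
  ## ...and the (uncorrected) regression SS is just sum((x_i'y)^2)
  sum(crossprod(Qk, y)^2)
  sum(h1^2)                     # same number

Subtracting the SS for the first k1 columns from that for the first k2 > k1 columns then gives the sequential SS for the terms in between, which is where the ANOVA table entries come from.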
-pd