Kjetil Kjernsmo
2013-Mar-06 10:46 UTC
[R] Understanding lm-based analysis of fractional factorial experiments
All,

I have just returned to R after a decade of absence, and it is good to see
that R has become such a great success! I'm trying to bring Design of
Experiments into some aspects of software performance evaluation, and to
teach myself that, I picked up "Experiments: Planning, Analysis and
Optimization" by Wu and Hamada. I have tried to reproduce an analysis in the
book using lm, but have to conclude that I don't understand what lm does in
this context, even though I end up at the desired result. I'm currently
using R 2.15.2 on a recent Fedora system, but I get the same result on
Debian Wheezy and Debian Squeeze. I think the discussion below can be
followed without having the book at hand, though.

I'm working with tables 5.2 and 5.5 in the above-mentioned book. Table 5.2
contains data from the "Leaf spring experiment". The dataset is also in this
zip file:

ftp://ftp.wiley.com/public/sci_tech_med/experiments-planning/data%20sets.zip

I've learned from the book that the effects can be found by fitting a linear
model and doubling the coefficients. So, I do

> leaf <- read.table("/ifi/bifrost/a03/kjekje/fag/experimental-planning/book-datasets/LeafSpring table 5.2.dat",
+                    col.names=c("B", "C", "D", "E", "Q", paste("r", 1:3, sep=""),
+                                "yavg", "ssq", "lnssq"))
> leaf.lm <- lm(yavg ~ B * C * D * E * Q, data=leaf)
> leaf.lm

Call:
lm(formula = yavg ~ B * C * D * E * Q, data = leaf)

Coefficients:
   (Intercept)              B+              C+              D+              E+
       7.54000         0.07003         0.32333        -0.09668         0.07668
            Q+           B+:C+           B+:D+           C+:D+           B+:E+
      -0.33670         0.01335         0.11995         0.02335              NA
         C+:E+           D+:E+           B+:Q+           C+:Q+           D+:Q+
            NA              NA         0.22915        -0.25745         0.28255
         E+:Q+        B+:C+:D+        B+:C+:E+        B+:D+:E+        C+:D+:E+
       0.05415              NA              NA              NA              NA
      B+:C+:Q+        B+:D+:Q+        C+:D+:Q+        B+:E+:Q+        C+:E+:Q+
       0.04160        -0.16160        -0.18840              NA              NA
      D+:E+:Q+     B+:C+:D+:E+     B+:C+:D+:Q+     B+:C+:E+:Q+     B+:D+:E+:Q+
            NA              NA              NA              NA              NA
   C+:D+:E+:Q+  B+:C+:D+:E+:Q+
            NA              NA

(It seems there is little I can do about the line breaks here, sorry.)

However, the book (table 5.5) has 0.221 for the main effect of B and 0.176
for the main effect of C, and the above is neither this nor half of it.
Now, I can reproduce what's in the book with

> lm(yavg ~ B, data=leaf)

Call:
lm(formula = yavg ~ B, data = leaf)

Coefficients:
(Intercept)           B+
     7.5254       0.2213

> lm(yavg ~ C, data=leaf)

Call:
lm(formula = yavg ~ C, data = leaf)

Coefficients:
(Intercept)           C+
     7.5479       0.1763

assuming lm does in fact report the doubled coefficient in this case. But
here the intercept varies between the two fits, which doesn't seem correct,
nor can I find the interactions as easily this way.

Now, I try the effects() function, and get familiar numbers:

> effects(leaf.lm)
(Intercept)          B+          C+          D+          E+          Q+
  -30.54415    -0.44250     0.35250    -0.05750    -0.20750    -0.51920
      B+:C+       B+:D+       C+:D+       B+:Q+       C+:Q+       D+:Q+
   -0.03415    -0.03915     0.07085    -0.16915     0.33085    -0.10755
      E+:Q+    B+:C+:Q+    B+:D+:Q+    C+:D+:Q+
    0.05415    -0.02080     0.08080    -0.09420

and indeed, I have verified that effects(leaf.lm)/2 gives me the expected
result.

So, I have found the correct answer, but I don't understand why. I have read
the documentation for effects() and looked through the relevant chapter in
"Statistical Models in S", but all I got from that was a hint in the phrase
"the effects are the uncorrelated single-degree-of-freedom" values, which
are evidently somewhat different from the coefficients. However, I can't
make out from the book (Wu & Hamada) why the coefficients should be any
different from the effects; to the contrary, it is quite clear from equation
(5.8) in the book that the coefficients they use are effects(leaf.lm)/4.
So, there are at least two points of confusion here. One is how coef()
differs from effects() in the case of fractional factorial experiments; the
other is the factor of 1/4 between the coefficients used by Wu & Hamada and
the values returned by effects(), when, from the theory I've read, I would
expect a factor of 2.

Best regards,

Kjetil
--
Kjetil Kjernsmo
PhD Research Fellow, University of Oslo, Norway
Semantic Web / SPARQL Query Federation
kjekje at ifi.uio.no    http://www.kjetil.kjernsmo.net/
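(For readers following without the book: the "main effect" tabulated in
table 5.5 is simply the difference between the average responses at the two
levels of a factor. A minimal sketch of that computation, assuming the leaf
data frame has been read in as in the post above:)

# Book-style main effects, computed directly as differences of level means
# (a sketch; assumes `leaf` as read in above)
with(leaf, mean(yavg[B == "+"]) - mean(yavg[B == "-"]))  # ~0.221, table 5.5's effect of B
with(leaf, mean(yavg[C == "+"]) - mean(yavg[C == "-"]))  # ~0.176, table 5.5's effect of C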
Ben Bolker
2013-Mar-06 13:50 UTC
[R] Understanding lm-based analysis of fractional factorial experiments
Kjetil Kjernsmo <kjekje <at> ifi.uio.no> writes:

> All,
>
> I have just returned to R after a decade of absence, and it is good to
> see that R has become such a great success! I'm trying to bring Design
> of Experiments into some aspects of software performance evaluation, and
> to teach myself that, I picked up "Experiments: Planning, Analysis and
> Optimization" by Wu and Hamada. I try to reproduce an analysis in the
> book using lm, but have to conclude I don't understand what lm does in
> this context, even though I end up at the desired result. [...]

Just a quick thought (sorry for removing context): what happens if you use
sum-to-zero contrasts throughout, i.e.

options(contrasts=c("contr.sum", "contr.poly"))

... ?
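(A minimal sketch of Ben's suggestion, assuming the same leaf data frame and
model formula as in the original post:)

# Switch factor coding to sum-to-zero contrasts, then refit
options(contrasts = c("contr.sum", "contr.poly"))
leaf.lm.sum <- lm(yavg ~ B * C * D * E * Q, data = leaf)
coef(leaf.lm.sum)  # main-effect coefficients are now half the
                   # level-to-level differences, with flipped signs
                   # (the follow-ups below explain why)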
Ista Zahn
2013-Mar-06 14:17 UTC
[R] Understanding lm-based analysis of fractional factorial experiments
Hi,

On Wed, Mar 6, 2013 at 5:46 AM, Kjetil Kjernsmo <kjekje at ifi.uio.no> wrote:
> All,
>
> [...] I think the discussion below can be followed without having the
> book at hand though.

I have my doubts...

> I've learned from the book that the effects can be found using a linear
> model and double the coefficients. So, I do
> [...]
>> leaf.lm <- lm(yavg ~ B * C * D * E * Q, data=leaf)

That is complete nonsense:

> dim(leaf)
[1] 16 11
> length(coef(leaf.lm))
[1] 32

So you are trying to estimate 32 coefficients from 16 data points. That is
never going to work.

> However, the book (table 5.5), has 0.221 for the main effect of B and
> 0.176, and the above is neither this, nor half of it. Now, I can
> reproduce what's in the book with
>> lm(yavg ~ B, data=leaf)
> [...]
>> lm(yavg ~ C, data=leaf)
> [...]
>
> Assuming lm does in fact double the coefficient in this case,

I have no idea what this means.

> but here the intercept varies, which doesn't seem correct,

You mean that the intercept for lm(yavg ~ B, data=leaf) differs from the
intercept for lm(yavg ~ C, data=leaf)? If so, that is expected. The
intercept is the expected value of yavg when all predictors are zero. The
expected value for B = 0 does not have to be the same as the expected value
for C = 0.

> nor can I as trivially find the interactions the same way.

What way?

> Now, I try the effects() function, and get familiar numbers:
> [...]
> and indeed, I have verified that effects(leaf.lm)/2 gives me the expected
> result.
>
> So, there are at least two points of confusion here, one is how coef()
> differs from effects() in the case of fractional factorial experiments,
> and the other is the factor 1/4 between the coefficients used by Wu &
> Hamada and the values returned by effects() as I would think from theory
> I've read that it should be a factor 2.

I don't know the answers to these questions (I don't really understand the
effects function). So I'm afraid all I've done is add more questions, i.e.,
why in the world are we estimating 32 coefficients from 16 points?

Best,
Ista
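(Ista's count can be checked directly; a sketch, assuming the saturated fit
leaf.lm from the original post. lm does not fail outright: it reports the
aliased terms as NA.)

# A saturated five-factor model asks for 2^5 = 32 coefficients,
# but this design has only 16 runs, so half of them are aliased
length(coef(leaf.lm))       # 32 coefficients requested
sum(is.na(coef(leaf.lm)))   # 16 of them come back NA (aliased)
sum(!is.na(coef(leaf.lm)))  # leaving exactly 16 estimable, one per run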
Kjetil Kjernsmo
2013-Mar-06 14:48 UTC
[R] Understanding lm-based analysis of fractional factorial experiments
On 03/06/2013 02:50 PM, Ben Bolker wrote:
> Just a quick thought (sorry for removing context): what happens if
> you use sum-to-zero contrasts throughout, i.e.
> options(contrasts=c("contr.sum", "contr.poly")) ... ?

That works (except for the sign)! What would this mean?

Kjetil
Peter Claussen
2013-Mar-06 15:18 UTC
[R] Understanding lm-based analysis of fractional factorial experiments
On Mar 6, 2013, at 4:46 AM, Kjetil Kjernsmo <kjekje at ifi.uio.no> wrote:
> [...]
> I've learned from the book that the effects can be found using a linear
> model and double the coefficients. So, I do
> [...]
>> leaf.lm <- lm(yavg ~ B * C * D * E * Q, data=leaf)
>> leaf.lm
> [...]

I'll ignore the rest of your questions, in the hope that this will answer
them sufficiently. You probably want a simple linear model, specified in R
using "+" instead of "*":

> leaf.lm <- lm(yavg ~ B + C + D + E + Q, data=leaf)
> leaf.lm

Call:
lm(formula = yavg ~ B + C + D + E + Q, data = leaf)

Coefficients:
(Intercept)           B+           C+           D+           E+           Q+
    7.50084      0.22125      0.17625      0.02875      0.10375     -0.25960

Does this give you the numbers you expect?

Peter
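(Because this design is orthogonal, the treatment-contrast coefficients in
Peter's additive fit coincide with the plain "+" minus "-" mean differences,
i.e. the main effects as tabulated in the book. A quick sketch to confirm:)

options(contrasts = c("contr.treatment", "contr.poly"))  # R's default coding
coef(lm(yavg ~ B + C + D + E + Q, data = leaf))["B+"]    # 0.22125
with(leaf, mean(yavg[B == "+"]) - mean(yavg[B == "-"]))  # the same value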
Kjetil Kjernsmo
2013-Mar-06 15:23 UTC
[R] Understanding lm-based analysis of fractional factorial experiments
On 03/06/2013 04:18 PM, Peter Claussen wrote:
> I'll ignore the rest of your questions, in the hope that this will answer
> them sufficiently.

OK!

> You probably want a simple linear model, specified in R using "+" instead
> of "*":
>
>> leaf.lm <- lm(yavg ~ B + C + D + E + Q, data=leaf)
>> leaf.lm
>
> Call:
> lm(formula = yavg ~ B + C + D + E + Q, data = leaf)
>
> Coefficients:
> (Intercept)           B+           C+           D+           E+           Q+
>     7.50084      0.22125      0.17625      0.02875      0.10375     -0.25960
>
> Does this give you the numbers you expect?

Well, it partly gives the numbers I expect, but I want the interactions as
well, so it is only a partial answer.

Best,

Kjetil
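(One way to bring in the two-factor interactions without the saturated model
is the ^2 formula operator; a sketch, with S Ellison's reply below
developing the same idea further:)

# Main effects plus all two-factor interactions; terms that are
# aliased in this fractional design simply come back as NA
leaf.lm2 <- lm(yavg ~ (B + C + D + E + Q)^2, data = leaf)
coef(leaf.lm2)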
Kjetil Kjernsmo
2013-Mar-07 13:45 UTC
[R] Understanding lm-based analysis of fractional factorial experiments
On Wednesday 6. March 2013 14.50.23 Ben Bolker wrote:
> Just a quick thought (sorry for removing context): what happens if
> you use sum-to-zero contrasts throughout, i.e.
> options(contrasts=c("contr.sum", "contr.poly")) ... ?

Ah, I've got it now; this pointed me in the right direction. Thanks a lot!

For future reference, the default was "contr.treatment", which implies

> contrasts(leaf$B)
  +
- 0
+ 1

Using contr.sum, we get

> contrasts(leaf$B)
  [,1]
-    1
+   -1

which is orthogonal, and also explains the switched sign.

Kjetil
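(Putting the pieces together, a sketch verifying that under sum-to-zero
contrasts twice the coefficient recovers the book's effect, with the sign
flipped because contr.sum codes the first level, "-", as +1:)

options(contrasts = c("contr.sum", "contr.poly"))
leaf.lm.sum <- lm(yavg ~ (B + C + D + E + Q)^2, data = leaf)
-2 * coef(leaf.lm.sum)["B1"]  # ~0.221, the book's main effect of B;
                              # the minus undoes contr.sum coding "-" as +1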
S Ellison
2013-Mar-07 14:32 UTC
[R] Understanding lm-based analysis of fractional factorial experiments
> So, there are at least two points of confusion here, one is
> how coef() differs from effects() in the case of fractional
> factorial experiments, and the other is the factor 1/4
> between the coefficients used by Wu & Hamada and the values
> returned by effects() as I would think from theory I've read
> that it should be a factor 2.

Some observations to throw into the mix here.

First, the model

> leaf.lm <- lm(yavg ~ B * C * D * E * Q, data=leaf)

is indeed unnecessarily complicated. You can request all second-order
interactions (accepting that some will be NA) with

> leaf.lm <- lm(yavg ~ (B + C + D + E + Q)^2, data=leaf)

or of course you can specify only the ones you know will be present:

> leaf.lm <- lm(yavg ~ B + C + D + E + Q + B:C + B:D + B:E + B:Q + C:Q + D:Q + E:Q, data=leaf)

I cheated a bit:

> leaf.avg <- leaf[, c(1:5, 9)]
> leaf.lm.avg <- lm(yavg ~ (.)^2, data=leaf.avg)  # less typing, same model (with NAs for aliased interactions)

Now let's look at those effects. First, look at the model matrix for leaf.lm:

> model.matrix(leaf.lm)

This shows you what your yavg is actually being regressed against: a matrix
of dummy codings for your (character) levels "+" and "-". Notice the values
for B (under "B+") and the intercept. The first row (which has "-" for B in
the leaf data set) has a zero entry; the second has a 1. So in this matrix,
1 corresponds to "+" and 0 to "-", which you can read as "not different from
the intercept". This arises from the way the contrasts have been chosen;
these are 'treatment contrasts', R's default for factors. Pretty much the
same goes for all the other factors.

This structure corresponds to measuring the 'alternate' level of every
factor (always "+", because the first level is "-") against the intercept.
What the intercept represents is not immediately obvious if you're used to
the mean of the study as the intercept; it estimates the response when all
factors are in their 'zero' state. That's really useful if you want to test
a series of alternate treatments against a control. It's also R's default.
But it's not what most industrial experiments based on two-level fractional
factorials want; they usually want differences between levels. That is why
some folk would call R's defaults 'odd defaults'.

To get something closer to the usual fractional factorial usage, you need to
change those dummy codings from (0, 1) for ("-", "+") to something like
(+1, -1). You can do that using contrasts. To change the contrasts to
something most industrial experimenters would expect, we use sum-to-zero
contrasts. So we do this instead (using my leaf.avg data frame above):

> options(contrasts=c("contr.sum", "contr.poly"))
> (leaf.lm.sum <- lm(yavg ~ (.)^2, data=leaf.avg))

Now we can see that the coefficient for B is indeed half the difference
between levels, as you might have expected. And the intercept is now equal
to the mean of the data, as you might expect from much of the industrial
experimental design literature.

So: why half, and why is it negative? Look at the new model matrix:

> model.matrix(leaf.lm.sum)

This time, B is full of +1 and -1 instead of 0 and 1. So first, B+ and B-
are both different (in equal and opposite directions) from the intercept.
Second, the range allocated to B is (+1) - (-1) = 2, so the change in yavg
per _unit_ change in B (the lm coefficient for B) is half the difference
between the levels of B.

Finally, look at the particular allocation of model matrix values for B. The
first row has +1, the second row -1. That's because contr.sum has allocated
+1 to the first level of the factor B, which (if you look at levels(leaf$B))
is "-", because factor() uses a default alphabetical order. (You could
change that for all your factors if you wanted, for example with
leaf$B <- factor(leaf$B, levels=c("+","-")), and you'd then have +1 for "+"
and -1 for "-".) But in the meantime, lm has been told that "-" corresponds
to an _increase_ in the dummy coding for B. Since the mean for B == "-" is
lower than the intercept, you get a negative coefficient.

effects() is doing something quite different; the help page tells you what
it is doing mathematically. It has an indirect physical interpretation: for
simple designs like this, the effects() output for the coefficients comes
out as the coefficient multiplied by sqrt(n), where n is the number of
observations in the experiment. But it is not giving you the effect of a
factor in the sense of the effect of a change in factor level on the mean
response.

Hope some of that is useful to you.

S Ellison
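(A sketch of the sqrt(n) relationship described above, under sum-to-zero
contrasts; the sign of an individual entry can differ, since effects() is
computed from a QR decomposition of the model matrix:)

options(contrasts = c("contr.sum", "contr.poly"))
fit <- lm(yavg ~ (B + C + D + E + Q)^2, data = leaf)
effects(fit)["B1"]                  # the QR-based 'effect' for B
sqrt(nrow(leaf)) * coef(fit)["B1"]  # same magnitude: sqrt(16) = 4 times the coefficient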