Jenifer Larson-Hall
2006-Nov-07 03:05 UTC
[R] Comparing models in multiple regression and hierarchical linear regression
I don't know if this question properly belongs on this list, but I'll ask it here because I've been using R to run linear regression models, and it is only in using R (after switching from SPSS) that I have discovered the process of fitting a linear model. However, after reading Crawley (2002), Fox (2002), Verzani (2004), Dalgaard (2002) and of course searching the R-help archives, I cannot find an answer to my question.

I have 5 explanatory variables (NR, NS, PA, KINDERWR, WM) and one response variable (G1L1WR). A simple main-effects model finds that only PA is statistically significant, and an anova comparison between a 5-variable main-effects model and a 1-variable main-effects model finds no difference between the models. So it is possible to simplify the model to just G1L1WR ~ PA. This leaves me with a residual standard error of 0.3026 on 35 degrees of freedom and an adjusted R^2 of 0.552.

I also decided, following Crawley's (2002) advice, to create a maximal model, G1L1WR ~ NR*NS*PA*KINDERWR*WM. This full model is not a good fit, but running stepAIC on it revealed the model with the maximal fit:

maximal.fit <- lm(G1L1WR ~ NR + KINDERWR + NS + WM + PA + NR:KINDERWR + NR:NS +
    KINDERWR:NS + NR:WM + KINDERWR:WM + NS:WM + NR:PA + KINDERWR:PA + NS:PA +
    WM:PA + NR:KINDERWR:NS + NR:KINDERWR:WM + NR:NS:WM + KINDERWR:NS:WM +
    NR:NS:PA + KINDERWR:NS:PA + KINDERWR:WM:PA + NR:KINDERWR:NS:WM,
    data = lafrance.NoNA)

All of the terms of this model have statistically significant t-tests, the residual standard error has gone down to 0.2102, and the adjusted R^2 has increased to 0.7839. An anova shows a clear difference between the simplified model and the maximal-fit model.

My question is: should I really pick the maximal fit over the simple model when it is so much harder to understand? I guess there's no easy answer to that, but if that's so, then my question is: would there be anything wrong with saying that sometimes you might value parsimony and ease of understanding over best fit? Because I don't really know what the maximal-fit model buys you. It seems unintelligible to me. All of the terms are involved in interactions to some extent, but there are 4-way, 3-way, and 2-way interactions, and I'm not sure how to understand it. A nice tree model showed that at higher levels of PA, KINDERWR and NS affected scores. That I can understand, but it is not reflected in this model.

An auxiliary question, probably easier to answer, is how I could do hierarchical linear regression. The authors knew that PA would be the largest contributor to the response variable because of previous research, and their research question was whether PA would contribute anything AFTER the other 4 variables had already eaten their piece of the response-variable pie. I know how to do a hierarchical regression in SPSS, and want to show in parallel how to do this in R. I did search the R-help archives and didn't find anything that would just plain tell me how to do hierarchical linear regression.

Thanks in advance for any help.

Dr. Jenifer Larson-Hall
Assistant Professor of Linguistics
University of North Texas
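For readers who want to reproduce the comparisons described above, here is a minimal sketch. The data frame name lafrance.NoNA and the variable names are taken from the post; the object names and the stepAIC settings are assumptions.

library(MASS)   # for stepAIC

full.main <- lm(G1L1WR ~ NR + NS + PA + KINDERWR + WM, data = lafrance.NoNA)
pa.only   <- lm(G1L1WR ~ PA, data = lafrance.NoNA)
anova(pa.only, full.main)   # F test: do the other four main effects add anything?
summary(pa.only)            # residual standard error and adjusted R^2 of the simple model

maximal  <- lm(G1L1WR ~ NR * NS * PA * KINDERWR * WM, data = lafrance.NoNA)
step.fit <- stepAIC(maximal, trace = FALSE)   # AIC-guided search over the interaction terms
anova(pa.only, step.fit)    # compare the simple model with the stepAIC selection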
Spencer Graves
2006-Nov-08 16:54 UTC
[R] Comparing models in multiple regression and hierarchical linear regression
The question you ask, about the interactions in the model not making sense, relates, I believe, to a multiple comparisons issue that is not adequately addressed by the stepAIC analysis you did. To understand this, note first that you've got something close to 2^(2^5) possible models: with 5 main effects, there are 2^5 = 32 possible subsets = "interaction" terms. The function stepAIC tries to select from among models representing all possible subsets of these 2^5 "interaction" terms.

Just for clarity on this point, suppose we had 2 main effects, A and B, not 5 as you have. Then there are 2^2 = 4 possible "interaction" terms: 1, A, B, and A:B. stepAIC would then want to select the best model from among the 2^4 = 16 models formed from all possible combinations of these 4 terms. With 5 main effects, this set of possible models balloons to 2^32 = 4.3e9. Some of them don't make sense by themselves, e.g., A:B without A and B. However, I don't think that will reduce this 4.3e9 by more than an order of magnitude or so -- and maybe only a few percent. If we ignore this reduction, then the Bonferroni correction suggests we multiply all your p values by 4.3e9. If you still have any that are less than, say, 0.05, then you could believe those. Otherwise, I'd be skeptical.

A better way of handling this, I believe, is Bayesian Model Averaging. The R package 'BMA' is supposed to address that, but I don't know if it will help you.

Other questions: What are your 5 explanatory variables, NR, NS, PA, KINDERWR, and WM? In particular, are they numbers representing at least an ordinal scale, or are they categories? If numbers, you've got possibilities for parabolic terms that you haven't even considered. If categories, how many levels are there? If some of them have large numbers of levels, you may want to consider those factors as random effects. In that case, you need something to do 'mixed-effects' modeling; for that, people use the 'nlme' or 'lme4' packages.

Have you tried to find a statistician with whom you could consult or even possibly collaborate on this?

Hope this helps.
Spencer Graves
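As a concrete illustration of the two suggestions above, the model count and the Bonferroni arithmetic can be checked directly, and the 'BMA' package's bicreg() function will average over subsets of a supplied set of predictors. The data frame and variable names come from the original post; everything else is an assumption, and interaction terms would have to be supplied as explicit columns.

n.models <- 2^(2^5)          # about 4.3e9 candidate models, as noted above
0.05 / n.models              # Bonferroni-style threshold for a nominal 0.05 level

library(BMA)
X <- lafrance.NoNA[, c("NR", "NS", "PA", "KINDERWR", "WM")]
y <- lafrance.NoNA$G1L1WR
bma.fit <- bicreg(X, y)      # posterior-probability-weighted averaging over candidate models
summary(bma.fit)             # posterior inclusion probabilities and model weights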
Andrew Robinson
2006-Nov-08 22:43 UTC
[R] Comparing models in multiple regression and hierarchical linear regression
Hello Jenifer,

Your question reflects some very interesting statistical problems surrounding the effects of subset selection upon estimation and inference. There does not seem to be a right answer, as far as I am aware, so I will offer an opinion. I'd welcome further discussion. I don't have a copy of Crawley handy, so I can't speak for his recommendations.

Firstly, I would check: are your regression assumptions satisfied for all the models under active consideration? If the regression assumptions aren't satisfied, then the tests that you might perform - even those that are used to compare between the models - can be compromised.

Then, it really depends upon the use to which the model will be put. I think that the motivations of modellers can be placed on a spectrum, ranging from pure prediction at one end to pure estimation (interpretation) at the other. The modelling decisions that you make ought to depend on your motivations. I infer from your post that your goal is at least partially to understand the model.

Some authoritative statisticians claim that trying to interpret a main effect when you have significant interactions that include that effect is dangerous. This is because knowledge of the interactions will always modify your interpretation of the main effects, even if only slightly. When you try to interpret main effects in the presence of interactions, without acknowledging those interactions, you're implicitly averaging across the interaction parameters. Whether or not this is meaningful, and therefore reasonable, depends on the underlying representativeness of the data, which in turn depends on the purpose of the model and the sample design (if there is one!).

For example, if I'm interested in predicting tree height as a function of diameter, I know that tree species (and age) will interact with diameter, because different species and ages of trees have different shapes. However, I will still happily assert that height increases with diameter, on average, and might even report and interpret the main-effect (diameter) coefficient. Even though I know that such a report is averaging across the diameters and species in my sample, I'm comfortable that the coefficient holds meaning for the reader. I think that it would be pedantic to assert otherwise.

On the other hand, at least one _other_ authoritative statistician I know claims that he will never bother to test a term that he can't interpret - he usually draws the line at three-way interactions. So, opinions are divided among the authorities.

A part of the problem is reliance on statistical significance as a metric for model quality. If your sample size is large, then it's possible to detect signals in the data that don't have much practical import. But it's clear from the reduction in residual standard error that you report that there is some predictive heft somewhere in the higher interactions. It may well be that only one or two of the curly interaction terms are really important. If that is the case, it would be important to find them. To do that, you might try fiddling with the stepAIC arguments to make it less sensitive, and comparing the rmse of the models you get with the models that you have. I would also be tempted to do this from the point of view of adjusting for the many tests that stepAIC performs.

So, if you discover that a simpler model explains about as much of the underlying variation, then I think that life is easier. If no such simpler model exists, then interpretation is a problem.
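One way to make stepAIC less sensitive, as suggested just above, is to raise the per-parameter penalty through its k argument (k = 2 is the AIC default; k = log(n) gives a BIC-like penalty). A minimal sketch, with model and data names following the thread and everything else assumed:

library(MASS)

maximal <- lm(G1L1WR ~ NR * NS * PA * KINDERWR * WM, data = lafrance.NoNA)
n       <- nrow(lafrance.NoNA)
fit.bic <- stepAIC(maximal, k = log(n), trace = FALSE)   # heavier penalty, so fewer terms survive
summary(fit.bic)$sigma   # residual standard error, to compare with the 0.3026 and 0.2102 reported earlier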
If you really can't face interpreting the complex model, then I would do the following:

1) report and interpret the simpler model, with the caveats noted above (about significant interaction terms) carefully outlined;
2) report the relevant fit statistics of the best stepAIC model;
3) consider reporting the fit statistics of the best stepAIC models constrained by order of interaction, e.g., the best possible model constrained to 2-way, 3-way, etc. interactions.

(The last step provides the reader with a sense of the complexity of the hierarchical structure of the model.)

On the question of hierarchical linear regression: it sounds to me like what you want to do is a whole-model test of two models, one comprising the four variables and the other comprising the five variables. If I'm correct, then anova(pie.4, pie.5) will work. You will have to think about the ingredients of the pie models - should they include only main effects, or all interactions, or all statistically significant interactions (which I think has some bad implications for the frequentist properties of the test; someone please correct me if I'm wrong)? You might also consider using an added-variable plot to summarize the utility of your fifth variable.

Cheers,

Andrew
--
Andrew Robinson
Department of Mathematics and Statistics    Tel: +61-3-8344-9763
University of Melbourne, VIC 3010 Australia    Fax: +61-3-8344-4599
http://www.ms.unimelb.edu.au/~andrewpr
http://blogs.mbs.edu/fishing-in-the-bay/
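A minimal sketch of the anova(pie.4, pie.5) suggestion above, i.e. the SPSS-style hierarchical (sequential) regression asked about in the original post. The pie.4 and pie.5 names are the placeholders used above, the data frame and variable names come from the thread, and the added-variable plot assumes the 'car' package's avPlots() function.

pie.4 <- lm(G1L1WR ~ NR + NS + KINDERWR + WM, data = lafrance.NoNA)   # step 1: the four other predictors
pie.5 <- update(pie.4, . ~ . + PA)                                    # step 2: add PA
anova(pie.4, pie.5)          # F test for the increment due to PA, given the other four
summary(pie.5)$r.squared - summary(pie.4)$r.squared   # the "R-squared change" that SPSS reports

library(car)
avPlots(pie.5, terms = ~ PA)   # added-variable (partial regression) plot for PA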