Tal Galili
2009-Mar-11 21:56 UTC
[R] regsubsets() [leaps package] - please share some good examples of use
Hello dear R-help members, I recently became interested in using biglm with leaps, and found myself somewhat confused as to how to use the two together, in different settings. I couldn't find any example codes for the leaps() package (except for in the help file, and the examples there are not as rich as they could be). That is why I turn to you in case you could share some good tips and examples of code on how to use the leaps package (especially the regsubsets command) The problem that drives me to ask this is: how to use the regsubsets() command to immulate a forward model selection procedure on a regressions problem ? I attach below a few direction dear Thomas has already wrote to me on the subject, and any help would be very welcomed: *me:* I feel I am missing a big something here, so please help me here - Let's say we have a dataset with an X matrix of 10 variables, and all we want to perform is forward variable selection with AIC, starting from the minimal model that includes the intercept only, and with the maximum model of all variable and their interaction up to the second order. In that range, we wish to find the best model, based on forward selection. *Thomas:* Use biglm() to fit the model with all main effects and all second order interactions. This model will be the maximum model for selection. The minimum model, by default, is the model with only an intercept, so you don't need to specify anything. If the minimum model is more complicated, the vector force.in specifies which terms are in the minimum model (a logical vector with TRUE for variables in the minimum model and FALSE for variables not in the minimum model). regsubsets() will give you the best model with one variable, the best with two variables, and so on. The object produced by summary() of the regsubsets() has a component $cp that gives Mallows' Cp for each of the best models. This is equivalent to AIC, or you can compute AIC from the residual sum of squares in the $rss component of the object. regsubsets() doesn't actually fit the models, it just works out the residual sum of squares. You need to take the output of regsubsets() and then fit which ever of the best models you want coefficients for. summary(regsubsets.object)$which is a logical matrix indicating which variables are in each of the best models. This may seem unnecessarily complicated, but regsubsets() was designed for situations where you want lots of best models rather than just one, since there are often lots of models that are about equally good. That's the point of the plot() method, where you can look at hundreds of best models from 30 or so variables and see which variables are in most of the good models, and which variables tend to occur together or separately -- for example, if you have two related variables such as systolic blood pressure and diastolic blood pressure do they substitute for each other or do they tend to occur in the same model. Thanks all (and again - thanks Thomas for all your patient answers so far) Tal ---------------------------------------------- My contact information: Tal Galili Phone number: 972-50-3373767 FaceBook: Tal Galili My Blogs: www.talgalili.com www.biostatistics.co.il [[alternative HTML version deleted]]
Tal Galili
2009-Mar-12 12:21 UTC
[R] regsubsets() [leaps package] - please share some good examples of use
Hello dear R-help members, I recently became interested in using biglm with leaps, and found myself somewhat confused as to how to use the two together, in different settings. I couldn't find any example codes for the leaps() package (except for in the help file, and the examples there are not as rich as they could be). That is why I turn to you in case you could share some good tips and examples of code on how to use the leaps package (especially the regsubsets command) The problem that drives me to ask this is: how to use the regsubsets() command to immulate a forward model selection procedure on a regressions problem ? I attach below a few direction dear Thomas has already wrote to me on the subject, and any help would be very welcomed: *me:* I feel I am missing a big something here, so please help me here - Let's say we have a dataset with an X matrix of 10 variables, and all we want to perform is forward variable selection with AIC, starting from the minimal model that includes the intercept only, and with the maximum model of all variable and their interaction up to the second order. In that range, we wish to find the best model, based on forward selection. *Thomas:* Use biglm() to fit the model with all main effects and all second order interactions. This model will be the maximum model for selection. The minimum model, by default, is the model with only an intercept, so you don't need to specify anything. If the minimum model is more complicated, the vector force.in specifies which terms are in the minimum model (a logical vector with TRUE for variables in the minimum model and FALSE for variables not in the minimum model). regsubsets() will give you the best model with one variable, the best with two variables, and so on. The object produced by summary() of the regsubsets() has a component $cp that gives Mallows' Cp for each of the best models. This is equivalent to AIC, or you can compute AIC from the residual sum of squares in the $rss component of the object. regsubsets() doesn't actually fit the models, it just works out the residual sum of squares. You need to take the output of regsubsets() and then fit which ever of the best models you want coefficients for. summary(regsubsets.object)$which is a logical matrix indicating which variables are in each of the best models. This may seem unnecessarily complicated, but regsubsets() was designed for situations where you want lots of best models rather than just one, since there are often lots of models that are about equally good. That's the point of the plot() method, where you can look at hundreds of best models from 30 or so variables and see which variables are in most of the good models, and which variables tend to occur together or separately -- for example, if you have two related variables such as systolic blood pressure and diastolic blood pressure do they substitute for each other or do they tend to occur in the same model. Thanks all (and again - thanks Thomas for all your patient answers so far) Tal p.s: I already sent this e-mail once, but couldn't seem to see it on the list, so I resent it again - sorry if any of you got it twice. ---------------------------------------------- My contact information: Tal Galili Phone number: 972-50-3373767 FaceBook: Tal Galili My Blogs: www.talgalili.com www.biostatistics.co.il [[alternative HTML version deleted]]