mijohnso at eos.ubc.ca
2013-Aug-07 16:58 UTC
[R] MOB (party package) Question - Variable Selection
Hi. I am a grad student and I'm currently using the MOB function in the R party package and I had a question. I am working on an environmental problem with about 100 predictors. I am having trouble determining which predictors to use for regression and which for partitioning. Is there any sort of method to determine this? Does it cause problems if a variable is used for both regression and partitioning? I attempted to pre-screen the variables using stepwise linear regression, and I used the selected variables for regression and all others for partitioning. However, this led to the model only having one node. Any suggestions would be very much appreciated, thanks.
Michael:

> Hi. I am a grad student and I'm currently using the MOB function in the
> R party package and I had a question. I am working on an environmental
> problem with about 100 predictors. I am having trouble determining which
> predictors to use for regression and which for partitioning. Is there
> any sort of method to determine this?

That depends a little bit on what exactly you are trying to achieve. When we developed MOB, we had the following situation in mind:

- You have some sort of data for which you know from the literature that a certain type of model works well. For example, log(y) ~ log(x1) + log(x2) or something like that.
- But you also have data on a bunch of other variables, and you don't know yet how they should enter the model. Often these are categorical variables or numerical variables that are not part of the standard theory.
- Then MOB is one possible approach to check whether these additional variables affect the basic standard model or not. And by recursive partitioning you could capture various types of main and interaction effects.

However, if you just have a response variable and a bunch of regressors about which you don't have much prior knowledge, and you want to select both the relevant variables and their functional form, then MOB might help you, but there might also be other methods that are more natural, for example GAMs or boosting.

> Does it cause problems if a variable is used for both regression and
> partitioning?

In principle, this is possible. Whether or not this is meaningful and/or easy to interpret depends on the particular data though.

> I attempted to pre-screen the variables using stepwise linear regression
> and I used the selected variables for regression and all others for
> partitioning. However, this led to the model only having one node.

That's not very surprising, is it? With the stepwise selection, you already tried to capture the potential influence of all regressors on your response. Of course, MOB might have turned up a few additional interactions, but I'm not surprised if it doesn't. We've obtained the most useful results when the basic model had relatively few parameters and was easy/natural to interpret.

Hope that helps,
Z
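
To make the split between regressors and partitioning variables concrete, here is a minimal sketch of a mob() call along the lines described above. It is not taken from the thread: the data are simulated, and the names ly, lx1, z1 and z2 are hypothetical. It assumes the party package is installed. A small, interpretable base model goes to the left of "|", and the variables whose role is unknown go to the right.

```r
# Sketch only: simulated data and hypothetical variable names.
library(party)  # mob(), mob_control(); linearModel comes via modeltools

set.seed(1)
n  <- 400
x1 <- runif(n, 1, 10)
z1 <- factor(sample(c("a", "b"), n, replace = TRUE))  # relevant for partitioning
z2 <- rnorm(n)                                         # irrelevant for partitioning

# The slope of log(y) on log(x1) differs between the two levels of z1.
slope <- ifelse(z1 == "a", 0.5, 1.5)
y <- exp(1 + slope * log(x1) + rnorm(n, sd = 0.2))

# Transform beforehand so the model formula stays simple.
d <- data.frame(ly = log(y), lx1 = log(x1), z1 = z1, z2 = z2)

# Regressors left of "|", partitioning variables right of it.
fit <- mob(ly ~ lx1 | z1 + z2, data = d,
           model = linearModel,
           control = mob_control(minsplit = 40))

print(fit)   # tree structure with the model fitted in each terminal node
coef(fit)    # coefficients of the node-wise linear models
```

In this simulated setting the base model has only two parameters, and the partitioning variable changes one of them, which is the kind of situation where a tree with more than one node would be expected.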