Andrew Crane-Droesch
2013-Mar-19 00:08 UTC
[R] How to automate this model selection algorithm?
I've got a complicated semi-parametric model that I'm fitting with mgcv. I start with a model based on theory. Its got lots of interaction terms. I want to winnow it down: removing each interaction term or un-interacted main effect one by one, checking the AIC, and retaining the model that gives me the lowest AIC. I then want to repeat the procedure on the retained model. Here is a simple example: set.seed(42) library(mgcv) N=220 fac = ceiling(runif(N)*2) a = rnorm(N); b = rnorm(N); c = rnorm(N); d = runif(N); e = rnorm(N); y = a^fac + b*e + d/(e+1) m1 = gam(y~ as.factor(fac) + s(a) + s(b) + s(c) + s(d) + te(a,b,c) + te(a,d,by=as.factor(fac)) ) m2 = gam(y~ as.factor(fac) + s(a) + s(b) + s(c) + s(d) + te(b,c) + te(a,d,by=as.factor(fac)) ) m3 = gam(y~ as.factor(fac) + s(a) + s(b) + s(c) + s(d) + te(a,c) + te(a,d,by=as.factor(fac)) ) m4 = gam(y~ as.factor(fac) + s(a) + s(b) + s(c) + s(d) + te(a,b) + te(a,d,by=as.factor(fac)) ) m5 = gam(y~ as.factor(fac) + s(a) + s(b) + s(c) + s(d) + te(a,b,c) + te(d,by=as.factor(fac)) ) m6 = gam(y~ as.factor(fac) + s(a) + s(b) + s(c) + s(d) + te(a,b,c) + te(a,by=as.factor(fac)) ) m7 = gam(y~ as.factor(fac) + s(a) + s(b) + s(c) + s(d) + te(a,b,c) + te(a,d) ) selection = AIC(m1,m2,m3,m4,m5,m6,m7) selection df AIC m1 14.53435 1626.611 m2 12.52501 1622.635 m3 12.54566 1622.615 m4 12.52652 1622.633 m5 13.14108 1620.759 m6 10.99684 1621.156 m7 10.98136 1622.229 m5 is the best m5 = gam(y~ as.factor(fac) + s(a) + s(b) + s(c) + s(d) + te(a,b,c) + te(d,by=as.factor(fac)) ) m5.1 = gam(y~ as.factor(fac) + s(a) + s(b) + s(c) + s(d) + te(b,c) + te(d,by=as.factor(fac)) ) m5.2 = gam(y~ as.factor(fac) + s(a) + s(b) + s(c) + s(d) + te(a,c) + te(d,by=as.factor(fac)) ) m5.3 = gam(y~ as.factor(fac) + s(a) + s(b) + s(c) + s(d) + te(a,b) + te(d,by=as.factor(fac)) ) m5.4 = gam(y~ as.factor(fac) + s(a) + s(b) + s(c) + s(d) + te(a,b,c) #+ te(d,by=as.factor(fac)) ) selection.2 = AIC(m5,m5.1,m5.2,m5.3,m5.4) selection.2 df AIC m5 13.363029 1621.183 m5.1 9.671656 1617.641 m5.2 9.730047 1617.549 m5.3 9.706424 1617.569 m5.4 9.857504 1620.028 5.2 is the best ...and so on. I'd next try out each interaction term or un-interacted main effect in m5.2. Question is how I can automate this? To start with a model (m1, in my case), and have R give me the best model after running this algorithm to the point where there are no longer any "better" models according to this algorithm? Right now I can do it manually, but as I add data over time, model selection may change. Note the mgcv's "select=TRUE" functionality doesn't quite work for me (at least not directly), because I want to see if single parts of three-way interactions can/should be removed. Thanks in advance for any tips, and apologies for cross-posting (on stack overflow). Best, Andrew
Andrew Crane-Droesch
2013-Mar-19 00:28 UTC
[R] How to automate this model selection algorithm?
Just to add, I've been playing around with "select=TRUE" in mgcv, and it does seem that it could work if I were to specify all of the nested two-way interactions in my three-way interactions (see the toy example below). But the problem is that I don't have enough degrees of freedom to feed such a model into GAM using my main dataset. N=200 a = rnorm(N) b = rnorm(N) c = rnorm(N) y = rnorm(N)+a+b+c+a*b m = gam(y~s(a)+s(b)+s(c)+te(a,b)+te(a,b,c)) msel = gam(y~s(a)+s(b)+s(c)+te(a,b)+te(a,b,c),select=TRUE) mdrop = gam(y~s(a)+s(b)+s(c)+te(a,b)) summary(m) summary(msel) summary(mdrop) plot(density(m$fitted.values)) lines(density(msel$fitted.values),col="red") lines(density(mdrop$fitted.values),col="blue")