Hassan, Nazatulshima
2016-Mar-18 16:00 UTC
[R] variable selection using residual difference
I have the following example dataset:

set.seed(2001)
n <- 100
Y <- c(1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0)
X1 <- sample(x=c(0,1,2), size=n, replace=TRUE, prob=c(0.1,0.4,0.5))
X2 <- sample(x=c(0,1,2), size=n, replace=TRUE, prob=c(0.5,0.25,0.25))
X3 <- c(0,2,2,2,2,2,2,2,0,2,0,2,2,0,0,0,0,0,2,0,0,2,2,0,0,2,2,2,0,2,0,2,0,2,1,2,1,1,1,1,1,1,1,1,1,1,1,0,1,2,2,2,2,2,2,2,2,2,2,2,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,1,1,1,1,1,1,1,1,1,0,0,0,0,0,0,0,1,0,0,0,0)

dat <- data.frame(Y,X1,X2,X3)

I fit a logistic regression model to each variable separately and rank the variables by the residual difference (res_dif, the null deviance minus the residual deviance), from highest to lowest. To simplify, I got the ranking X3, X1, X2. Then I fit the two-variable (second order) models as follows and again calculate res_dif:

mod1 <- glm(Y~X3+X1, family=binomial, data=dat)
mod1$null.deviance-mod1$deviance
mod2 <- glm(Y~X3+X2, family=binomial, data=dat)
mod2$null.deviance-mod2$deviance

Again, I rank the models by res_dif (highest to lowest). So here, I choose mod2. From there I fit the three-variable (third order) model as follows:

mod3 <- glm(Y~X3+X2+X1, family=binomial, data=dat)
mod3$null.deviance-mod3$deviance

Basically, this continues until the model contains every variable in the data.

My aim is to do variable selection based on res_dif instead of AIC, BIC or R2. Since my actual dataset has hundreds of variables, I wonder how I can apply this procedure using a loop.

Any suggestions would be appreciated.

Kind Regards
Shima
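The stepwise procedure described above could be written as a loop roughly as follows. This is only a minimal sketch, not a recommended selection method: the function name forward_by_deviance() and the simulated data frame toy are hypothetical and do not come from the post, and it assumes a binary response with every other column of the data frame treated as a candidate predictor. At each step it refits the current model with each remaining variable added in turn, and keeps the variable with the largest res_dif.

```r
# Greedy forward selection ranked by res_dif (null deviance minus residual
# deviance). forward_by_deviance() is a hypothetical helper, not a stats function.
forward_by_deviance <- function(data, response) {
  candidates <- setdiff(names(data), response)
  selected <- character(0)   # variables chosen so far, in order
  gain <- numeric(0)         # res_dif of the model after each step

  while (length(candidates) > 0) {
    # Fit one model per remaining candidate, each added to the current set.
    res_dif <- sapply(candidates, function(v) {
      fit <- glm(reformulate(c(selected, v), response = response),
                 family = binomial, data = data)
      fit$null.deviance - fit$deviance
    })
    best <- names(res_dif)[which.max(res_dif)]
    selected <- c(selected, best)
    gain <- c(gain, unname(res_dif[best]))
    candidates <- setdiff(candidates, best)
  }

  data.frame(step = seq_along(selected), added = selected, res_dif = gain)
}

# Usage on simulated data (assumed for illustration only).
set.seed(1)
n <- 200
toy <- data.frame(x1 = rnorm(n), x2 = rnorm(n), x3 = rnorm(n))
toy$y <- rbinom(n, 1, plogis(1.5 * toy$x1))
path <- forward_by_deviance(toy, "y")
print(path)
```

Because each model in the path contains the previous one, res_dif can only stay the same or grow as variables are added, so the loop by itself ranks variables but never says where to stop; a stopping rule would have to be supplied separately. For a single step, add1() in the stats package reports the deviance for each one-term addition to a fitted glm, which may be a simpler starting point than a hand-written loop.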
Suggestion: Don't do this!

I suggest that you consult a local statistician or post to a statistics website like stats.stackexchange.com to find out what might be sensible procedures for variable selection (a complex and controversial topic!) and why what you propose is or is not a good idea (don't trust me!).

Cheers,
Bert

Bert Gunter

"The trouble with having an open mind is that people keep coming along and sticking things into it."
-- Opus (aka Berkeley Breathed in his "Bloom County" comic strip)

On Fri, Mar 18, 2016 at 9:00 AM, Hassan, Nazatulshima <Nazatulshima.Hassan at liverpool.ac.uk> wrote:
> I have the following example dataset
> [...]