thr3ads.net - R help - [R] variable selection using residual difference [Mar 2016]


Hassan, Nazatulshima

2016-Mar-18 16:00 UTC

[R] variable selection using residual difference

I have the following example dataset:
set.seed(2001)
n <- 100
Y <-
c(1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0)
X1 <- sample(x=c(0,1,2), size=n, replace=TRUE, prob=c(0.1,0.4,0.5))
X2 <- sample(x=c(0,1,2), size=n, replace=TRUE, prob=c(0.5,0.25,0.25))
X3 <-
c(0,2,2,2,2,2,2,2,0,2,0,2,2,0,0,0,0,0,2,0,0,2,2,0,0,2,2,2,0,2,0,2,0,2,1,2,1,1,1,1,1,1,1,1,1,1,1,0,1,2,2,2,2,2,2,2,2,2,2,2,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,1,1,1,1,1,1,1,1,1,0,0,0,0,0,0,0,1,0,0,0,0)

dat <- data.frame(Y,X1,X2,X3)

I fit a logistic regression model to each variable separately and rank the
variables by their residual difference (res_dif, i.e. null deviance minus
residual deviance), from highest to lowest. To keep it short: the ranking I got
is X3, X1, X2. Then I fit the second-order (two-variable) models as follows and
calculate res_dif again:
mod1 <- glm(Y~X3+X1, family=binomial, data=dat)
mod1$null.deviance-mod1$deviance
mod2 <- glm(Y~X3+X2, family=binomial,data=dat)
mod2$null.deviance-mod2$deviance

Again, I rank the models by res_dif (highest to lowest), so here I choose mod2.
From there I fit the third-order (three-variable) model as follows:
mod3 <- glm(Y~X3+X2+X1, family=binomial, data=dat)
mod3$null.deviance-mod3$deviance

Basically, this continues until the model contains every variable in the data.
My aim is to do variable selection based on res_dif instead of AIC, BIC or R2.
Since my actual dataset has a large number of variables (100 or more), I wonder
how I can apply this procedure using a loop.
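
For readers who want to see the procedure above as a loop, here is a minimal sketch. It is not code from the thread: the function name forward_by_deviance and the simulated data frame toy are made up for illustration, and the sketch assumes that every column other than the response is a candidate predictor. At each step it adds the candidate whose model has the largest res_dif.

```r
# Hypothetical helper (not from the thread): greedy forward selection that
# ranks candidates by res_dif = null deviance - residual deviance.
forward_by_deviance <- function(data, response = "Y") {
  candidates <- setdiff(names(data), response)  # assumption: all other columns are predictors
  selected <- character(0)
  res_dif <- numeric(0)
  while (length(candidates) > 0) {
    # Fit one model per remaining candidate, added to the variables chosen so far
    gains <- sapply(candidates, function(v) {
      fit <- glm(reformulate(c(selected, v), response = response),
                 family = binomial, data = data)
      fit$null.deviance - fit$deviance
    })
    best <- names(gains)[which.max(gains)]
    selected <- c(selected, best)
    res_dif <- c(res_dif, unname(gains[best]))
    candidates <- setdiff(candidates, best)
  }
  data.frame(step = seq_along(selected), added = selected, res_dif = res_dif)
}

# Usage on simulated data (illustrative only)
set.seed(1)
n <- 200
toy <- data.frame(X1 = rnorm(n), X2 = rnorm(n), X3 = rnorm(n))
toy$Y <- rbinom(n, 1, plogis(1.5 * toy$X3 + 0.5 * toy$X1))
path <- forward_by_deviance(toy, response = "Y")
print(path)
```

With the example data from this post, the call would be forward_by_deviance(dat). Note that res_dif cannot decrease when a variable is added to a nested model, so the loop only gives an order of entry; it does not by itself say where to stop.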

Any suggestions would be appreciated.

Kind Regards
Shima



Bert Gunter

2016-Mar-18 18:52 UTC


Suggestion:

Don't do this!

I suggest that you consult with a local statistician or post to a
statistical website like stats.stackexchange.com for what might be
sensible procedures for variable selection (a complex and
controversial topic!) and why what you propose is or is not a good
idea (don't trust me!).

Cheers,
Bert


Bert Gunter

"The trouble with having an open mind is that people keep coming along
and sticking things into it."
-- Opus (aka Berkeley Breathed in his "Bloom County" comic strip)


