Tal Galili
2011-Jun-13 12:47 UTC
[R] In rpart, how is "improve" calculated? (in the "class" case)
Hi all, I apologies in advance if I am missing something very simple here, but since I failed at resolving this myself, I'm sending this question to the list. I would appreciate any help in understanding how the rpart function is (exactly) computing the "improve" (which is given in fit$split), and how it differs when using the split='information' vs split='gini' parameters. According to the help in rpart.object: "improve, which is the improvement in deviance given by this split">From what I understand, that would mean that the "improve" value should notbe different when using different "split" switches. Since it is different, then I suspect that it is reflecting the impurity measure somehow, but I can't seem to understand how exactly. Bellow is some simple R code showing the result for a simple classification tree, with what the function outputs, and what I would have expected to see if "improve" were to simply reflect the change in impurity. set.seed(1324) y <- sample(c(0,1), 20, T) x <- y x[1:5] <- 0 require(rpart) fit <- rpart(y~x, method = "class", parms=list(split='information')) fit$split[,3] # why is improve here 6.84 ? fit <- rpart(y~x, method = "class", parms=list(split='gini')) fit$split[,3] # why is improve here 5.38 ? # Here is what I thought it should have been: # for "information" entropy <- function(p) { if(any(p==1)) return(0) # works for the case when y has only 0 and 1 categories... -sum(p*log(p,2)) } gini <- function(p) {sum(p*(1-p))} obs_1 <- y[x>.5] obs_0 <- y[x<.5] n_l <- sum(x>.5) n_R <- sum(x<.5) n <- length(x) # for entropy (information) impurity_root <- entropy(prop.table(table(y))) impurity_l <- entropy(prop.table(table(obs_0))) impurity_R <-entropy(prop.table(table(obs_1))) # shouldn't this have been "improve" ?? impurity_root - ((n_l/n)*impurity_l + (n_R/n)*impurity_R) # 0.7272 # for "gini" impurity_root <- gini(prop.table(table(y))) impurity_l <- gini(prop.table(table(obs_0))) impurity_R <-gini(prop.table(table(obs_1))) impurity_root - ((n_l/n)*impurity_l + (n_R/n)*impurity_R) # 0.3757 Thanks upfront, Tal ----------------Contact Details:------------------------------------------------------- Contact me: Tal.Galili@gmail.com | 972-52-7275845 Read me: www.talgalili.com (Hebrew) | www.biostatistics.co.il (Hebrew) | www.r-statistics.com (English) ---------------------------------------------------------------------------------------------- [[alternative HTML version deleted]]
Ed Merkle
2011-Jun-14 19:56 UTC
[R] In rpart, how is "improve" calculated? (in the "class" case)
Tal, For the Gini criterion, the "improve" value can be calculated as a weighted sum of the improvement in impurity. Continuing with your original code: # for "gini" impurity_root<- gini(prop.table(table(y))) impurity_l<- gini(prop.table(table(obs_0))) impurity_R<-gini(prop.table(table(obs_1))) # (13 and 7 are sample sizes in respective nodes) 13*(impurity_root - impurity_l) + 7*(impurity_root - impurity_R) [1] 5.384615 This does not appear to extend immediately to the information criterion, however. I'm not sure about the 6.84. Ed On 6/14/11 5:00 AM, r-help-request at r-project.org wrote:> ------------------------------ > > Message: 4 > Date: Mon, 13 Jun 2011 15:47:26 +0300 > From: Tal Galili<tal.galili at gmail.com> > To:r-help at r-project.org > Subject: [R] In rpart, how is "improve" calculated? (in the "class" > case) > Message-ID:<BANLkTimp1aFQoYrKina7H0Rnk=0zKR_iDw at mail.gmail.com> > Content-Type: text/plain > > Hi all, > > I apologies in advance if I am missing something very simple here, but since > I failed at resolving this myself, I'm sending this question to the list. > > I would appreciate any help in understanding how the rpart function is > (exactly) computing the "improve" (which is given in fit$split), and how it > differs when using the split='information' vs split='gini' parameters. > > According to the help in rpart.object: > "improve, which is the improvement in deviance given by this split" >> From what I understand, that would mean that the "improve" value should not > be different when using different "split" switches. Since it is different, > then I suspect that it is reflecting the impurity measure somehow, but I > can't seem to understand how exactly. > > Bellow is some simple R code showing the result for a simple classification > tree, with what the function outputs, and what I would have expected to see > if "improve" were to simply reflect the change in impurity. > > > set.seed(1324) > y<- sample(c(0,1), 20, T) > x<- y > x[1:5]<- 0 > require(rpart) > fit<- rpart(y~x, method = "class", parms=list(split='information')) > fit$split[,3] # why is improve here 6.84 ? > fit<- rpart(y~x, method = "class", parms=list(split='gini')) > fit$split[,3] # why is improve here 5.38 ? > > > # Here is what I thought it should have been: > # for "information" > entropy<- function(p) { > if(any(p==1)) return(0) # works for the case when y has only 0 and 1 > categories... > -sum(p*log(p,2)) > } > gini<- function(p) {sum(p*(1-p))} > > obs_1<- y[x>.5] > obs_0<- y[x<.5] > n_l<- sum(x>.5) > n_R<- sum(x<.5) > n<- length(x) > > # for entropy (information) > impurity_root<- entropy(prop.table(table(y))) > impurity_l<- entropy(prop.table(table(obs_0))) > impurity_R<-entropy(prop.table(table(obs_1))) > # shouldn't this have been "improve" ?? > impurity_root - ((n_l/n)*impurity_l + (n_R/n)*impurity_R) # 0.7272 > > # for "gini" > impurity_root<- gini(prop.table(table(y))) > impurity_l<- gini(prop.table(table(obs_0))) > impurity_R<-gini(prop.table(table(obs_1))) > impurity_root - ((n_l/n)*impurity_l + (n_R/n)*impurity_R) # 0.3757 > > > Thanks upfront, > Tal > > > ----------------Contact > Details:------------------------------------------------------- > Contact me:Tal.Galili at gmail.com | 972-52-7275845 > Read me:www.talgalili.com (Hebrew) |www.biostatistics.co.il (Hebrew) | > www.r-statistics.com (English) > ------------------------------------------------------------------------------------------------ *** Note new email address *** Ed Merkle, PhD Assistant Professor Department of Psychological Sciences (starting August 2011) University of Missouri Columbia, MO, USA 65211