Tal Galili
2011-Jun-21 17:14 UTC
[R] How does rpart compute "improve" for split="information"? (It seems to differ from the "gini" case)
Hello dear R-help members,
I would appreciate any help in understanding how the rpart function computes
the "improve" value (reported in fit$splits) when using the
split='information' parameter.
Thanks to Professor Atkinson's help, I was able to work out how this is done
when split='gini' by following the explanation here:
http://mayoresearch.mayo.edu/mayo/research/biostat/upload/61.pdf
But the calculation for the information (deviance) impurity is still a
mystery to me.
Could you help explain it?
Below is some R code showing how the gini improvement is computed (and how
the information improvement is not as clear):
# creating data
set.seed(1324)
y <- sample(c(0,1), 20, replace = TRUE)
x <- y
x[1:5] <- 0
# manually making the first split
obs_L <- y[x<.5]
obs_R <- y[x>.5]
n_L <- sum(x<.5)
n_R <- sum(x>.5)
n <- length(x)
calc.impurity <- function(func = gini)
{
  impurity_root <- func(prop.table(table(y)))
  impurity_L <- func(prop.table(table(obs_L)))
  impurity_R <- func(prop.table(table(obs_R)))
  imp <- impurity_root - ((n_L/n)*impurity_L + (n_R/n)*impurity_R) # 0.3757
  imp*n
}
# for "gini"
library(rpart)
fit <- rpart(y~x, method = "class",
             parms=list(split='gini'))
fit$splits[, "improve"] # 5.384615
gini <- function(p) {sum(p*(1-p))}
calc.impurity(gini) # 5.384615 # success!
# for "information" I fail...
fit <- rpart(y~x, method = "class",
             parms=list(split='information'))
fit$splits[, "improve"] # why is improve here 6.84029 ?
entropy <- function(p) {
  if(any(p==1)) return(0) # works for the case when y has only 0 and 1 categories...
  -sum(p*log(p))
}
calc.impurity(entropy) # 9.247559 != 6.84029
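For reference, the manual calculation above can be packaged as a self-contained sketch that does not rely on global variables. The names impurity_decrease, props and go_left are hypothetical helpers, not part of rpart, and the entropy here uses the natural log as in the code above. This only restates the n-weighted impurity decrease I am computing by hand; it is not a claim about what rpart does internally for split='information'.

```r
# Hypothetical helper (not an rpart function): n-weighted impurity decrease
# for a binary split. `go_left` is assumed to be a logical vector marking
# the observations sent to the left child.
impurity_decrease <- function(y, go_left, impurity) {
  # class proportions within a node
  props <- function(v) as.numeric(prop.table(table(v)))
  n   <- length(y)
  n_L <- sum(go_left)
  n_R <- n - n_L
  delta <- impurity(props(y)) -
    (n_L / n) * impurity(props(y[go_left])) -
    (n_R / n) * impurity(props(y[!go_left]))
  n * delta
}

gini <- function(p) sum(p * (1 - p))
# Treat 0 * log(0) as 0 so that empty classes contribute nothing
entropy <- function(p) -sum(ifelse(p > 0, p * log(p), 0))

# Toy check: a perfect split of two 0s and two 1s
y_demo  <- c(0, 0, 1, 1)
left    <- c(TRUE, TRUE, FALSE, FALSE)
impurity_decrease(y_demo, left, gini)     # 2
impurity_decrease(y_demo, left, entropy)  # 4 * log(2), about 2.77
```

With the data above, impurity_decrease(y, x < .5, gini) should give the same number as calc.impurity(gini), and likewise for entropy.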
Thanks,
Tal
