Tal Galili
2011-Jun-21 17:14 UTC
[R] How does rpart compute "improve" for split="information"? (It seems to differ from the "gini" case)
Hello dear R-help members,

I would appreciate any help in understanding how the rpart function computes "improve" (which is given in fit$split) when using the split='information' parameter.

Thanks to Professor Atkinson's help, I was able to find out how this is done when split='gini', by following the explanation here:
http://mayoresearch.mayo.edu/mayo/research/biostat/upload/61.pdf

But the calculation for the information (deviance) impurity is still a mystery to me. Could you help by explaining it?

Below is some R code showing how the gini improvement is computed (and how the information case is not as clear):

# creating data
set.seed(1324)
y <- sample(c(0,1), 20, TRUE)
x <- y
x[1:5] <- 0

# manually making the first split
obs_L <- y[x<.5]
obs_R <- y[x>.5]
n_L <- sum(x<.5)
n_R <- sum(x>.5)
n <- length(x)

calc.impurity <- function(func = gini)
{
    impurity_root <- func(prop.table(table(y)))
    impurity_L <- func(prop.table(table(obs_L)))
    impurity_R <- func(prop.table(table(obs_R)))
    # note: 'impurity_l' (lowercase l) is kept as posted; it is not defined
    # inside this function, so R looks it up in the calling environment
    imp <- impurity_root - ((n_L/n)*impurity_l + (n_R/n)*impurity_R) # 0.3757
    imp*n
}

# for "gini"
require(rpart)
fit <- rpart(y~x, method = "class", parms=list(split='gini'))
fit$split[,3] # 5.384615
gini <- function(p) {sum(p*(1-p))}
calc.impurity(gini) # 5.384615 # success!

# for "information" I fail...
fit <- rpart(y~x, method = "class", parms=list(split='information'))
fit$split[,3] # why is improve here 6.84029 ?
entropy <- function(p) {
    if(any(p==1)) return(0) # works for the case when y has only 0 and 1 categories...
    -sum(p*log(p))
}
calc.impurity(entropy) # 9.247559 != 6.84029

Thanks,
Tal
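
One possible explanation, offered as a sketch rather than a statement about rpart's internals: calc.impurity() refers to impurity_l (lowercase l), which is never assigned inside the function. If a variable of that name was left over in the workspace from earlier Gini experiments, the "information" call would subtract a Gini left-node impurity from an entropy root impurity. The code below checks the arithmetic under that assumption. The class counts are not taken from rerunning set.seed(1324); they are reconstructed from the reported gini value 5.384615 (10 zeros and 10 ones at the root, 10 zeros and 3 ones on the left, 7 ones on the right), and the helper improve() and the variable go_left are hypothetical names introduced here.

```r
# Impurity functions on a vector of class proportions.
gini    <- function(p) sum(p * (1 - p))
entropy <- function(p) {
  p <- p[p > 0]          # drop empty classes so that 0 * log(0) never occurs
  -sum(p * log(p))
}

# n * (root impurity - weighted child impurity), with every input passed
# explicitly so that no stale workspace variable can leak in.
improve <- function(y, go_left, func) {
  imp <- function(v) func(prop.table(table(v)))
  n   <- length(y)
  n_L <- sum(go_left)
  n_R <- n - n_L
  n * (imp(y) - (n_L / n) * imp(y[go_left]) - (n_R / n) * imp(y[!go_left]))
}

# ASSUMPTION: counts reconstructed from the reported numbers, not re-sampled.
y       <- c(rep(0, 10), rep(1, 3), rep(1, 7))
go_left <- c(rep(TRUE, 13), rep(FALSE, 7))

improve_gini <- improve(y, go_left, gini)     # about 5.384615
improve_info <- improve(y, go_left, entropy)  # about 6.84029

# Mixing an entropy root with a Gini left node, as a stale 'impurity_l' would:
improve_mixed <- 20 * entropy(c(0.5, 0.5)) - 13 * gini(c(10, 3) / 13)  # about 9.247559

print(c(gini = improve_gini, information = improve_info, mixed = improve_mixed))
```

If these counts match the data, then n times the decrease in entropy already equals the value rpart reports for split='information', and the 9.247559 figure comes from the mixed calculation rather than from a different formula.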