Thanks Terry! I managed to figure that out shortly after posting (as is the way!). Adding an additional covariate that splits below one of the x branches but not the other, and that pushes the class proportion over 0.5, means the x split is retained.

However, I now have another conundrum, this time with rpart in anova mode...

library(rpart)
test_split <- function(offset) {
  y <- c(rep(0,10),rep(0.5,2)) + offset
  x <- c(rep(0,10),rep(1,2))
  # 1 if rpart produced a split, 0 if the tree is only a root
  if (is.null(rpart(y ~ x, minsplit=1, cp=0, xval=0)$splits)) 0 else 1
}

sum(replicate(1000, test_split(0)))   # 1000, i.e. always splits
sum(replicate(1000, test_split(0.5))) # 2-12, i.e. splits only sometimes...

Adding a constant to y and getting different trees is a bit strange, particularly when the result varies from run to run.

Will see if I can track down a copy of the CART book.

Jonathan

________________________________________
From: Therneau, Terry M., Ph.D. [therneau at mayo.edu]
Sent: 16 May 2017 00:43
To: r-help at r-project.org; Marshall, Jonathan
Subject: Re: Odd results from rpart classification tree

You are mixing up two of the steps in rpart. 1: how to find the best candidate split and 2: evaluation of that split. With the "class" method we use the information or Gini criteria for step 1. The code finds a worthwhile candidate split at 0.5 using exactly the calculations you outline. For step 2 the criterion is the "decision theory" loss. In your data the estimated rate is 5/70 = .071 for the left node and 10/30 = .333 for the right node. As a decision rule both predict y=0 (since both are < 1/2). The split predicts 0 on the left and 0 on the right, so it does nothing.

The CART book (Breiman, Friedman, Olshen and Stone) on which rpart is based highlights the difference between odds-regression (for which the final prediction is a percent, and error is Gini) and classification. For the former treat y as continuous.

Terry T.

On 05/15/2017 05:00 AM, r-help-request at r-project.org wrote:
> The following code produces a tree with only a root.
> However, clearly the tree with a split at x=0.5 is better. rpart doesn't seem to want to produce it.
>
> Running the following produces a tree with only a root.
>
> library(rpart)
> y <- c(rep(0,65),rep(1,15),rep(0,20))
> x <- c(rep(0,70),rep(1,30))
> f <- rpart(y ~ x, method='class', minsplit=1, cp=0.0001, parms=list(split='gini'))
>
> Computing the improvement for a split at x=0.5 manually:
>
> obs_L <- y[x<.5]
> obs_R <- y[x>.5]
> n <- length(y)
> n_L <- sum(x<.5)
> n_R <- sum(x>.5)
> gini <- function(p) {sum(p*(1-p))}
> impurity_root <- gini(prop.table(table(y)))
> impurity_L <- gini(prop.table(table(obs_L)))
> impurity_R <- gini(prop.table(table(obs_R)))
> impurity <- impurity_root * n - (n_L*impurity_L + n_R*impurity_R) # 2.880952
>
> Thus, an improvement of 2.88 should result in a split. It does not.
>
> Why?
>
> Jonathan
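
To make the two steps in Terry's reply concrete, here is a minimal base-R sketch on the data from the original post. It does not call rpart and is not rpart's actual code. It assumes that the step 2 evaluation reduces to plain misclassification counts, which should correspond to priors taken from the data and a 0/1 loss. The helper names gini_node and miss are made up for illustration.

```r
# Data from the original post.
y <- c(rep(0, 65), rep(1, 15), rep(0, 20))
x <- c(rep(0, 70), rep(1, 30))

left  <- y[x < 0.5]
right <- y[x > 0.5]

# Step 1: Gini improvement of the candidate split at x = 0.5.
# gini_node is a hypothetical helper, not an rpart function.
gini_node <- function(v) {
  p <- prop.table(table(v))
  sum(p * (1 - p))
}
gini_gain <- length(y) * gini_node(y) -
  (length(left) * gini_node(left) + length(right) * gini_node(right))

# Step 2: misclassification loss when each node predicts its majority class.
# miss is a hypothetical helper; assumes 0/1 loss and priors from the data.
miss <- function(v) length(v) - max(table(v))
loss_root  <- miss(y)
loss_split <- miss(left) + miss(right)

gini_gain               # about 2.88: a worthwhile candidate in step 1
loss_root - loss_split  # 0: no reduction in misclassified cases in step 2
```

Both children have a majority of zeros, so the split leaves 15 misclassified cases, the same as the root. That is consistent with the split being found as a candidate and then not retained.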