Hi, I had a question regarding the rpart command in R. I used seven continuous predictor variables in the model, and the variable "TB122" was chosen for the first split. But looking at the output, there are 4 variables whose splits improve the predicted membership equally (TB122, TB139, TB144, and TB118) - output pasted below.

Node number 1: 268 observations,  complexity param=0.6
  predicted class=0  expected loss=0.3
    class counts:   197    71
   probabilities: 0.735 0.265
  left son=2 (188 obs) right son=3 (80 obs)
  Primary splits:
      TB122 < 80  to the left, improve=50, (0 missing)
      TB139 < 90  to the left, improve=50, (0 missing)
      TB144 < 90  to the left, improve=50, (0 missing)
      TB118 < 90  to the left, improve=50, (0 missing)
      TB129 < 100 to the left, improve=40, (0 missing)

I need to know what method R is using to select the best variable for each node. Somewhere I read that the best split = greatest improvement in predictive accuracy = maximum homogeneity of the yes/no groups resulting from the split = greatest reduction of impurity. I also read that the Gini index, chi-square, or G-square can be used to evaluate the level of impurity. For this function in R:

1) Why exactly did R pick TB122 over the other variables, despite the fact that they all had the same level of improvement? Was TB122 chosen to be the first node because the groups "TB122 < 80" and "TB122 >= 80" were the most homogeneous (i.e., had the least impurity)?

2) If R is using impurity to determine the best split, which measure (the Gini index, chi-square, or G-square) is R using?

Thanks!
Katie
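P.S. For reference, here is a stripped-down sketch of the kind of call I ran (the outcome 'y' and the data frame 'dat' are stand-in names, not my real ones):

library(rpart)

## dat holds the outcome y (coded as a factor) and the seven TB predictors
fit <- rpart(y ~ ., data = dat, method = "class")
summary(fit)   # prints the node-by-node detail pasted above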
Terry Therneau
2009-Dec-14 14:25 UTC
[R] rpart - classification and regression trees (CART)
When two variables have exactly the same figure of merit, they will be listed in the output in the same order in which they appeared in your model statement.

Terry Therneau
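On the second question: with method = "class", rpart scores candidate splits with the Gini index by default; parms = list(split = "information") requests the entropy-based criterion instead. Below is a minimal sketch of both points, assuming an exact tie and using only the five predictors named in the post ('y' and 'dat' are stand-in names). The gini_impurity() helper is hypothetical, but it reproduces the quantity rpart reports as improve= under the default Gini criterion.

library(rpart)

## Gini is the default impurity measure for classification trees;
## "information" (entropy) is the only alternative.
fit1 <- rpart(y ~ TB122 + TB139 + TB144 + TB118 + TB129,
              data = dat, method = "class",
              parms = list(split = "gini"))      # same as the default

## Same model with TB139 listed first: if its best split ties
## TB122's exactly, TB139 becomes the primary split at the root.
fit2 <- rpart(y ~ TB139 + TB122 + TB144 + TB118 + TB129,
              data = dat, method = "class")

as.character(fit1$frame$var[1])   # "TB122"
as.character(fit2$frame$var[1])   # "TB139"

## Hypothetical helper showing what improve= measures for Gini splits:
## a node's impurity is n * (1 - sum of squared class proportions),
## and a split's improvement is the parent's impurity minus the sum
## of its two sons' impurities.
gini_impurity <- function(counts) {
  p <- counts / sum(counts)
  sum(counts) * (1 - sum(p^2))
}

## e.g. the root node above has counts c(197, 71), impurity ~104.4;
## under the Gini criterion, improve=50 would mean the two sons'
## impurities sum to ~54.4.
gini_impurity(c(197, 71))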