Robert Smith
2009-Jul-26 18:37 UTC
[R] Question about rpart decision trees (being used to predict customer churn)
Hi,

I am using rpart decision trees to analyze customer churn. I am finding that the decision trees created are not effective because they are not able to recognize factors that influence churn. I have created an example situation below. What do I need to do for rpart to build a tree with the variable experience? My guess is that this would happen if rpart used the loss matrix while creating the tree.

> experience <- as.factor(c(rep("good",90), rep("bad",10)))
> cancel <- as.factor(c(rep("no",85), rep("yes",5), rep("no",5), rep("yes",5)))
> table(experience, cancel)
          cancel
experience no yes
      bad   5   5
      good 85   5
> rpart(cancel ~ experience)
n= 100

node), split, n, loss, yval, (yprob)
      * denotes terminal node

1) root 100 10 no (0.9000000 0.1000000) *

I tried the following commands with no success.

rpart(cancel ~ experience, control=rpart.control(cp=.0001))
rpart(cancel ~ experience, parms=list(split='information'))
rpart(cancel ~ experience, parms=list(split='information'), control=rpart.control(cp=.0001))
rpart(cancel ~ experience, parms=list(loss=matrix(c(0,1,10000,0), nrow=2, ncol=2)))

Thanks a lot for your help.

Best regards,
Robert
Terry Therneau
2009-Jul-27 12:42 UTC
[R] Question about rpart decision trees (being used to predict customer churn)
--- begin included message ---
[Robert's original message, quoted in full above: rpart(cancel ~ experience) returns only a root node, and he asks whether using a loss matrix would make rpart split on experience.]
--- end inclusion ---

The program works fine with

    rpart(as.numeric(cancel) ~ experience)

which fits the probability of cancellation rather than a YES/NO decision for each node. I usually find this more informative, particularly for early analysis. Breiman et al., in the original CART book, refer to this as "odds regression". In this analysis, if a split leads to one child with 30% cancellation and another with 5% cancellation, the split is successful. When using a factor as the y variable, that same split is scored as useless, since the parent and both children are all labeled "no".

By adjusting the losses to be just right you can get your data to split. You need to make them such that 85/5 is predicted as 'no cancel' and 5/5 as 'yes cancel'; 1:2 losses would suffice. In your example, where you set the losses to 1:10000, both nodes are scored as 'yes'.
Terry Therneau
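[Editor's note: the two fixes Terry describes can be sketched as follows, using Robert's data from above. The 1:2 loss matrix here follows the same row ordering Graham Williams uses later in the thread; treat the exact orientation as an assumption to verify against ?rpart.]

```r
library(rpart)

# Robert's example data
experience <- as.factor(c(rep("good", 90), rep("bad", 10)))
cancel <- as.factor(c(rep("no", 85), rep("yes", 5), rep("no", 5), rep("yes", 5)))

# 1) "Odds regression": model the cancellation probability directly.
#    The split is now worthwhile because it separates a ~6% node
#    from a 50% node, even though both would be labeled "no".
fit.prob <- rpart(as.numeric(cancel) ~ experience)

# 2) Classification with 1:2 losses, making a missed cancellation
#    twice as costly as a false alarm; the 5/5 "bad" node is then
#    labeled "yes" and the split survives.
fit.loss <- rpart(cancel ~ experience, method = "class",
                  parms = list(loss = matrix(c(0, 2, 1, 0),
                                             byrow = TRUE, nrow = 2)))

print(fit.prob)
print(fit.loss)
```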
Carlos J. Gil Bellosta
2009-Aug-01 19:24 UTC
[R] Question about rpart decision trees (being used to predict customer churn)
Hello,

If you do

    my.tree <- rpart(cancel ~ experience)

and then check my.tree$frame, you will note that the complexity parameter there is 0. Check ?rpart.object for a description of what this output means. Essentially, you will not be able to break the leaf unless you set a complexity parameter below that value, that is, never. You may need to go into the internals of the function (and the C code) in order to understand how this parameter is calculated. It looks like an oddity to me, and it is worth trying to understand why.

Best regards,

Carlos J. Gil Bellosta
http://www.datanalytics.com

P.S.: Note that there is a bug in your submitted code that requires some hand fixing.

On Sun, 2009-07-26 at 11:37 -0700, Robert Smith wrote:
> [Robert's original message, quoted in full above.]
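[Editor's note: Carlos's check can be reproduced directly. my.tree$frame holds one row per node, and its complexity column is the value compared against cp; a minimal sketch on Robert's data.]

```r
library(rpart)

# Robert's example data
experience <- as.factor(c(rep("good", 90), rep("bad", 10)))
cancel <- as.factor(c(rep("no", 85), rep("yes", 5), rep("no", 5), rep("yes", 5)))

my.tree <- rpart(cancel ~ experience)

# The tree collapses to a single root node, and the per-node
# complexity recorded in the frame shows why no cp setting
# can force the split.
print(my.tree$frame[, c("var", "n", "dev", "complexity")])
```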
Graham Williams
2009-Aug-01 22:41 UTC
[R] Question about rpart decision trees (being used to predict customer churn)
2009/7/27 Robert Smith <robertpsmith2008@gmail.com>:
> [Robert's original message, quoted in full above.]

Hi Robert,

Perhaps try a less extreme loss matrix:

rpart(cancel ~ experience,
      parms=list(loss=matrix(c(0,5,1,0), byrow=TRUE, nrow=2)))

Output from Rattle:

Summary of the Tree model for Classification (built using rpart):

n= 100

node), split, n, loss, yval, (yprob)
      * denotes terminal node

1) root 100 50 no (0.90000000 0.10000000)
  2) experience=good 90 25 no (0.94444444 0.05555556) *
  3) experience=bad 10 5 yes (0.50000000 0.50000000) *

Classification tree:
rpart(formula = cancel ~ ., data = crs$dataset, method = "class",
    parms = list(loss = matrix(c(0, 5, 1, 0), byrow = TRUE, nrow = 2)),
    control = rpart.control(cp = 0.0001, usesurrogate = 0, maxsurrogate = 0))

Variables actually used in tree construction:
[1] experience

Root node error: 50/100 = 0.5

n= 100

      CP nsplit rel error xerror xstd
1 0.4000      0       1.0    1.0 0.30
2 0.0001      1       0.6    0.6 0.22

TRAINING DATA Error Matrix - Counts

         Actual
Predicted no yes
      no  85   5
      yes  5   5

TRAINING DATA Error Matrix - Percentages

         Actual
Predicted no yes
      no  85   5
      yes  5   5

Time taken: 0.01 secs

Generated by Rattle 2009-08-02 08:24:50 gjw