Liu, Ningwei
2007-Feb-15 19:19 UTC
[R] Does rpart package have some requirements on the original data set?
Hi,

I am currently studying decision trees using the rpart package in R. I artificially created a data set that includes the dependent variable (y) and a few independent variables (x1, x2, ...). The dependent variable y takes only the values 0 and 1: 90% of the y values are 1 and 10% are 0. When I apply rpart to it, there is no splitting at all.

I am wondering whether this is because of the "special" distribution of y. Since the majority of y is 1 (so the data set carries little information), rpart automatically regards it as already a single class and therefore won't proceed any further. If this understanding is correct, what should I do if I still want rpart to do something with this data set?

Thanks a lot!

Ningwei
Betty Health
2007-Feb-15 22:21 UTC
[R] Does rpart package have some requirements on the original data set?
Hi Ningwei,

I think two things could help to improve the error rate for the minority group. One is to assign a larger prior to the minority group; the other is to make the complexity parameter (cp) smaller.

Betty
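A minimal sketch of those two suggestions, assuming a made-up data set: the data frame d, the way y is generated, and the names fit_default and fit_prior are hypothetical and only for illustration. The prior is passed through the parms argument of rpart and cp through rpart.control; the values 0.5/0.5 and 0.001 are arbitrary starting points, not recommendations.

```r
library(rpart)

set.seed(1)
n  <- 1000
x1 <- runif(n)
x2 <- runif(n)

# Hypothetical response: roughly 10% zeros, all of them in the region x1 < 0.2
y <- factor(ifelse(x1 < 0.2 & runif(n) < 0.5, 0, 1))
d <- data.frame(y, x1, x2)

# Default fit: priors proportional to the class frequencies, cp = 0.01
fit_default <- rpart(y ~ x1 + x2, data = d, method = "class")

# Equal priors for the two classes (order follows levels(y)) and a smaller cp
fit_prior <- rpart(y ~ x1 + x2, data = d, method = "class",
                   parms   = list(prior = c(0.5, 0.5)),
                   control = rpart.control(cp = 0.001))

# Number of nodes in each tree; a value of 1 means the root was never split
nrow(fit_default$frame)
nrow(fit_prior$frame)
```

Comparing the two node counts shows whether the change in prior and cp was enough to make rpart split on a given data set.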
Roberto Perdisci
2007-Feb-15 23:51 UTC
[R] Does rpart package have some requirements on the original data set?
Hi,

try setting minsplit=2 and cp=0. After training you can prune with different values of cp and plot how the accuracy changes. Try this code (which I'm sure can be improved):

require(rpart)

# Prune the tree at each value in cp and record the test-set accuracy and
# the number of nodes. The tree is pruned progressively, so cp should be
# given in increasing order.
rpart.prune.stats <- function(unpruned.tree, testset, class.index.name, cp) {
  acc.rpart.pruned <- list()
  nnodes <- NULL
  rpart.pruned <- unpruned.tree
  for (i in seq_along(cp)) {
    print(paste("cp =", cp[i]))
    rpart.pruned <- prune(rpart.pruned, cp[i])
    pred.rpart.pruned <- predict(rpart.pruned, testset, type = "class")
    acc <- sum(pred.rpart.pruned == testset[, class.index.name]) / nrow(testset)
    acc.rpart.pruned <- c(acc.rpart.pruned, list(acc))
    nnodes <- c(nnodes, nrow(rpart.pruned$frame))
  }
  return(list(acc = acc.rpart.pruned, nnodes = nnodes))
}

# Grow an unpruned tree on the training set, then plot accuracy against cp.
# Every fifth point is labelled with the number of nodes in parentheses.
plot.rpart.prune.results <- function(formula, trainingset, testset,
                                     class.index.name, dataset.name, cp,
                                     add = FALSE, ylim = NULL) {
  rpart.unpruned <- rpart(formula, data = trainingset,
                          control = rpart.control(minsplit = 2, cp = 0))
  res <- rpart.prune.stats(rpart.unpruned, testset, class.index.name, cp)
  x <- unlist(res$acc)     # accuracy for each cp
  y <- unlist(res$nnodes)  # number of nodes for each cp
  print(x)
  print(y)
  if (add) par(new = TRUE)
  plot(cp, x, type = "l", col = "blue", ylim = ylim, ann = FALSE)
  idx <- seq(1, length(cp), by = 5)
  text(cp[idx], x[idx], paste("(", y[idx], ")", sep = ""), pos = 3, cex = 0.5)
  title(main = dataset.name, xlab = "cp", ylab = "Accuracy", font = 3, cex = 0.5)
}

and call it with something similar to

plot.rpart.prune.results(Class ~ ., DatasetX.train, DatasetX.test, "Class", "DatasetX", cp = seq(0, 0.005, by = 0.0001))

You can also oversample the minority class using sampling with replacement, or undersample the majority class. These are two very simple techniques used in machine learning when dealing with unbalanced datasets (there are more complicated techniques which produce better results, though).

Hope this helps,

cheers,
Roberto
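The two resampling techniques mentioned above can be sketched in base R as follows. The data frame d and the names minority, majority, d_over and d_under are hypothetical; this is one simple way to do it under those assumptions, not the only one.

```r
set.seed(42)

# Hypothetical unbalanced data: 90 rows of class 1 and 10 rows of class 0
d <- data.frame(y  = factor(c(rep(1, 90), rep(0, 10))),
                x1 = rnorm(100))

minority <- d[d$y == "0", ]
majority <- d[d$y == "1", ]

# Oversampling: draw minority rows with replacement until they match
# the number of majority rows
idx_over <- sample(nrow(minority), nrow(majority), replace = TRUE)
d_over   <- rbind(majority, minority[idx_over, ])

# Undersampling: keep only as many majority rows as there are minority rows
idx_under <- sample(nrow(majority), nrow(minority))
d_under   <- rbind(minority, majority[idx_under, ])

# Class counts after resampling
table(d_over$y)
table(d_under$y)
```

Either resampled data frame can then be passed to rpart as the training set. Resampling should be applied to the training set only, so that the test set keeps the original class proportions.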