thr3ads.net - R help - [R] Does rpart package have some requirements on the original data set? [Feb 2007]


Liu, Ningwei

2007-Feb-15 19:19 UTC

[R] Does rpart package have some requirements on the original data set?

Hi,

I am currently studying decision trees using the rpart package in R. I
artificially created a data set which includes the dependent variable
(y) and a few independent variables (x1, x2...). The dependent variable
y takes only the values 0 and 1: 90% of the y values are 1 and 10% are
0. When I apply rpart to it, there is no splitting at all.

I am wondering whether this is because of the "special" distribution of
y. Since the majority of y is 1 (so the data set carries little
information), rpart automatically regards it as already a single class
and therefore won't proceed any further. If this understanding is
correct, what should I do if I still want rpart to do something with
this data set?

Thanks a lot!

Ningwei



Betty Health

2007-Feb-15 22:21 UTC


Hi Ningwei, I think two things could help to reduce the error rate for
the minority group. One is to assign a larger prior to the minority
group; the other is to make the complexity parameter (cp) smaller.
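
As a rough sketch of those two suggestions: the data below are synthetic and the names (d, x1, x2, fit.default, fit.prior) are made up for illustration; only the parms and control arguments of rpart are the point. The priors are given in the order of the factor levels of y, and the values 0.5/0.5 and cp = 0.001 are arbitrary choices, not recommendations.

```r
library(rpart)

set.seed(1)
n  <- 1000
x1 <- runif(n)
x2 <- runif(n)
# Class 0 is rare (about 10% overall) and more likely when x1 is large.
p0 <- ifelse(x1 > 0.8, 0.3, 0.05)
y  <- factor(ifelse(runif(n) < p0, 0, 1), levels = c(0, 1))
d  <- data.frame(y, x1, x2)

# Default settings: priors taken from the observed class frequencies, cp = 0.01.
fit.default <- rpart(y ~ x1 + x2, data = d, method = "class")

# Equal priors for the two classes and a smaller complexity parameter.
fit.prior <- rpart(y ~ x1 + x2, data = d, method = "class",
                   parms   = list(prior = c(0.5, 0.5)),
                   control = rpart.control(cp = 0.001))

# Number of nodes in each tree (a value of 1 means no split at all).
nrow(fit.default$frame)
nrow(fit.prior$frame)
```

Comparing the two node counts, or printing the two fits, should show whether the change in priors and cp makes a difference on your own data.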

Betty



Roberto Perdisci

2007-Feb-15 23:51 UTC


Hi,
  try setting minsplit=2 and cp=0. After training you can prune with
different values of cp and plot how the accuracy changes.

Try this code (which I'm sure can be improved):

library(rpart)

# Prune the unpruned tree at each value of cp and record the test-set
# accuracy and the number of nodes of each pruned tree.
rpart.prune.stats <- function(unpruned.tree, testset, class.index.name, cp) {
    acc    <- numeric(length(cp))
    nnodes <- integer(length(cp))

    for (i in seq_along(cp)) {
        print(paste("cp =", cp[i]))

        # Always prune the full tree, so the result does not depend on
        # the order of the cp values.
        pruned    <- prune(unpruned.tree, cp = cp[i])
        pred      <- predict(pruned, testset, type = "class")
        acc[i]    <- mean(pred == testset[[class.index.name]])
        nnodes[i] <- nrow(pruned$frame)
    }

    list(acc = acc, nnodes = nnodes)
}

# Grow a full tree (minsplit = 2, cp = 0), then plot test-set accuracy
# against cp, labelling every fifth point with the number of nodes.
plot.rpart.prune.results <- function(formula, trainingset, testset,
                                     class.index.name, dataset.name, cp,
                                     add = FALSE, ylim = NULL) {
    rpart.unpruned <- rpart(formula, data = trainingset,
                            control = rpart.control(minsplit = 2, cp = 0))
    res <- rpart.prune.stats(rpart.unpruned, testset, class.index.name, cp)

    acc    <- res$acc
    nnodes <- res$nnodes

    print(acc)
    print(nnodes)

    if (add)
        par(new = TRUE)
    plot(cp, acc, type = "l", col = "blue", ylim = ylim, ann = FALSE)

    idx <- seq(1, length(cp), by = 5)
    text(cp[idx], acc[idx], paste("(", nnodes[idx], ")", sep = ""),
         pos = 3, cex = 0.5)
    title(main = dataset.name, xlab = "cp", ylab = "Accuracy",
          font = 3, cex = 0.5)
}


and call it with something like:
plot.rpart.prune.results(Class~.,DatasetX.train,DatasetX.test,"Class","DatasetX",cp=seq(0,0.005,by=0.0001))


You can also oversample the minority class by sampling with
replacement, or undersample the majority class. These are two very
simple techniques used in machine learning when dealing with
unbalanced data sets (there are more complicated techniques that
produce better results, though).
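
A minimal sketch of both resampling ideas in base R, on a made-up data frame with the same 90/10 split as in the question. The names d, d.over and d.under are only illustrative, and balancing the classes exactly is one choice among many. In practice you would resample only the training set, not the test set.

```r
set.seed(1)
d <- data.frame(y  = factor(rep(c(0, 1), times = c(100, 900))),
                x1 = runif(1000))

minority <- which(d$y == 0)
majority <- which(d$y == 1)

# Oversampling: draw minority rows with replacement until both classes
# have the same number of rows.
over.idx <- c(majority, sample(minority, length(majority), replace = TRUE))
d.over   <- d[over.idx, ]

# Undersampling: keep a random subset of the majority rows (without
# replacement) of the same size as the minority class.
under.idx <- c(minority, sample(majority, length(minority)))
d.under   <- d[under.idx, ]

table(d.over$y)
table(d.under$y)
```

Either resampled data frame can then be passed to rpart in place of the original training data.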

hope this helps,
cheers,
Roberto

