Dear all,

I would like to model the relationship between y and x. y is a binary variable, and x is a count variable which may be Poisson-distributed.

I think it is better to divide x into intervals and convert it to a factor before calling glm(y~x,data=dat,family=binomial).

I tried rpart. As y is binary, I used the "class" method and got the following result.

> rpart(y~x,data=dat,method="class")
n=778 (22 observations deleted due to missingness)

node), split, n, loss, yval, (yprob)
      * denotes terminal node

1) root 778 67 0 (0.91388175 0.08611825) *

With the default method, I get this result instead.

> rpart(y~x,data=dat)
n=778 (22 observations deleted due to missingness)

node), split, n, deviance, yval
      * denotes terminal node

1) root 778 61.230080 0.08611825
  2) x< 19.5 750 53.514670 0.07733333
    4) x< 1.25 390 17.169230 0.04615385 *
    5) x>=1.25 360 35.555560 0.11111110 *
  3) x>=19.5 28 6.107143 0.32142860 *

If I use 1.25 and 19.5 as the cut points, I can convert x into a factor with

> x2 <- cut(q34b,breaks=c(0,1.25,19.5,200),right=F)

The coefficients in y~x2 are significant and make sense.

My question is: is it OK to use the default method in rpart when the response variable is binary? Thanks.

-- 
Ronggui Huang
Department of Sociology
Fudan University, Shanghai, China
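A minimal sketch of the binning-then-glm approach described in the post, run on simulated data because the original `dat` is not available. The negative-binomial generator, its parameters, the logistic coefficients, and the open-ended upper break `Inf` (the post used 200) are assumptions chosen only for illustration.

```r
# Simulated stand-in for the poster's data (assumption, not the real `dat`).
# An overdispersed count is used so that a few values fall above 19.5.
set.seed(1)
n <- 800
x <- rnbinom(n, mu = 5, size = 0.5)
y <- rbinom(n, size = 1, prob = plogis(-3 + 0.08 * x))
dat <- data.frame(x = x, y = y)

# Bin x at the cut points quoted in the post. right = FALSE makes the
# intervals left-closed: [0,1.25), [1.25,19.5), [19.5,Inf).
dat$x2 <- cut(dat$x, breaks = c(0, 1.25, 19.5, Inf), right = FALSE)

# Logistic regression on the binned predictor; the first interval is the
# reference level, so the two slopes are log-odds contrasts against it.
fit <- glm(y ~ x2, data = dat, family = binomial)
summary(fit)
table(dat$x2)
```

Writing `right = FALSE` in full rather than `right=F` avoids surprises if a variable named `F` exists in the workspace.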
Prof Brian Ripley
2007-Jun-15 14:21 UTC
[R] method of rpart when response variable is binary?
On Fri, 15 Jun 2007, ronggui wrote:

> My problem is: is it OK use the default method in rpart when response
> varibale is binary one? Thanks.

Not unless you want a least-squares fit.

Note that you have only 8.6% of one class, and for such an unbalanced classification problem you are unlikely to do better than declaring class 1 for all examples.

-- 
Brian D. Ripley,                  ripley at stats.ox.ac.uk
Professor of Applied Statistics,  http://www.stats.ox.ac.uk/~ripley/
University of Oxford,             Tel:  +44 1865 272861 (self)
1 South Parks Road,                     +44 1865 272866 (PA)
Oxford OX1 3TG, UK                Fax:  +44 1865 272595
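The "least-squares fit" remark can be checked against the numbers in the original output: the root value 0.08611825 is 67/778, the mean of y, and the root deviance 61.23 is 778 × 0.0861 × 0.9139, the sum of squared deviations of a 0/1 variable from its mean. The sketch below reproduces both identities and compares a classification tree with the always-predict-the-majority baseline. It uses simulated data, so the data-generating choices and all object names are assumptions rather than the poster's setup.

```r
library(rpart)

# Simulated stand-in for the thread's data (assumption, not the real `dat`).
set.seed(1)
n <- 800
x <- rnbinom(n, mu = 5, size = 0.5)
y <- rbinom(n, size = 1, prob = plogis(-3 + 0.08 * x))
dat <- data.frame(x = x, y = y)

# Numeric 0/1 response and no `method` argument: rpart fits a
# least-squares ("anova") regression tree.
fit_ls <- rpart(y ~ x, data = dat)
root <- fit_ls$frame[1, ]                # first row of the frame is the root
root_mean <- mean(dat$y)                 # fitted value at the root
root_ss <- sum((dat$y - root_mean)^2)    # "deviance" = residual sum of squares
all.equal(root$yval, root_mean)
all.equal(root$dev, root_ss)

# Classification tree on the same data.
fit_cl <- rpart(y ~ x, data = dat, method = "class")
pred_cl <- predict(fit_cl, type = "class")
err_tree <- mean(as.character(pred_cl) != as.character(dat$y))

# Baseline: declare the majority class for every observation.
majority <- names(which.max(table(dat$y)))
err_majority <- mean(as.character(dat$y) != majority)

# Resubstitution (training-set) error rates of the baseline and the tree.
c(majority = err_majority, tree = err_tree)
```

The error rates here are computed on the training data, so they are optimistic for the tree; a fair comparison would use cross-validation or a held-out set.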
You might use the default setting if you use as.factor(y)~x in rpart(), I think.

On 6/15/07, ronggui <ronggui.huang at gmail.com> wrote:

> My problem is: is it OK use the default method in rpart when response
> varibale is binary one? Thanks.

-- 
WenSui Liu
A lousy statistician who happens to know a little programming
(http://spaces.msn.com/statcompute/blog)
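The suggestion works because rpart guesses the method from the response when `method` is omitted: a factor response selects "class", and a plain numeric response selects "anova". A small sketch on simulated data (the data and object names are assumptions, not the poster's) shows that wrapping y in `as.factor()` yields the same fit as passing `method = "class"` explicitly.

```r
library(rpart)

# Simulated stand-in for the thread's data (assumption, not the real `dat`).
set.seed(1)
n <- 800
x <- rnbinom(n, mu = 5, size = 0.5)
y <- rbinom(n, size = 1, prob = plogis(-3 + 0.08 * x))
dat <- data.frame(x = x, y = y)

fit_num <- rpart(y ~ x, data = dat)                    # numeric response
fit_fac <- rpart(as.factor(y) ~ x, data = dat)         # factor response
fit_cls <- rpart(y ~ x, data = dat, method = "class")  # explicit method

fit_num$method  # "anova"
fit_fac$method  # "class"

# The factor-response fit and the explicit "class" fit have the same
# node sizes and the same predicted classes.
all.equal(fit_fac$frame$n, fit_cls$frame$n)
all.equal(as.character(predict(fit_fac, type = "class")),
          as.character(predict(fit_cls, type = "class")))
```

In other words, `as.factor(y) ~ x` with default settings is a classification tree, not the least-squares tree that the default gives for a numeric 0/1 response.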