thr3ads.net - R help - [R] User defined split function in Rpart [Jan 2007]


Paolo Radaelli

2007-Jan-03 16:56 UTC

[R] User defined split function in Rpart

Dear all,
I'm trying to work with the user-defined split function in rpart
(file rpart\tests\usersplits.R in
http://cran.r-project.org/src/contrib/rpart_3.1-34.tar.gz - see the bottom of
the email).
Suppose we have the following data.frame (note that the values of x are already
sorted):
> D
  y x
1 7 0.428
2 3 0.876
3 1 1.467
4 6 1.492
5 3 1.703
6 4 2.406
7 8 2.628
8 6 2.879
9 5 3.025
10 3 3.494
11 2 3.496
12 6 4.623
13 4 4.824
14 6 4.847
15 2 6.234
16 7 7.041
17 2 8.600
18 4 9.225
19 5 9.381
20 8 9.986
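
For anyone who wants to reproduce the example, the data frame printed above can be rebuilt as follows. This is only a convenience sketch; the values are copied from the listing, and the two checks at the end simply confirm that x is sorted and that the mean of y agrees with the root node value (4.6) reported by rpart further down.

```r
# Rebuild the example data frame from the values listed in the post.
D <- data.frame(
  y = c(7, 3, 1, 6, 3, 4, 8, 6, 5, 3, 2, 6, 4, 6, 2, 7, 2, 4, 5, 8),
  x = c(0.428, 0.876, 1.467, 1.492, 1.703, 2.406, 2.628, 2.879, 3.025, 3.494,
        3.496, 4.623, 4.824, 4.847, 6.234, 7.041, 8.600, 9.225, 9.381, 9.986)
)

# x is already in increasing order, as stated in the post.
x_is_sorted <- !is.unsorted(D$x)

# Mean and deviance (sum of squared deviations) of y at the root node.
root_mean <- mean(D$y)
root_dev  <- sum((D$y - root_mean)^2)
```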

Running rpart with minbucket=1 and maxdepth=1 we get the following
tree (which uses deviance by default):
> rpart(D$y~D$x,control=rpart.control(minbucket=1,maxdepth=1))
n= 20
node), split, n, deviance, yval
      * denotes terminal node
1) root 20 84.80000 4.600000
2) D$x< 9.6835 19 72.63158 4.421053 *
3) D$x>=9.6835 1 0.00000 8.000000 *

This means that the first 19 observations have been sent to the left side of
the tree and one observation to the right.
This agrees with the goodness vector, whose maximum is its last element.
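
The choice of this split can be checked by hand in base R, without rpart. The sketch below computes, for every candidate split of observations 1:i versus (i+1):n, the reduction in deviance, then takes the largest one. The helper name `dev` and the midpoint rule for the cutpoint are assumptions made for illustration; the midpoint rule is consistent with the value 9.6835 printed in the tree.

```r
y <- c(7, 3, 1, 6, 3, 4, 8, 6, 5, 3, 2, 6, 4, 6, 2, 7, 2, 4, 5, 8)
x <- c(0.428, 0.876, 1.467, 1.492, 1.703, 2.406, 2.628, 2.879, 3.025, 3.494,
       3.496, 4.623, 4.824, 4.847, 6.234, 7.041, 8.600, 9.225, 9.381, 9.986)
n <- length(y)

# Deviance of a numeric vector: sum of squared deviations from its mean.
dev <- function(v) sum((v - mean(v))^2)

# Reduction in deviance for splitting obs 1:i vs (i+1):n, i = 1, ..., n-1.
gain <- sapply(seq_len(n - 1), function(i) {
  dev(y) - dev(y[1:i]) - dev(y[(i + 1):n])
})

best <- which.max(gain)                       # position of the best split
cutpoint <- (x[best] + x[best + 1]) / 2       # midpoint between neighbours
left_dev <- dev(y[1:best])                    # deviance of the left node
```

With these data the best split is the last candidate, and the left-node deviance and cutpoint agree with the rpart output shown above.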

The thing I really don't understand is the direction vector:
# direction= -1 = send "y< cutpoint" to the left side of the tree
#             1 = send "y< cutpoint" to the right

What does it mean?
In the example considered here we have
> sign(lmean)
[1] 1 1 -1 -1 -1 -1 -1 1 1 1 -1 -1 -1 -1 -1 -1 -1 -1 -1

Which criterion is used?
In my opinion all the values should be equal to -1, given that the observations
have to be sent to the left side of the tree.
Can someone help me?
Thank you

#######################################################
# The split function, where most of the work occurs.
# Called once per split variable per node.
# If continuous=T (the case here considered)
#   The actual x variable is ordered
#   y is supplied in the sort order of x, with no missings,
#   return two vectors of length (n-1):
#     goodness = goodness of the split, larger numbers are better.
#                0 = couldn't find any worthwhile split
#       the ith value of goodness evaluates splitting obs 1:i vs (i+1):n
#     direction= -1 = send "y< cutpoint" to the left side of the tree
#                 1 = send "y< cutpoint" to the right
#       this is not a big deal, but making larger "mean y's" move towards
#       the right of the tree, as we do here, seems to make it easier to
#       read
# If continuous=F, x is a set of integers defining the groups for an
#   unordered predictor. In this case:
#     direction = a vector of length m= "# groups". It asserts that the
#       best split can be found by lining the groups up in this order
#       and going from left to right, so that only m-1 splits need to
#       be evaluated rather than 2^(m-1)
#     goodness = m-1 values, as before.
#
# The reason for returning a vector of goodness is that the C routine
# enforces the "minbucket" constraint. It selects the best return value
# that is not too close to an edge.
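
The function quoted in this message only covers the continuous case. As a rough sketch of what the unordered (continuous=F) branch described in the comments could look like, the code below lines the groups up by their mean of centered y and evaluates the m-1 splits along that order. The function name `split_categorical`, the reduced argument list, and the toy data are hypothetical; ordering groups by mean is an assumption that suits this squared-error style of goodness and is not guaranteed to match the code shipped with rpart.

```r
# Hypothetical sketch of the unordered-predictor branch (not the rpart source).
split_categorical <- function(y, wt, x) {
  y <- y - sum(y * wt) / sum(wt)          # center y, as in the continuous case
  ux <- sort(unique(x))                   # group labels, in sorted order
  wtsum <- tapply(wt, x, sum)             # total weight per group
  ysum  <- tapply(y * wt, x, sum)         # weighted sum of centered y per group
  ord <- order(ysum / wtsum)              # line groups up by their mean
  m <- length(ord)
  temp <- cumsum(ysum[ord])[-m]
  left.wt  <- cumsum(wtsum[ord])[-m]
  right.wt <- sum(wt) - left.wt
  lmean <- temp / left.wt
  rmean <- -temp / right.wt
  goodness <- (left.wt * lmean^2 + right.wt * rmean^2) / sum(wt * y^2)
  # direction holds the group labels in the order they were lined up
  list(goodness = as.numeric(goodness), direction = ux[ord])
}

# Toy data: three groups with means 1.5, 9.5 and 5.5.
toy_y <- c(1, 2, 9, 10, 5, 6)
toy_x <- c(1, 1, 2, 2, 3, 3)
res <- split_categorical(toy_y, rep(1, 6), toy_x)
```

Here `res$direction` lists the groups from lowest to highest mean, and `res$goodness` has one value per cut along that ordering.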
The vector wt of weights in our case is:
> wt
[1] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1

temp2 <- function(y, wt, x, parms, continuous) {
  # Center y
  n <- length(y)
  y <- y - sum(y*wt)/sum(wt)
  if (continuous) {
    # continuous x variable
    temp <- cumsum(y*wt)[-n]
    left.wt <- cumsum(wt)[-n]
    right.wt <- sum(wt) - left.wt
    lmean <- temp/left.wt
    rmean <- -temp/right.wt
    goodness <- (left.wt*lmean^2 + right.wt*rmean^2)/sum(wt*y^2)
    list(goodness= goodness, direction=sign(lmean))
  }
}
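
To see what the function returns on the example data, it can be called directly, outside rpart. The sketch below repeats the function so that it runs on its own; passing `parms = NULL` is an assumption that is harmless here because the function never uses `parms`. Since y is centered first, `sign(lmean)[i]` is -1 when the observations left of candidate split i have a below-average mean and +1 when they have an above-average mean, which is why the vector is not constant.

```r
# Copy of the split function from the post (continuous case only).
temp2 <- function(y, wt, x, parms, continuous) {
  n <- length(y)
  y <- y - sum(y * wt) / sum(wt)          # center y
  if (continuous) {
    temp <- cumsum(y * wt)[-n]
    left.wt <- cumsum(wt)[-n]
    right.wt <- sum(wt) - left.wt
    lmean <- temp / left.wt               # mean of centered y, left side
    rmean <- -temp / right.wt             # mean of centered y, right side
    goodness <- (left.wt * lmean^2 + right.wt * rmean^2) / sum(wt * y^2)
    list(goodness = goodness, direction = sign(lmean))
  }
}

y <- c(7, 3, 1, 6, 3, 4, 8, 6, 5, 3, 2, 6, 4, 6, 2, 7, 2, 4, 5, 8)
x <- c(0.428, 0.876, 1.467, 1.492, 1.703, 2.406, 2.628, 2.879, 3.025, 3.494,
       3.496, 4.623, 4.824, 4.847, 6.234, 7.041, 8.600, 9.225, 9.381, 9.986)
wt <- rep(1, length(y))

res <- temp2(y, wt, x, parms = NULL, continuous = TRUE)
best <- which.max(res$goodness)           # index of the best candidate split
best_direction <- res$direction[best]     # direction at that split

# goodness is the fraction of the root deviance removed by the split,
# so multiplying by the root deviance gives the deviance reduction.
dev_reduction <- res$goodness[best] * sum((y - mean(y))^2)
```

At the best split the direction is -1, which is consistent with the tree above sending x < 9.6835 to the left node. The other entries describe candidate splits that were not chosen.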

Paolo Radaelli
Dipartimento di Metodi Quantitativi per le Scienze Economiche ed Aziendali
Facoltà di Economia
Università degli Studi di Milano-Bicocca
P.zza dell'Ateneo Nuovo, 1
20126 Milano
Italy
e-mail paolo.radaelli at unimib.it

R Help

2008-Feb-13 16:15 UTC

[R] User defined split function in Rpart

Direction corresponds to goodness: for the split represented by
goodness[i], direction[i] = -1 means that values less than the split point at
goodness[i] will go left and values greater than it will go right. If
direction[i] = 1 they will be sent to the opposite sides.

The long and short of it is that, for most trees, we want to send
values smaller than the split value left and larger values right, so
direction should be -1 for all values, i.e. direction = rep(-1, length(goodness)).
The vector is there only in case you want to
customize the structure of your tree.
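
As a minimal sketch of that suggestion, the split function from the original message could be changed so that it always returns -1. The name `split_always_left` is hypothetical, and only the continuous case is handled; the goodness calculation is unchanged, so the same split is selected and only the left/right layout of the tree could differ.

```r
# Hypothetical variant: same goodness, but direction fixed at -1.
split_always_left <- function(y, wt, x, parms, continuous) {
  if (!continuous) stop("this sketch covers the continuous case only")
  n <- length(y)
  y <- y - sum(y * wt) / sum(wt)          # center y
  temp <- cumsum(y * wt)[-n]
  left.wt <- cumsum(wt)[-n]
  right.wt <- sum(wt) - left.wt
  lmean <- temp / left.wt
  rmean <- -temp / right.wt
  goodness <- (left.wt * lmean^2 + right.wt * rmean^2) / sum(wt * y^2)
  # always send "x < cutpoint" to the left side of the tree
  list(goodness = goodness, direction = rep(-1, length(goodness)))
}

y <- c(7, 3, 1, 6, 3, 4, 8, 6, 5, 3, 2, 6, 4, 6, 2, 7, 2, 4, 5, 8)
x <- c(0.428, 0.876, 1.467, 1.492, 1.703, 2.406, 2.628, 2.879, 3.025, 3.494,
       3.496, 4.623, 4.824, 4.847, 6.234, 7.041, 8.600, 9.225, 9.381, 9.986)

res <- split_always_left(y, rep(1, length(y)), x, parms = NULL, continuous = TRUE)
best <- which.max(res$goodness)
```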

Hope that helps,
Sam
