Let me give an outline of how to answer Alfredo's question via an example.
I will split the data set "lung" into two pieces. For these subjects with
advanced lung cancer, the physician's assessment of ECOG performance status
(ph.ecog) is one of the most powerful indicators of outcome. Try to
predict it from other variables.
library(survival)   # for the test data set
library(rpart)
data1 <- lung[1:125,]     # first piece: training data
data2 <- lung[126:228,]   # second piece: validation data
rfit1 <- rpart(ph.ecog ~ ., data=data1)   # default (anova) method, since ph.ecog is numeric
printcp(rfit1)
        CP nsplit rel error  xerror     xstd
1 0.565788      0   1.00000 1.04037 0.100516
2 0.098516      1   0.43421 0.44906 0.045001
3 0.042708      2   0.33570 0.35134 0.041692
4 0.031032      3   0.29299 0.37610 0.042971
5 0.019949      4   0.26196 0.37753 0.044692
6 0.010000      5   0.24201 0.39166 0.050332
# Validate using data2. First get predictions for each of the pruned trees.
cpvalues <- rfit1$cptable[,1]      # the CP values printed above
pmat <- matrix(0, nrow(data2), length(cpvalues))
for (i in 1:length(cpvalues))
    pmat[,i] <- predict(prune(rfit1, cp=cpvalues[i]), newdata=data2)
Now we need to decide on a measure of error. Try simple squared error.
error <- colMeans((data2$ph.ecog - pmat)^2)
round(error, 3)
[1] 0.493 0.280 0.210 0.225 0.186 0.198
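From here one natural (though by no means the only) rule is to prune at the CP
value whose tree has the smallest validation error; a minimal sketch using the
objects created above (the name rfit1.final is just for illustration):
best <- which.min(error)                        # tree with the smallest validation error
rfit1.final <- prune(rfit1, cp=cpvalues[best])  # prune the original fit at that CP value
printcp(rfit1.final)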
This is simple, but other cases are more complex. The performance score is
actually an integer from 0-4 (5 = dead); see
http://ecog-acrin.org/resources/ecog-performance-status
table(lung$ph.ecog)
  0   1   2   3
 63 113  50   1
Suppose instead that we fit a model treating the response as categorical?
The resulting sequence of nested models is a bit smaller.
rfit2 <- rpart(ph.ecog ~ ., data=data1, method="class")
printcp(rfit2)
       CP nsplit rel error  xerror     xstd
1 0.35938      0   1.00000 1.00000 0.086951
2 0.12500      1   0.64062 0.64062 0.081854
3 0.06250      2   0.51562 0.70312 0.083662
4 0.03125      4   0.39062 0.57812 0.079610
5 0.01000      5   0.35938 0.56250 0.078977
predict(rfit2, newdata=data2)[1:5,]
          0      1       2 3
126 0.03125 0.9375 0.03125 0
127 0.03125 0.9375 0.03125 0
128 0.03125 0.9375 0.03125 0
129 0.03125 0.9375 0.03125 0
130 0.37500 0.6250 0.00000 0
Now, we can ask for predicted probabilities for each class (default), which is a
vector
of length 4 for each subject, or for the predicted class, which is a single
value. Which
do we want, and then what is the best measure of prediction error?
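For concreteness, here is one way (certainly not the only way) to compute both
kinds of prediction on the validation set, together with one candidate error
summary for each; the object names are only for illustration, and which summary
to report is exactly the open question:
pclass <- predict(rfit2, newdata=data2, type="class")  # a single predicted class per subject
pprob  <- predict(rfit2, newdata=data2)                # the default: a matrix of class probabilities
# Candidate 1: misclassification rate -- only asks whether the chosen class is correct
mean(as.numeric(as.character(pclass)) != data2$ph.ecog, na.rm=TRUE)
# Candidate 2: a Brier-type score -- penalizes how far the probabilities fall from the truth
ymat <- sapply(0:3, function(k) as.numeric(data2$ph.ecog == k))  # 0/1 indicators of the true class
mean(rowSums((pprob - ymat)^2), na.rm=TRUE)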
If three subjects with true value 0 had predicted class-probability vectors of
(.8, .2, 0, 0), (.8, .1, .1, 0) and (.45, .25, .2, .1), one point of view would
say they are all the same (each picks 0 as the best class), while other measures
would give them different errors. Is the second prediction worse than the first?
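To make the contrast concrete, here is the arithmetic for those three
hypothetical probability vectors under the two viewpoints, using a
squared-error (Brier-type) score as one possible example of the second:
p <- rbind(c(.80, .20, .00, .00),
           c(.80, .10, .10, .00),
           c(.45, .25, .20, .10))
truth <- c(1, 0, 0, 0)         # all three subjects are truly in class 0
apply(p, 1, which.max)         # 1 1 1: every vector picks class 0, so 0/1 loss treats them alike
rowSums(sweep(p, 2, truth)^2)  # 0.080 0.060 0.415: a squared-error score separates them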
What if the single subject with ph.ecog=3 had ended up in the validation data
set; how should we judge their prediction?
This complexity is one reason that there is not a simple function for
"validation" with a new data set.
On 02/27/2017 09:48 AM, Alfredo wrote:
> Thank you, Terry, for your answer.
>
> I'll try to explain my question better. When you create a classification or
> regression tree you first grow a tree based on a splitting criterion: this
> usually results in a large tree that provides a good fit to the training data.
> The problem with this tree is its potential for overfitting the data: the tree
> can be tailored too specifically to the training data and not generalize well
> to new data. The solution (apart from cross-validation) is to find a smaller
> subtree that results in a low error rate on *holdout or validation data.*
>
> Hope this helps to clarify my question.
>
> Best,
>
> Alfredo
>
> -----Original Message-----
> From: Therneau, Terry M., Ph.D. [mailto:therneau at mayo.edu]
>
> You will need to give more detail of exactly what you mean by "prune using a
> validation set". The prune.rpart function will prune at any value you want;
> what I suspect you are looking for is to compute the error of each possible
> tree, using a validation data set, then find the best one, and then prune there.
>
> How do you define "best"?
>