Lin, Tin-Chi
2013-Oct-28 16:45 UTC
[R] high cross-validation errors ( xerror ) in regression tree?
Hi All,
This is my first time seeking help from the R forum (and also my first formal
data-mining analysis). I searched the archive a bit but didn't find responses
that fully address my question. Any comments would be highly appreciated.
I am using the rpart function to analyze factors that might contribute to an
elevated injury rate; the outcome is a continuous variable. After fitting the
initial tree and pruning it, the final tree has five terminal nodes, with
cross-validation errors shown below:
CP nsplit rel error xerror xstd
1 0.139141 0 1.00000 1.0033 0.26163
2 0.128314 1 0.86086 1.2752 0.28481
3 0.036021 3 0.60423 1.4315 0.29652
4 0.022675 4 0.56821 1.5142 0.29749
5 0.020000 5 0.54554 1.4615 0.28818
My questions are:
(1) Is this pruned tree even valid? The cross-validation error is exceedingly
high, well above 1.00, i.e., the tree predicts worse under cross-validation
than simply using the overall mean.
(2) What contributes to the high cross-validation errors (xerror), and why
does xerror first rise with the number of splits and then drop slightly?
My guess is that the data are quite noisy, so the splits were driven largely
by random noise, resulting in poor prediction. I've found that it helps a bit
to increase the minimum number of observations required to attempt a split
(minsplit) and to form a terminal node (minbucket), inevitably at the expense
of the tree's overall R-squared, but the problem still lingers.
Any thoughts?
The initial tree command and the last six lines of the output are

tree1 <- rpart(overexertion ~ ., method = "anova", data = data,
               xval = 10, minbucket = 4, minsplit = 10, cp = 0)
Root node error: 502364/347 = 1447.7
n=347 (179 observations deleted due to missingness)
> tree1$cptable[dim(tree1$cptable)[1] - 5:0, ]
CP nsplit rel error xerror xstd
43 9.769565e-05 54 0.2926771 1.641099 0.2626767
44 5.053530e-05 55 0.2925794 1.640780 0.2626771
45 4.314452e-05 56 0.2925288 1.640926 0.2626727
46 2.960797e-05 57 0.2924857 1.640925 0.2626727
47 1.570814e-05 58 0.2924561 1.640941 0.2626724
48 0.000000e+00 59 0.2924404 1.640906 0.2626728
The pruning command is

fit9 <- prune(tree1, cp = 0.02)  # set the cost-complexity parameter to 0.02
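For what it's worth, I picked cp = 0.02 by eyeballing the table; a sketch of
a more systematic choice from the cptable (minimum xerror, or the usual 1-SE
rule) would be:

cptab <- tree1$cptable
## cp at the minimum cross-validated error
i.min <- which.min(cptab[, "xerror"])
cp.min <- cptab[i.min, "CP"]
## 1-SE rule: simplest tree whose xerror is within one xstd of the minimum
thresh <- cptab[i.min, "xerror"] + cptab[i.min, "xstd"]
cp.1se <- cptab[min(which(cptab[, "xerror"] <= thresh)), "CP"]
fit.1se <- prune(tree1, cp = cp.1se)
plotcp(tree1)  # visual check of xerror against cp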
Tin-chi Lin
Liberty Mutual Research Institute for Safety
71 Frankland Rd, Hopkinton, MA 01748
Ext: 732-7466
Phone: (508)4970266
Email: tin-chi.lin@LibertyMutual.com