Lin, Tin-Chi
2013-Oct-28 16:45 UTC
[R] high cross-validation errors (xerror) in regression tree?
Hi All,

This is my first time seeking help from the R forum (and also my first formal data-mining analysis). I searched the archive a bit but didn't find responses that fully address my question. Any comments would be highly appreciated.

I am using the rpart function to analyze factors that might contribute to a heightened injury rate; the outcome is a continuous variable. After fitting the initial tree and pruning it, the final tree has five terminal nodes, with cross-validation errors shown below:

        CP nsplit rel error xerror    xstd
1 0.139141      0   1.00000 1.0033 0.26163
2 0.128314      1   0.86086 1.2752 0.28481
3 0.036021      3   0.60423 1.4315 0.29652
4 0.022675      4   0.56821 1.5142 0.29749
5 0.020000      5   0.54554 1.4615 0.28818

My questions are:

(1) Is this pruned tree even valid? The cross-validation error is exceedingly high, well above 1.00.

(2) What contributes to the high cross-validation errors (xerror), and why do they go up and then come back down a little?

My guess is that the data are quite noisy, so the splitting was driven largely by random noise, resulting in poor prediction. I have found that it helps a bit to increase the minimum number of data points required at splits and terminal nodes (at the expense of the overall R-squared of the tree, inevitably), but the problem lingers. Any thoughts?

The initial tree command and the last six lines of its cptable are:

tree1 <- rpart(overexertion ~ ., method = "anova", data = data,
               xval = 10, minbucket = 4, minsplit = 10, cp = 0)

Root node error: 502364/347 = 1447.7
n = 347 (179 observations deleted due to missingness)

> tree1$cptable[dim(tree1$cptable)[1] - 5:0, ]
             CP nsplit rel error   xerror      xstd
43 9.769565e-05     54 0.2926771 1.641099 0.2626767
44 5.053530e-05     55 0.2925794 1.640780 0.2626771
45 4.314452e-05     56 0.2925288 1.640926 0.2626727
46 2.960797e-05     57 0.2924857 1.640925 0.2626727
47 1.570814e-05     58 0.2924561 1.640941 0.2626724
48 0.000000e+00     59 0.2924404 1.640906 0.2626728

The pruning command is:

fit9 <- prune(tree1, cp = 0.02)  # set the cost-complexity parameter at 0.02

Tin-Chi Lin
Liberty Mutual Research Institute for Safety
71 Frankland Rd, Hopkinton, MA 01748
Ext: 732-7466
Phone: (508) 497-0266
Email: tin-chi.lin@LibertyMutual.com
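P.S. In case it helps to see the workflow end to end, here is a minimal, self-contained sketch. The data frame "data" and outcome "overexertion" stand in for my actual data, and the 1-SE selection at the end is a standard heuristic for picking cp from the cptable, not what produced fit9 above:

library(rpart)

# Grow a deliberately large tree (cp = 0) with 10-fold cross-validation,
# mirroring the call above.
tree1 <- rpart(overexertion ~ ., method = "anova", data = data,
               xval = 10, minbucket = 4, minsplit = 10, cp = 0)

# Plot xerror against cp; plotcp() draws a reference line at
# min(xerror) + 1 SE.
plotcp(tree1)

# 1-SE rule: take the smallest tree whose xerror is within one
# standard error of the minimum cross-validated error. When xerror
# never drops below ~1, this may select the root (nsplit = 0),
# i.e., no split cross-validates better than the overall mean.
cpt    <- tree1$cptable
i.min  <- which.min(cpt[, "xerror"])
thresh <- cpt[i.min, "xerror"] + cpt[i.min, "xstd"]
cp.1se <- cpt[which(cpt[, "xerror"] <= thresh)[1], "CP"]

fit.1se <- prune(tree1, cp = cp.1se)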