Israel Saeta Pérez
2011-Dec-31 12:34 UTC
[R] Cross-validation error with tune and with rpart
Hello list,

I'm trying to generate classifiers for a certain task using several methods, one of them being decision trees. The doubts come when I want to estimate the cross-validation error of the generated tree:

tree <- rpart(y ~ ., data = data.frame(xsel, y), cp = 0.00001)
ptree <- prune(tree,
    cp = tree$cptable[which.min(tree$cptable[, "xerror"]), "CP"])
ptree$cptable

            CP nsplit rel error xerror       xstd
1   0.33120000      0    1.0000 1.0000 0.02856022
2   0.08640000      1    0.6688 0.6704 0.02683544
3   0.02986667      2    0.5824 0.5856 0.02584564
4   0.02880000      5    0.4928 0.5760 0.02571738
5   0.01920000      6    0.4640 0.5168 0.02484761
6   0.01440000      8    0.4256 0.5056 0.02466708
7   0.00960000     12    0.3552 0.5024 0.02461452
8   0.00880000     15    0.3264 0.4944 0.02448120
9   0.00800000     17    0.3088 0.4768 0.02417800
10  0.00480000     25    0.2448 0.4672 0.02400673

If I got it right, "xerror" stands for the cross-validation error (10-fold by default), and it is pretty high here (0.4672 over 1). However, if I do something similar with tune from e1071 I get a much lower error:

## treeClassPrediction is a custom prediction function (definition not shown)
treetune <- tune(rpart, y ~ ., data = data.frame(xsel, y),
                 predict.func = treeClassPrediction, cp = 0.0048)
treetune$best.performance
[1] 0.2243049

I'm also assuming that the performance returned by tune is the cross-validation error (also 10-fold by default). So where does this enormous difference come from? What am I missing?

Also, is "rel error" the relative error on the training set? The documentation is not very descriptive:

cptable: the table of optimal prunings based on a complexity parameter.

Thanks and happy pre-new year,

-- 
israel
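[Editor's note: the poster's xsel/y data and the treeClassPrediction helper are not attached, so the comparison cannot be reproduced exactly. A minimal sketch of the same two computations, using the kyphosis data that ships with rpart and a stand-in for treeClassPrediction, would look something like:

library(rpart)
library(e1071)

## Grow a deliberately oversized tree, then prune at the cp that
## minimises the cross-validated relative error, as in the post.
tree <- rpart(Kyphosis ~ ., data = kyphosis, cp = 0.00001)
best.cp <- tree$cptable[which.min(tree$cptable[, "xerror"]), "CP"]
ptree <- prune(tree, cp = best.cp)
ptree$cptable

## Stand-in for the unshown treeClassPrediction helper: rpart's
## predict() returns class probabilities by default for a
## classification tree, so return hard class labels instead.
treeClassPrediction <- function(model, newdata)
  predict(model, newdata, type = "class")

## Cross-validate rpart at the chosen cp with tune() from e1071
## (no ranges argument: tune just 10-fold cross-validates this
## single configuration and reports the mean misclassification rate).
treetune <- tune(rpart, Kyphosis ~ ., data = kyphosis,
                 predict.func = treeClassPrediction, cp = best.cp)
treetune$best.performance
]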
Prof Brian Ripley
2011-Dec-31 14:13 UTC
[R] Cross-validation error with tune and with rpart
On 31/12/2011 12:34, Israel Saeta Pérez wrote:
> Hello list,
>
> I'm trying to generate classifiers for a certain task using several
> methods, one of them being decision trees. The doubts come when I want to
> estimate the cross-validation error of the generated tree:
>
> tree <- rpart(y~., data=data.frame(xsel, y), cp=0.00001)
> ptree <- prune(tree,
>     cp=tree$cptable[which.min(tree$cptable[,"xerror"]),"CP"])
> ptree$cptable
>
>             CP nsplit rel error xerror       xstd
> 1   0.33120000      0    1.0000 1.0000 0.02856022
> 2   0.08640000      1    0.6688 0.6704 0.02683544
> 3   0.02986667      2    0.5824 0.5856 0.02584564
> 4   0.02880000      5    0.4928 0.5760 0.02571738
> 5   0.01920000      6    0.4640 0.5168 0.02484761
> 6   0.01440000      8    0.4256 0.5056 0.02466708
> 7   0.00960000     12    0.3552 0.5024 0.02461452
> 8   0.00880000     15    0.3264 0.4944 0.02448120
> 9   0.00800000     17    0.3088 0.4768 0.02417800
> 10  0.00480000     25    0.2448 0.4672 0.02400673
>
> If I got it right, "xerror" stands for the cross-validation error (using
> 10-fold by default), this is pretty high (0.4672 over 1).

You didn't get it right. Please read the documentation, or contemplate why the first line is exactly one.

In any case, that table is not about error rates for the final tree: it is part of the model selection step. (To cross-validate the final tree you would need to include the choice of pruning inside the cross-validation.)

Did you look up the rpart technical report or one of the books explaining its output? Google 'rpart technical report' if you need to find it.

[...]

-- 
Brian D. Ripley,                  ripley at stats.ox.ac.uk
Professor of Applied Statistics,  http://www.stats.ox.ac.uk/~ripley/
University of Oxford,             Tel:  +44 1865 272861 (self)
1 South Parks Road,                     +44 1865 272866 (PA)
Oxford OX1 3TG, UK                Fax:  +44 1865 272595
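[Editor's note: a sketch of the nested scheme Prof. Ripley describes, where the cp selection (pruning) is repeated inside every fold so the held-out error reflects the whole model-building procedure. The kyphosis data from rpart again stand in for the unavailable original data:

library(rpart)

set.seed(1)
dat <- kyphosis
K <- 10
fold <- sample(rep(1:K, length.out = nrow(dat)))

err <- numeric(K)
for (k in 1:K) {
  train <- dat[fold != k, ]
  test  <- dat[fold == k, ]
  ## grow and prune using the training fold only; the cp choice is
  ## driven by the *inner* cross-validation recorded in cptable
  fit <- rpart(Kyphosis ~ ., data = train, cp = 0.00001)
  best.cp <- fit$cptable[which.min(fit$cptable[, "xerror"]), "CP"]
  pfit <- prune(fit, cp = best.cp)
  ## score the pruned tree on the held-out fold
  pred <- predict(pfit, test, type = "class")
  err[k] <- mean(pred != test$Kyphosis)
}
mean(err)   # absolute misclassification rate, comparable to tune()

Note also the hint about the first row of the table being exactly one: cptable's "rel error" and "xerror" columns are scaled relative to the error of the root (no-split) tree, so an xerror of 0.4672 is a fraction of the baseline error rate, not an absolute rate, which is consistent with tune() reporting a much smaller absolute figure.]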