An embedded text encoded in an unknown character set was scrubbed... Name: not available URL: <https://stat.ethz.ch/pipermail/r-help/attachments/20110704/68ecf4d2/attachment.pl>
One way to avoid hacking rpart is to write code outside rpart that draws the K-fold samples by sampling unit, then builds trees on the training sets and summarizes their scores on the testing sets.

Weidong Gu

On Mon, Jul 4, 2011 at 9:22 AM, Katerine Goyer <katerine.goyer at uqtr.ca> wrote:
> Hello,
>
> I am using the rpart function (from the rpart package) to do a regression
> tree that would describe the behaviour of a fish species according to
> several environmental variables. For each fish (sampling unit), I have
> repeated observations of the response variable, which means that the data
> are not independent. Normally, in this case, V-fold cross-validation needs
> to be modified to prevent over-optimistic predictions of error rates by
> cross-validation and overestimation of the tree size. A way to overcome
> this problem is by selecting only whole sampling units in our subsets of
> cross-validation. My problem is that I don't know how to perform this
> modification of the cross-validation process in the rpart function.
>
> Is there a way to do this modification in rpart or is there any other
> function I could use that would consider interdependence in the response
> variable?
>
> Here is an example of the code I am using ("Y" being the response variable
> and "data.env" being a data frame of the environmental variables):
>
> Tree = rpart(Y ~ X1 + X2 + X3,xval=100,data=data.env)
>
> Thanks
>
> Katerine
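
A minimal sketch of the approach Weidong Gu describes, under stated assumptions: the data are simulated, the column name fish (the sampling unit), the choice of K = 5 folds and the mean squared error summary are illustrative and do not come from the thread. Whole fish are assigned to folds, a tree is fitted on the remaining folds, and the held-out fish are predicted.

```r
library(rpart)

set.seed(1)  # reproducible simulated data and fold assignment

# Simulated stand-in for the real data: 20 fish, 10 observations each.
# 'fish' is a hypothetical column identifying the sampling unit.
n_fish <- 20
fish <- rep(seq_len(n_fish), each = 10)
fish_effect <- rnorm(n_fish)[fish]   # shared by all rows of the same fish
X1 <- rnorm(length(fish))
X2 <- rnorm(length(fish))
X3 <- rnorm(length(fish))
Y <- 2 * (X1 > 0) + fish_effect + rnorm(length(fish), sd = 0.5)
data.env <- data.frame(fish, X1, X2, X3, Y)

# Assign whole fish (not single rows) to K folds.
K <- 5
fold_of_fish <- sample(rep(seq_len(K), length.out = n_fish))
fold <- fold_of_fish[data.env$fish]

# Fit on K - 1 folds, predict the held-out fold, keep the predictions.
pred <- rep(NA_real_, nrow(data.env))
for (k in seq_len(K)) {
  test <- fold == k
  fit <- rpart(Y ~ X1 + X2 + X3, data = data.env[!test, ])
  pred[test] <- predict(fit, newdata = data.env[test, ])
}

# Summarize the out-of-sample error over all held-out fish.
cv_mse <- mean((data.env$Y - pred)^2)
cv_mse
```

Because every row of a fish shares one fold, no fish contributes to both the training and the testing set of the same split, which is the property the original question asks for.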
On Mon, Jul 04, 2011 at 09:22:23AM -0400, Katerine Goyer wrote:
> Hello,
>
> I am using the rpart function (from the rpart package) to do a regression
> tree that would describe the behaviour of a fish species according to
> several environmental variables. For each fish (sampling unit), I have
> repeated observations of the response variable, which means that the data
> are not independent. Normally, in this case, V-fold cross-validation needs
> to be modified to prevent over-optimistic predictions of error rates by
> cross-validation and overestimation of the tree size. A way to overcome
> this problem is by selecting only whole sampling units in our subsets of
> cross-validation. My problem is that I don't know how to perform this
> modification of the cross-validation process in the rpart function.
>
> Is there a way to do this modification in rpart or is there any other
> function I could use that would consider interdependence in the response
> variable?
>
> Here is an example of the code I am using ("Y" being the response variable
> and "data.env" being a data frame of the environmental variables):
>
> Tree = rpart(Y ~ X1 + X2 + X3,xval=100,data=data.env)

Hello.

It may be necessary to program the cross-validation at the R level using package tree, whose fitting function tree() does not run cross-validation itself. An example is as follows:

    library(tree)

    # Simulated data: Y follows X2 when X1 > 0, otherwise X3
    X1 <- rnorm(200)
    X2 <- rnorm(200)
    X3 <- rnorm(200)
    Y <- ifelse(X1 > 0, X2, X3)
    data.env <- data.frame(X1, X2, X3, Y)

    # Group label of each row; length(ind) == nrow(data.env)
    ind <- rep(1:7, times=c(20, 30, 35, 30, 30, 25, 30))

    # Leave out one group at a time and predict it from the other groups
    pred <- rep(NA, times=nrow(data.env))
    for (i in unique(ind)) {
        Tree <- tree(Y ~ X1 + X2 + X3, data=data.env[ind != i, ])
        PrunedTree <- prune.tree(Tree, best = 10)
        pred[ind == i] <- predict(PrunedTree, newdata=data.env[ind == i, ])
    }

    # Observed values against the cross-validated predictions
    plot(data.env$Y, pred, asp=1)

The vector ind should be prepared so that all occurrences of the same fish have the same value. See ?tree and ?prune.tree for further parameters.
Also consider the randomForest package, which may be more accurate, although it does not provide a comprehensible model.

Hope this helps.

Petr Savicky.
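
As an illustration of how the vector ind from the example above might be prepared from real data, here is a small sketch. The identifiers in fish_id are invented for the example; the thread does not say how the fish are labelled.

```r
# Hypothetical fish identifiers, one entry per observation (row).
fish_id <- c("A12", "A12", "B07", "A12", "C33", "B07", "C33", "C33")

# One integer per fish; every row of the same fish gets the same value.
ind <- as.integer(factor(fish_id))

# Check: each fish maps to exactly one value of ind.
stopifnot(all(tapply(ind, fish_id, function(v) length(unique(v))) == 1))

# Cross-tabulation of fish against group label, for a visual check.
table(fish_id, ind)
```

With one group per fish, the loop leaves out a single fish at a time. If there are many fish, several fish can share one label, as long as no fish is split across labels.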
> Is there a way to do this modification in rpart or is there any other
> function I could use that would consider interdependence in the
> response variable?

This feature already exists: the "xval" option can be a vector of integers that defines the "left out" groups. First all the 1s are left out, then the 2s, then the 3s, etc. Unfortunately this was overlooked in the documentation for rpart.control; it's in the documentation for xpred.rpart though. I'll fix this soon as part of some other updates.

Terry Therneau
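
A short sketch of the "xval" vector described above, on simulated data. The fish column, the number of fish and the way fish are spread over 10 groups are assumptions made for illustration; only the use of an integer vector for xval in rpart and xpred.rpart comes from the message.

```r
library(rpart)

set.seed(1)  # reproducible simulated data

# Simulated stand-in: 20 fish, 10 observations each ('fish' is hypothetical).
n_fish <- 20
fish <- rep(seq_len(n_fish), each = 10)
X1 <- rnorm(length(fish))
X2 <- rnorm(length(fish))
X3 <- rnorm(length(fish))
Y <- 2 * (X1 > 0) + rnorm(n_fish)[fish] + rnorm(length(fish), sd = 0.5)
data.env <- data.frame(X1, X2, X3, Y)

# One integer per row, numbered 1, 2, ..., identical for all rows of a fish.
# Here the 20 fish are spread over 10 groups, two whole fish per group.
ind <- as.integer((fish - 1) %% 10 + 1)

# Cross-validation inside rpart now leaves out group 1, then group 2, etc.
fit <- rpart(Y ~ X1 + X2 + X3, data = data.env,
             control = rpart.control(xval = ind))
printcp(fit)  # the xerror column is based on the grouped folds

# Cross-validated predictions using the same grouping.
xp <- xpred.rpart(fit, xval = ind)
dim(xp)
```

The vector must have one entry per row of the data, so it is safest to build it from the same data frame that is passed to rpart.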