James Jong
2013-Feb-19 22:27 UTC
[R] CARET. Relationship between data splitting and trainControl
I have carefully read the CARET documentation at http://caret.r-forge.r-project.org/training.html and the vignettes, and everything is quite clear (the examples on the website help a lot!), but I am still a bit confused about the relationship between two arguments to trainControl, "method" and "index", and about the interplay between trainControl and the data splitting functions in caret (e.g. createDataPartition, createResample, createFolds and createMultiFolds).

To better frame my questions, let me use the following example from the documentation:

*************************************
data(BloodBrain)
set.seed(1)
tmp <- createDataPartition(logBBB,p = .8, times = 100)
trControl = trainControl(method = "LGOCV", index = tmp)
ctreeFit <- train(bbbDescr, logBBB, "ctree",trControl=trControl)
*************************************

My questions are:

1) If I use createDataPartition (which I assume does stratified bootstrapping), as in the above example, and I pass the result as index to trainControl, do I need to use "LGOCV" as the method in my call to trainControl? If I use another one (e.g. "cv"), what difference would it make? In my head, once you fix index, you are fixing the type of cross-validation, so I am not sure what role method plays if you use index.

2) What is the difference between createDataPartition and createResample? Is it that createDataPartition does stratified bootstrapping, while createResample doesn't?

3) How can I do **stratified** k-fold (e.g. 10-fold) cross-validation using caret? Would the following do it?

*************************************
tmp <- createFolds(logBBB, k=10, list=TRUE, times = 100)
trControl = trainControl(method = "cv", index = tmp)
ctreeFit <- train(bbbDescr, logBBB, "ctree",trControl=trControl)
*************************************

Thanks so much in advance. CARET is a fantastic package and I am eager to learn how to use it properly.

~James
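
For reference, the index sets returned by the four splitting functions named above can be inspected directly before they are handed to trainControl. The following is only a minimal sketch, assuming the caret package is installed and using small "times" values to keep it quick; the variable names (parts, boots, folds, multi, ctrl) are made up for illustration and are not from the documentation. As far as I understand the documented signatures, createFolds has no "times" argument (createMultiFolds does), and the "index" argument expects the rows used for training, which is why returnTrain = TRUE is set below.

```r
library(caret)

data(BloodBrain)   # provides bbbDescr and logBBB
set.seed(1)

# Random 80% training splits, sampled without replacement.
parts <- createDataPartition(logBBB, p = 0.8, times = 3)

# Bootstrap samples: rows drawn with replacement, same size as the data.
boots <- createResample(logBBB, times = 3)

# One round of 10 folds. By default createFolds returns the held-out rows;
# returnTrain = TRUE returns the training rows instead.
folds <- createFolds(logBBB, k = 10, list = TRUE, returnTrain = TRUE)

# Repeated 10-fold CV: 3 repeats of 10 folds, returned as training rows.
multi <- createMultiFolds(logBBB, k = 10, times = 3)

# Compare what each function produced.
print(anyDuplicated(parts[[1]]))       # 0 means no row appears twice
print(anyDuplicated(boots[[1]]) > 0)   # duplicates are expected in a bootstrap sample
print(lengths(list(parts = parts, boots = boots, folds = folds, multi = multi)))

# Assumption: "repeatedcv" is the method label that matches these indices.
ctrl <- trainControl(method = "repeatedcv", number = 10, repeats = 3,
                     index = multi)
```

Printing names(multi) shows one entry per fold and repeat, which makes it easy to confirm how many resamples train would actually run.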