Martin Guetlein
2012-Jan-13 08:49 UTC
[R] how to create stratified (cross-validation) partitions according to numerical features
Hi all,

I want to split a dataset into k cross-validation partitions (folds). The content of the folds should be stratified, not according to a single (categorical) feature, but according to a range of features (numeric, or if possible both numeric and categorical).

Does anybody know a way to do this?

I only found a way to do this for a single split (training-test split) with the package sampling. I will paste the example code for the training-test split below to make clear what I am looking for.

With best regards,
Martin

example code:

    library("sampling")
    # skipping the iris class column, as this method only works for
    # numerical features, but that's ok
    data <- as.matrix( iris[1:4] )
    prob <- 0.3  # probability to be selected into test set
    samplecube(data, pik=rep(prob, times=nrow(data)), order=2)

output:

    [...]
    QUALITY OF BALANCING
                 TOTALS HorvitzThompson_estimators Relative_deviation
    Sepal.Length  876.5                   874.6667        -0.20916524
    Sepal.Width   458.6                   458.3333        -0.05814799
    Petal.Length  563.7                   563.3333        -0.06504642
    Petal.Width   179.9                   178.6667        -0.68556606
      [1] 0 1 0 0 1 0 0 0 1 0 1 1 0 0 0 1 0 0 0 0 0 0 0 0 0 0 1 0 0 1 0 0 0 1 0 0 1
     [38] 0 0 1 0 1 1 0 0 0 1 0 1 0 0 1 0 0 1 0 0 0 1 0 0 1 1 0 0 0 1 0 0 0 0 0 0 0
     [75] 0 0 1 1 1 0 0 0 0 0 0 0 1 1 1 0 1 0 1 0 0 0 1 0 0 0 1 0 0 0 0 0 0 0 1 0 1
    [112] 0 0 0 1 0 0 1 0 1 0 0 0 0 1 0 1 0 1 0 1 1 0 0 0 0 0 0 0 0 1 0 1 0 0 0 0 1
    [149] 0 0

--
Dipl-Inf. Martin Gütlein
Phone: +49 (0)761 203 7633 (office)
       +49 (0)177 623 9499 (mobile)
Email: guetlein at informatik.uni-freiburg.de
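
To make the target of the question concrete, here is a rough sketch of what a k-fold analogue of the single split above might look like. It does not use the cube method from the sampling package. Instead it swaps in a simpler base-R heuristic: rows are sorted by their score on the first principal component of the standardized features, and the fold labels 1..k are dealt out at random within each consecutive block of k sorted rows. The function name stratified_folds and its seed argument are made up for illustration. The sketch balances the folds only along that one direction and handles numeric features only, so it should be read as a starting point and not as a full answer.

```r
# Hypothetical helper (not part of any package): assign each row of a numeric
# matrix to one of k folds so that every fold spans the full range of the
# first principal component of the features.
stratified_folds <- function(x, k, seed = 1) {
  set.seed(seed)
  # Score of each row on the first principal component (features standardized).
  score <- prcomp(x, center = TRUE, scale. = TRUE)$x[, 1]
  ord <- order(score)
  n <- nrow(x)
  folds <- integer(n)
  # Walk through the sorted rows in blocks of k and give each block a random
  # permutation of the fold labels 1..k (the last block may be shorter).
  for (start in seq(1, n, by = k)) {
    idx <- ord[start:min(start + k - 1, n)]
    folds[idx] <- sample(k)[seq_along(idx)]
  }
  folds
}

data <- as.matrix(iris[1:4])
folds <- stratified_folds(data, k = 5)

# Fold sizes, and per-fold feature means as a quick balance check.
print(table(folds))
print(round(apply(data, 2, function(col) tapply(col, folds, mean)), 2))
```

For fold i, the test set would be data[folds == i, ] and the training set data[folds != i, ]. If the per-fold means printed at the end differ too much on some feature, that is a sign that one principal component is not enough to stratify on.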