I have a binary classification problem where the fraction of positives is
very low, e.g. 20 positives in 10,000 examples (0.2%).
What is an appropriate cross-validation scheme for training a classifier
with very few positives?
I currently have the following setup:
=======================================
library(caret)

tmp <- createDataPartition(Y, p = 9/10, times = 3, list = TRUE)
myCtrl <- trainControl(method = "boot", index = tmp, timingSamps = 2,
                       classProbs = TRUE, summaryFunction = twoClassSummary)

RFmodel  <- train(X, Y, method = "rf", trControl = myCtrl, tuneLength = 1,
                  metric = "ROC")
SVMmodel <- train(X, Y, method = "svmRadial", trControl = myCtrl, tuneLength = 3,
                  metric = "ROC")
KNNmodel <- train(X, Y, method = "knn", trControl = myCtrl, tuneLength = 10,
                  metric = "ROC")
NNmodel  <- train(X, Y, method = "nnet", trControl = myCtrl, tuneLength = 3,
                  trace = FALSE, metric = "ROC")
=======================================
But I am not getting good performance: my ROC (AUC) values are below 0.7
for all of the classifiers above. Any thoughts?
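
One quick sanity check on this setup is to count how many positives end up
in each hold-out set. The sketch below uses simulated labels, since the real
Y is not shown; the name Y_sim and the level names "neg" and "pos" are
assumptions for illustration only. The createDataPartition() call is the same
as in the setup above.

```r
library(caret)

set.seed(1)

# Simulated stand-in for the real Y (assumption): 20 positives in 10,000 examples
Y_sim <- factor(c(rep("pos", 20), rep("neg", 9980)), levels = c("neg", "pos"))

# Same partitioning call as in the setup above; each list element holds
# the row indices of one training partition
parts <- createDataPartition(Y_sim, p = 9/10, times = 3, list = TRUE)

# Count the positives that are left out of each training partition
holdout_pos <- sapply(parts, function(idx) sum(Y_sim[-idx] == "pos"))
print(holdout_pos)
```

If these counts are as small as the 0.2% positive rate suggests, each
resampled ROC estimate rests on only a handful of positive cases.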
Thanks,
James