Yes, that's true. On a test set, the highest predicted probability of
belonging to the smaller class is about 40%. (Incidentally, test-set accuracy
is much higher when I use the best-according-to-Kappa model instead of the
best-according-to-Accuracy model.)
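(For reference, I compared them roughly like this; model.kappa, model.acc,
test.x, and test.y are stand-in names for the two train() fits and the
held-out data:)

library(caret)

# confusionMatrix() reports per-class sensitivity/specificity as well as
# overall Accuracy and Kappa, so the two fits can be compared on the same
# held-out test set.
confusionMatrix(predict(model.kappa, test.x), test.y)
confusionMatrix(predict(model.acc, test.x), test.y)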
It looks like the ctree() method supports case weights, but all that seems to
do is rescale the class likelihoods, which isn't what I want. (That is, if I
assign a weight of 2 to every small-class instance, it generates the same
model, but reports a likelihood of about 80% instead of 40% for the
most-confident instances!)
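(For concreteness, the weighted fit looked roughly like this; 'small' is a
stand-in for the actual minority-class label:)

library(party)

# ctree() accepts non-negative integer case weights, so a weight of 2 on
# every minority-class row is equivalent to duplicating those rows.
df <- cbind(dat.dts, Class = dat.dts.class)
w <- ifelse(df$Class == 'small', 2L, 1L)
fit <- ctree(Class ~ ., data = df, weights = w)
treeresponse(fit, newdata = df)  # per-observation class probabilities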
I'm still not really understanding why Kappa isn't acting like a
monotonically increasing function of Accuracy, though.
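(A toy calculation does at least convince me that Kappa isn't determined by
Accuracy alone; these are made-up 2x2 confusion matrices, rows = predicted,
columns = observed, with the same accuracy and the same observed base rate:)

# Cohen's Kappa: (p_o - p_e) / (1 - p_e), where the chance-agreement term
# p_e is built from the row and column marginals jointly, not from the
# observed base rate alone.
kappa2 <- function(tab) {
  n <- sum(tab)
  p_o <- sum(diag(tab)) / n
  p_e <- sum(rowSums(tab) * colSums(tab)) / n^2
  (p_o - p_e) / (1 - p_e)
}

# Both matrices have accuracy 0.96 and a 4% observed minority, but Kappa is
# ~0.48 for the first and exactly 0 for the second, which predicts the
# majority class everywhere.
a <- matrix(c(20, 20, 20, 940), 2, 2)
b <- matrix(c(0, 40, 0, 960), 2, 2)
sapply(list(a, b), kappa2)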
Thanks!
On Wed, Jun 22, 2011 at 8:12 PM, kuhnA03 <max.kuhn@pfizer.com> wrote:
> Harlan,
>
> It looks like your model is predicting (almost) everything to be the
> majority class (accuracy is almost the same as the largest class
> percentage). Try setting a test set aside and use confusionMatrix to look at
> how the model is predicting in more detail. You can try other models that
> will let you weight the minority class higher to get a more balanced
> prediction.
>
> Max
>
> On 6/22/11 3:37 PM, "Harlan Harris" <harlan@harris.name> wrote:
>
> Hello,
>
> When evaluating different learning methods for a categorization problem
> with the (really useful!) caret package, I'm getting confusing results from
> the Kappa computation. The data is about 20,000 rows and a few dozen
> columns, and the categories are quite asymmetrical, 4.1% in one category and
> 95.9% in the other. When I train a ctree model as:
>
> model <- train(dat.dts, dat.dts.class,
>                method = 'ctree',
>                tuneLength = 8,
>                trControl = trainControl(number = 5, workers = 1),
>                metric = 'Kappa')
>
> I get the following puzzling numbers:
>
> mincriterion  Accuracy   Kappa   Accuracy SD  Kappa SD
>         0.01     0.961  0.0609       0.00151    0.0264
>         0.15     0.962  0.049        0.00116    0.0248
>         0.29     0.963  0.0405       0.00227    0.035
>         0.43     0.964  0.0349       0.00257    0.0247
>         0.57     0.964  0.0382       0.0022     0.0199
>         0.71     0.964  0.0354       0.00255    0.0257
>         0.85     0.964  0.036        0.00224    0.024
>         0.99     0.965  0.0091       0.00173    0.0203
>
> (mincriterion sets the threshold, in terms of 1 - p-value, that a candidate
> split must exceed before it is added to the tree.) The Accuracy numbers look
> sorta reasonable, if not great; the model overfits and barely beats the base
> rate if it builds a complicated tree. But the Kappa numbers go in the
> opposite direction, and here's where I'm not sure what's going on. The
> examples in the vignette show Accuracy and Kappa being positively
> correlated. I thought Kappa was just (Accuracy - baserate)/(1 - baserate),
> but the reported Kappa is definitely not that.
>
> Suggestions? Aside from looking for a better model, which would be good
> advice here, what metric would you recommend? Thank you!
>
> -Harlan