thr3ads.net - R help - [R] caret package: custom summary function in trainControl doesn't work with oob? [Apr 2012]

Matthew Francis

2012-Apr-13 03:12 UTC

[R] caret package: custom summary function in trainControl doesn't work with oob?

Hi all,

I've been using a custom summary function to optimise regression models
with the caret package, using the default bootstrap resampling method, and
this has worked smoothly. For bagging models (specifically randomForest in
this case) caret can, in theory, use the out-of-bag (oob) error estimate
from the model instead of resampling, which (in theory) is largely
redundant for such models. Since these models take a while to build in the
first place, estimating performance with the bootstrap really slows things
down.

I can successfully run either using the oob 'resampling method' with the
default RMSE optimisation, or run using bootstrap and my custom
summaryFunction as the thing to optimise, but they don't work together. If
I try and use oob and supply a summaryFunction caret throws an error saying
it can't find the relevant metric.

Now, if caret is simply polling the randomForest object for the stored oob
error I can understand this limitation, but in the case of randomForest
(and probably other bagging methods?) the training function can be asked to
return information about the individual tree predictions and whether data
points were oob in each case. With this information you can reconstruct an
oob 'error' using whatever function you choose to target for optimisation.
As far as I can tell, caret is not doing this and I can't see anywhere that
it can be coerced to do so.

Have I missed something? Can anyone suggest how this could be achieved? It
wouldn't be *that* hard to code up something that essentially operates in
the same way as caret.train but can handle this feature for bagging models,
but if it is already there and I've missed something please let me know.

Thanks.
Matt Francis
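
For reference, the two working configurations described in this message can be sketched as follows. This is a minimal illustration, not the poster's actual code: the `mtcars` data, the `medaeSummary` function and its `MedAE` metric name are assumptions chosen for the example, and `ntree`/`number` are kept small only so that it runs quickly.

```r
library(caret)          # train(), trainControl()
library(randomForest)   # backend used by method = "rf"

set.seed(1)

# Hypothetical custom summary function: median absolute error.
# caret calls it with a data frame containing columns `obs` and `pred`.
medaeSummary <- function(data, lev = NULL, model = NULL) {
  c(MedAE = median(abs(data$obs - data$pred)))
}

# Configuration 1: bootstrap resampling scored with the custom metric.
ctrl_boot <- trainControl(method = "boot", number = 5,
                          summaryFunction = medaeSummary)
fit_boot <- train(mpg ~ ., data = mtcars, method = "rf",
                  metric = "MedAE", maximize = FALSE,
                  tuneLength = 2, ntree = 100,
                  trControl = ctrl_boot)

# Configuration 2: out-of-bag estimate scored with the default RMSE.
ctrl_oob <- trainControl(method = "oob")
fit_oob <- train(mpg ~ ., data = mtcars, method = "rf",
                 metric = "RMSE",
                 tuneLength = 2, ntree = 100,
                 trControl = ctrl_oob)

fit_boot$results   # one row per mtry value, scored by MedAE
fit_oob$results    # one row per mtry value, scored by OOB RMSE
```

Combining `method = "oob"` with the custom `summaryFunction` and `metric = "MedAE"` is the case reported above as failing.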


Max Kuhn

2012-Apr-13 16:53 UTC

[R] caret package: custom summary function in trainControl doesn't work with oob?

Matt,
> I've been using a custom summary function to optimise regression model
> methods using the caret package. This has worked smoothly. I've been using
> the default bootstrapping resampling method. For bagging models
> (specifically randomForest in this case) caret can, in theory, use the
> out-of-bag (oob) error estimate from the model instead of resampling, which
> (in theory) is largely redundant for such models. Since they take a while
> to build in the first place, it really slows things down when estimating
> performance using bootstrap.
>
> I can successfully run either using the oob 'resampling method' with the
> default RMSE optimisation, or run using bootstrap and my custom
> summaryFunction as the thing to optimise, but they don't work together. If
> I try and use oob and supply a summaryFunction caret throws an error saying
> it can't find the relevant metric.
>
> Now, if caret is simply polling the randomForest object for the stored oob
> error I can understand this limitation
That is exactly what it does. See caret:::rfStats (not a public function)

train() was written to be fairly general and this level of control
would be very difficult to implement, especially since each model that
does some type of bagging uses different internal structures etc.
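
To illustrate what "polling the stored oob error" means, here is a minimal sketch that reads the OOB statistics a regression randomForest object already carries. It is not the code of caret:::rfStats, which is internal and not reproduced here; the `mtcars` data and the variable names are assumptions for the example.

```r
library(randomForest)

set.seed(1)
rf <- randomForest(mpg ~ ., data = mtcars, ntree = 200)

# For regression fits, rf$mse and rf$rsq are vectors of length ntree
# holding the OOB mean squared error and pseudo R-squared after
# 1, 2, ..., ntree trees. The last element describes the full forest.
oob_rmse <- sqrt(rf$mse[rf$ntree])
oob_rsq  <- rf$rsq[rf$ntree]

c(RMSE = oob_rmse, Rsquared = oob_rsq)
```

Only these pre-computed summaries are stored, which is why an arbitrary metric cannot be read off the fitted object in the same way.
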
> but in the case of randomForest
> (and probably other bagging methods?) the training function can be asked to
> return information about the individual tree predictions and whether data
> points were oob in each case. With this information you can reconstruct an
> oob 'error' using whatever function you choose to target for optimisation.
> As far as I can tell, caret is not doing this and I can't see anywhere that
> it can be coerced to do so.
It will not be able to do this. I'm not sure that you can either.
randomForest() will return the individual trees and
predict.randomForest() can return the per-tree results, but I don't
know if it saves the indices that tell you which bootstrap samples
contained which training set points. Perhaps Andy would know.
> Have I missed something? Can anyone suggest how this could be achieved? It
> wouldn't be *that* hard to code up something that essentially operates in
> the same way as caret.train but can handle this feature for bagging models,
> but if it is already there and I've missed something please let me know.
Well, everything is easy for the person not doing it =]

If you save the proximity measures, you might gain the sampling
indices. With these, you would use predict.randomForest(...,
predict.all=TRUE) to get the individual predictions.

Max
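
A minimal sketch of the reconstruction discussed in this thread, under one assumption not made in the messages above: it takes the in-bag indicators from randomForest's `keep.inbag = TRUE` argument rather than from the proximity measures. The `mtcars` data and the median-absolute-error metric are illustrative choices only.

```r
library(randomForest)

set.seed(1)
x <- mtcars[, -1]   # predictors
y <- mtcars$mpg     # response

# keep.inbag = TRUE stores an n x ntree matrix in rf$inbag;
# an entry of zero means that row was out-of-bag for that tree.
rf <- randomForest(x, y, ntree = 200, keep.inbag = TRUE)

# Per-tree predictions for every training row (an n x ntree matrix).
per_tree <- predict(rf, newdata = x, predict.all = TRUE)$individual

# Blank out predictions from trees that saw the row during fitting,
# then average the remaining (out-of-bag) predictions for each row.
per_tree[rf$inbag != 0] <- NA
oob_pred <- rowMeans(per_tree, na.rm = TRUE)

# Any metric can now be computed on the OOB predictions,
# for example the median absolute error.
oob_medae <- median(abs(y - oob_pred), na.rm = TRUE)
oob_medae
```

As a sanity check, `oob_pred` should closely match the OOB predictions stored in `rf$predicted`. With very few trees a row may never be out-of-bag, in which case its `oob_pred` is `NaN`; the `na.rm = TRUE` in the metric skips such rows.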
