Max Kuhn
2012-Jan-04 14:19 UTC
[Rd] informal conventions/checklist for new predictive modeling packages
Working on the caret package has exposed me to the wide variety of approaches that different authors have taken to creating predictive modeling functions (aka machine learning, aka pattern recognition).

I suspect that many package authors are neophyte R users and are stumbling through the process of writing their first R package (or R code). As such, they may not have been exposed to some of the informal conventions that have evolved over time. Also, their package may be intended to demonstrate their research and not for "production" modeling. In any case, it might be a good idea to write up a few points for consideration when creating a predictive modeling package. I don't propose changes to existing code.

Some of this is obvious and not limited to this class of modeling packages. Many of these points are arguable, so please argue them.

If this seems useful, perhaps we could repost the final list to R-Help to use as a checklist.

Those of you who have used my code will probably realize that I am not a grand architect of R packages =] I'd love to get feedback from those of you with a broader perspective and better software engineering skills than I (a low bar to step over).

I have marked a few of these items with an OCD tag since I might be taking it a bit too far.

The list:

(1) Extend the work of others. There is an amazing amount of unneeded redundancy. There are plenty of times that users implement their own version of a function because there is a missing feature, but a lot of time is spent re-creating duplicate functions. For example, kernlab has an excellent set of kernel functions that are really efficient and have useful ancillary functions. People may not be aware of these functions, but they are one RSiteSearch away. (Perhaps we could nominate a few packages like kernlab that implement a specific tool well.)

(2) When modeling a categorical outcome, use a factor as input (as opposed to 0/1 indicators or integers). Factors are exactly the kind of feature that separates R from other languages (I'm looking at you, SAS) and are a natural structure for this data type. (A sketch of (2)-(3a) follows the list.)

corollary (2a): save the factor levels in the model object somewhere

corollary (2b): return predicted classes as factors with the same levels (and ordering of levels)

(3) Implement a separate prediction function. Some packages only make predictions when the model is built, so effectively the model cannot be used at any point in the future.

corollary (3a): use object-orientation (eg. predict.{class}) and not some made-up function name "modelPredict()" for predicting new samples

(4) If the method only accepts a specific type of input (eg. matrix or data frame), please do the conversion whenever appropriate.

(5) Provide a formula interface (eg. foo(y ~ x, data = dat)) and a non-formula interface (foo(x, y)) to the function. Formula methods are really inefficient at this time for high-dimensional data but are fantastically convenient. There are some good reasons not to use formulas, such as functions that do not use a design matrix (eg. cforest()) or that need factors to be handled in a non-standard way (eg. cubist()). (See the sketch of (4)-(5) after the list.)

(6) Don't require a test set when model building.

(7) Control all written output during model-building time with a verbose option. Resampling can make a mess out of things if output/logging is always exposed.

(8) Please use RSiteSearch to avoid name collisions between packages (eg. gam(), splsda(), roc(), LogitBoost()). Also search Bioconductor.
(9) Allow the predict function to generate results from many different sub-models simultaneously. For example, pls() can return predictions across many values of ncomp. enet(), cubist(), and blackboost() are other examples. (A sketch follows the list.)

corollary (9a): [OCD] ensure the same object type for predictions. There are occasions where predict() will return a vector or a matrix depending on the context. I would argue that this is not optimal.

(10) Use a limited vocabulary for options. For example, some predict() functions have a "type" option to switch between predicted classes and class probabilities. Values of "type" pertaining to class probabilities range over "prob", "probability", "posterior", "raw", "response", etc. I'll make a suggestion of "prob" as a possible standard for this situation. (Sketch below.)

(11) Make sure that class probabilities sum to one. Seriously.

(12) If the model implicitly conducts feature selection, do not require unused predictors to be present in future data sets for prediction. This may be a problem when the formula interface to models is used, but it looks like many functions reference columns by position and not name.

(13) Packages that have their own cross-validation functions should allow the users to pass in the specific folds/resampling indicators to maintain consistency across similar functions in other packages. (Sketch below.)

(14) [OCD] For binary classification models, model the probability of the first level of a factor as the event of interest (again, for consistency). Note that glm() does not do this, but most others use the first level.

To make a few of these concrete, here are some rough sketches. The function names ("foo", "fooCV") and their internals are made up for illustration; treat these as the shape of the interface, not working implementations.
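For (2), (2a), (2b), and (3a), roughly:

foo <- function(x, y, ...) {
  if (!is.factor(y)) stop("'y' should be a factor")
  fit <- list(levels = levels(y))     # (2a) keep the factor levels
  ## ... estimate the model and store it in 'fit' ...
  class(fit) <- "foo"
  fit
}

## (3a) a real predict() method, not a one-off function name
predict.foo <- function(object, newdata, ...) {
  ind <- rep(1L, nrow(newdata))       # placeholder for real class indices
  ## (2b) predicted classes as a factor with the training levels, in order
  factor(object$levels[ind], levels = object$levels)
}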
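For (4) and (5), the usual generic-plus-methods pattern gives both interfaces:

foo <- function(x, ...) UseMethod("foo")

foo.formula <- function(formula, data, ...) {
  mf <- model.frame(formula, data)
  y  <- model.response(mf)
  x  <- model.matrix(attr(mf, "terms"), mf)[, -1, drop = FALSE]
  foo.default(x, y, ...)
}

foo.default <- function(x, y, ...) {
  x <- as.matrix(x)                   # (4) coerce instead of complaining
  ## ... fit the model ...
  fit <- list(levels = levels(y))
  class(fit) <- "foo"
  fit
}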
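For (9) and (9a), suppose a hypothetical "foo" fit stores a coefficient path as a matrix 'beta' with one column per sub-model:

predict.foo <- function(object, newdata, ncomp = object$ncomp, ...) {
  newdata <- as.matrix(newdata)
  out <- vapply(ncomp,
                function(k) drop(newdata %*% object$beta[, k]),
                numeric(nrow(newdata)))
  ## (9a) the same object type every time: a matrix, even for one ncomp
  matrix(out, nrow = nrow(newdata),
         dimnames = list(rownames(newdata), paste0("ncomp", ncomp)))
}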
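For (10) and (11) together (the score step is a placeholder, just random numbers, to make the normalization concrete):

predict.foo <- function(object, newdata, type = c("class", "prob"), ...) {
  type <- match.arg(type)             # (10) a small, fixed vocabulary
  ## placeholder unnormalized scores: one column per class
  scores <- matrix(runif(nrow(newdata) * length(object$levels)),
                   nrow = nrow(newdata),
                   dimnames = list(NULL, object$levels))
  prob <- scores / rowSums(scores)    # (11) each row sums to exactly one
  if (type == "prob") prob
  else factor(colnames(prob)[max.col(prob)], levels = object$levels)
}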
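And for (13), something as simple as an optional 'folds' argument does it ('fooCV' is made up):

fooCV <- function(x, y, folds = NULL, k = 10, ...) {
  if (is.null(folds))                 # (13) honor user-supplied folds
    folds <- sample(rep(seq_len(k), length.out = length(y)))
  for (i in unique(folds)) {
    holdout <- folds == i
    ## ... fit on x[!holdout, , drop = FALSE], assess on the holdout ...
  }
  invisible(folds)
}

Thanks,

Max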
Liaw, Andy
2012-Jan-05 13:34 UTC
[Rd] informal conventions/checklist for new predictive modeling packages
From: Max Kuhn
> If this seems useful, perhaps we could repost the final list to R-Help
> to use as a checklist.

I think this is great, Max! May I suggest that a "standard" be put together (with consensus of many ML package authors), and packages that conform to the standard are marked as such in the ML task view?

Andy
Steve Lianoglou
2012-Jan-05 15:16 UTC
[Rd] informal conventions/checklist for new predictive modeling packages
Good stuff, Max!

Would also be nice to nail your 14 theses to a more permanent wall than the r-help mailing list ... not sure where that would be, though ... isn't someone supposed to be redesigning the r-project.org website? [I jest, I jest]

More seriously, though, it might be worth linking to it from the developer.r-project.org site as well as from some blurb in the header of the ML task view.

-steve

On Wed, Jan 4, 2012 at 9:19 AM, Max Kuhn <mxkuhn at gmail.com> wrote:
> [the full list snipped]
--
Steve Lianoglou
Graduate Student: Computational Systems Biology
 | Memorial Sloan-Kettering Cancer Center
 | Weill Medical College of Cornell University
Contact Info: http://cbio.mskcc.org/~lianos/contact
Paul Johnson
2012-Jan-05 20:44 UTC
[Rd] informal conventions/checklist for new predictive modeling packages
I agree with almost all of it, except the last point. Since I have participated in wheel-reinvention lately, I agree with the bulk of your comment. I don't think the fix is as easy as you suspect: RSiteSearch won't help me find a function I need when I don't know the magic words. Some R functions have such unexpected names that only a fastidious source-code reader would find them ("pretty", for example). But I agree with your concern.

As far as the last point is concerned, though, I think you are mistaken. Explanation below.

On Wed, Jan 4, 2012 at 8:19 AM, Max Kuhn <mxkuhn at gmail.com> wrote:
>
> (14) [OCD] For binary classification models, model the probability of
> the first level of a factor as the event of interest (again, for
> consistency). Note that glm() does not do this, but most others use
> the first level.
>

When the DV is thought of as 0 and 1, and 1 is an "event", "success", or "win" and 0 is a "non-event", "failure", or "loss", and there is to be a single predicted probability, I want it to be the probability of the higher outcome. glm() is doing the thing I want, and I don't know of others that go the other way, except PROC LOGISTIC in SAS. And that has a long history of causing confusion and despair. A small base-R illustration of the glm() convention is below.

I'd like to consider adding one thing to your list, though. I have wished (on this list and elsewhere) that there were a more regular approach for calculating the "newdata" objects that are used in predict(). Many packages have re-invented this (datadist in rms, the effects package), and almost nobody here agreed with my wish for a more standard approach. But if there were a standard approach, it would be much easier to hold up R as an alternative to Stata when users pop up with "marginal effects tables" from Stata that are very difficult to reproduce with R. A bare-bones sketch of what such a helper might look like follows the glm() example.
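To illustrate the convention (plain base R; the toy data are mine):

## glm() models the probability of the *second* factor level, i.e. the
## "higher" outcome, which is what I want:
x <- c(1, 2, 2, 3, 3, 4)
y <- factor(c("loss", "loss", "win", "loss", "win", "win"),
            levels = c("loss", "win"))
fit <- glm(y ~ x, family = binomial)
predict(fit, type = "response")   # fitted Pr(y == "win")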
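And a bare-bones version of the kind of helper I keep wishing for ('typicalData' is a name I just made up): hold every predictor at its mean or modal value except the ones being varied.

typicalData <- function(data, vary) {
  ## typical value per column: mean for numerics, modal value otherwise
  base <- lapply(data, function(v)
    if (is.numeric(v)) mean(v, na.rm = TRUE)
    else names(which.max(table(v))))
  grid <- expand.grid(vary, stringsAsFactors = FALSE)
  for (nm in setdiff(names(base), names(grid)))
    grid[[nm]] <- base[[nm]]
  grid
}

## e.g., vary hp and hold everything else "typical":
## nd <- typicalData(mtcars, vary = list(hp = c(100, 150, 200)))
## predict(fit, newdata = nd)

Regards,

pj

--
Paul E. Johnson
Professor, Political Science
1541 Lilac Lane, Room 504
University of Kansas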