Max Kuhn
2012-Jan-04 14:19 UTC
[Rd] informal conventions/checklist for new predictive modeling packages
Working on the caret package has exposed me to the wide variety of approaches that different authors have taken to creating predictive modeling functions (aka machine learning, aka pattern recognition).

I suspect that many package authors are neophyte R users and are stumbling through the process of writing their first R package (or R code). As such, they may not have been exposed to some of the informal conventions that have evolved over time. Also, their package may be intended to demonstrate their research and not for "production" modeling. In any case, it might be a good idea to write up a few points for consideration when creating a predictive modeling package. I don't propose changes to existing code.

Some of this is obvious and not limited to this class of modeling packages. Many of these points are arguable, so please argue them.

If this seems useful, perhaps we could repost the final list to R-Help to use as a checklist.

Those of you who have used my code will probably realize that I am not a grand architect of R packages =] I'd love to get feedback from those of you with a broader perspective and better software engineering skills than I (a low bar to step over).

I have marked a few of these items with an OCD tag since I might be taking it a bit too far.

The list:

(1) Extend the work of others. There is an amazing amount of unneeded redundancy. There are plenty of times that users implement their own version of a function because there is a missing feature, but a lot of time is spent re-creating duplicate functions. For example, kernlab has an excellent set of kernel functions that are really efficient and have useful ancillary functions. People may not be aware of these functions, but they are one RSiteSearch away. (Perhaps we could nominate a few packages like kernlab that implement a specific tool well.)

(2) When modeling a categorical outcome, use a factor as input (as opposed to 0/1 indicators or integers). Factors are exactly the kind of feature that separates R from other languages (I'm looking at you, SAS) and are a natural structure for this data type. (A sketch of (2)-(3a) follows the list.)

corollary (2a): save the factor levels in the model object somewhere

corollary (2b): return predicted classes as factors with the same levels (and ordering of levels)

(3) Implement a separate prediction function. Some packages only make predictions when the model is built, so effectively the model cannot be used at any point in the future.

corollary (3a): use object-orientation (eg. predict.{class}) and not some made-up function name "modelPredict()" for predicting new samples

(4) If the method only accepts a specific type of input (eg. matrix or data frame), please do the conversion whenever appropriate.

(5) Provide a formula interface (eg. foo(y ~ x, data = dat)) and a non-formula interface (foo(x, y)) to the function. Formula methods are really inefficient at this time for high-dimensional data but are fantastically convenient. There are some good reasons not to use formulas, such as functions that do not use a design matrix (eg. cforest()) or that need factors to be handled in a non-standard way (eg. cubist()). (See the sketch of (4)-(5) after the list.)

(6) Don't require a test set when model building.

(7) Control all written output during model-building time with a verbose option. Resampling can make a mess out of things if output/logging is always exposed.

(8) Please use RSiteSearch to avoid name collisions between packages (eg. gam(), splsda(), roc(), LogitBoost()). Also search Bioconductor.
(9) Allow the predict function to generate results from many different sub-models simultaneously. For example, pls() can return predictions across many values of ncomp. enet(), cubist(), and blackboost() are other examples. (A sketch follows the list.)

corollary (9a): [OCD] ensure the same object type for predictions. There are occasions where predict() will return a vector or a matrix depending on the context. I would argue that this is not optimal.

(10) Use a limited vocabulary for options. For example, some predict() functions have a "type" option to switch between predicted classes and class probabilities. Values of "type" pertaining to class probabilities range over "prob", "probability", "posterior", "raw", "response", etc. I'll make a suggestion of "prob" as a possible standard for this situation. (Sketch below.)

(11) Make sure that class probabilities sum to one. Seriously.

(12) If the model implicitly conducts feature selection, do not require unused predictors to be present in future data sets for prediction. This may be a problem when the formula interface to models is used, but it looks like many functions reference columns by position and not name.

(13) Packages that have their own cross-validation functions should allow the users to pass in the specific folds/resampling indicators to maintain consistency across similar functions in other packages. (Sketch below.)

(14) [OCD] For binary classification models, model the probability of the first level of a factor as the event of interest (again, for consistency). Note that glm() does not do this, but most others use the first level.

To make a few of these concrete, here are some rough sketches. The function names ("foo", "fooCV") and their internals are made up for illustration; treat these as the shape of the interface, not working implementations.
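For (2), (2a), (2b), and (3a), roughly:

foo <- function(x, y, ...) {
  if (!is.factor(y)) stop("'y' should be a factor")
  fit <- list(levels = levels(y))     # (2a) keep the factor levels
  ## ... estimate the model and store it in 'fit' ...
  class(fit) <- "foo"
  fit
}

## (3a) a real predict() method, not a one-off function name
predict.foo <- function(object, newdata, ...) {
  ind <- rep(1L, nrow(newdata))       # placeholder for real class indices
  ## (2b) predicted classes as a factor with the training levels, in order
  factor(object$levels[ind], levels = object$levels)
}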
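For (4) and (5), the usual generic-plus-methods pattern gives both interfaces:

foo <- function(x, ...) UseMethod("foo")

foo.formula <- function(formula, data, ...) {
  mf <- model.frame(formula, data)
  y  <- model.response(mf)
  x  <- model.matrix(attr(mf, "terms"), mf)[, -1, drop = FALSE]
  foo.default(x, y, ...)
}

foo.default <- function(x, y, ...) {
  x <- as.matrix(x)                   # (4) coerce instead of complaining
  ## ... fit the model ...
  fit <- list(levels = levels(y))
  class(fit) <- "foo"
  fit
}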
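For (9) and (9a), suppose a hypothetical "foo" fit stores a coefficient path as a matrix 'beta' with one column per sub-model:

predict.foo <- function(object, newdata, ncomp = object$ncomp, ...) {
  newdata <- as.matrix(newdata)
  out <- vapply(ncomp,
                function(k) drop(newdata %*% object$beta[, k]),
                numeric(nrow(newdata)))
  ## (9a) the same object type every time: a matrix, even for one ncomp
  matrix(out, nrow = nrow(newdata),
         dimnames = list(rownames(newdata), paste0("ncomp", ncomp)))
}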
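For (10) and (11) together (the score step is a placeholder, just random numbers, to make the normalization concrete):

predict.foo <- function(object, newdata, type = c("class", "prob"), ...) {
  type <- match.arg(type)             # (10) a small, fixed vocabulary
  ## placeholder unnormalized scores: one column per class
  scores <- matrix(runif(nrow(newdata) * length(object$levels)),
                   nrow = nrow(newdata),
                   dimnames = list(NULL, object$levels))
  prob <- scores / rowSums(scores)    # (11) each row sums to exactly one
  if (type == "prob") prob
  else factor(colnames(prob)[max.col(prob)], levels = object$levels)
}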
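And for (13), something as simple as an optional 'folds' argument does it ('fooCV' is made up):

fooCV <- function(x, y, folds = NULL, k = 10, ...) {
  if (is.null(folds))                 # (13) honor user-supplied folds
    folds <- sample(rep(seq_len(k), length.out = length(y)))
  for (i in unique(folds)) {
    holdout <- folds == i
    ## ... fit on x[!holdout, , drop = FALSE], assess on the holdout ...
  }
  invisible(folds)
}

Thanks,

Max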
Liaw, Andy
2012-Jan-05 13:34 UTC
[Rd] informal conventions/checklist for new predictive modeling packages
From: Max Kuhn
> If this seems useful, perhaps we could repost the final list to R-Help
> to use as a checklist.

I think this is great, Max! May I suggest that a "standard" be put together (with consensus of many ML package authors), and packages that conform to the standard are marked as such in the ML task view?

Andy
Steve Lianoglou
2012-Jan-05 15:16 UTC
[Rd] informal conventions/checklist for new predictive modeling packages
Good stuff, Max!

Would also be nice to nail your 14 theses to a more permanent wall than the r-help mailing list ... not sure where that would be, though ... isn't someone supposed to be redesigning the r-project.org website? [I jest, I jest]

More seriously, though, it might be worth linking to it from the developer.r-project.org site as well as from some blurb in the header of the ML task view.

-steve

On Wed, Jan 4, 2012 at 9:19 AM, Max Kuhn <mxkuhn at gmail.com> wrote:
> [the full list snipped]
--
Steve Lianoglou
Graduate Student: Computational Systems Biology
 | Memorial Sloan-Kettering Cancer Center
 | Weill Medical College of Cornell University
Contact Info: http://cbio.mskcc.org/~lianos/contact
Paul Johnson
2012-Jan-05 20:44 UTC
[Rd] informal conventions/checklist for new predictive modeling packages
I agree with almost all of it, except the last point. Since I have participated in wheel-reinvention lately, I agree with the bulk of your comment. I don't think the fix is as easy as you suspect: RSiteSearch won't help me find a function I need when I don't know the magic words. Some R functions have such unexpected names that only a fastidious source-code reader would find them ("pretty", for example). But I agree with your concern.

As far as the last point is concerned, though, I think you are mistaken. Explanation below.

On Wed, Jan 4, 2012 at 8:19 AM, Max Kuhn <mxkuhn at gmail.com> wrote:
>
> (14) [OCD] For binary classification models, model the probability of
> the first level of a factor as the event of interest (again, for
> consistency). Note that glm() does not do this, but most others use
> the first level.
>

When the DV is thought of as 0 and 1, and 1 is an "event", "success", or "win" and 0 is a "non-event", "failure", or "loss", and there is to be a single predicted probability, I want it to be the probability of the higher outcome. glm() is doing the thing I want, and I don't know of others that go the other way, except PROC LOGISTIC in SAS. And that has a long history of causing confusion and despair. A small base-R illustration of the glm() convention is below.

I'd like to consider adding one thing to your list, though. I have wished (on this list and elsewhere) that there were a more regular approach for calculating the "newdata" objects that are used in predict(). Many packages have re-invented this (datadist in rms, the effects package), and almost nobody here agreed with my wish for a more standard approach. But if there were a standard approach, it would be much easier to hold up R as an alternative to Stata when users pop up with "marginal effects tables" from Stata that are very difficult to reproduce with R. A bare-bones sketch of what such a helper might look like follows the glm() example.
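To illustrate the convention (plain base R; the toy data are mine):

## glm() models the probability of the *second* factor level, i.e. the
## "higher" outcome, which is what I want:
x <- c(1, 2, 2, 3, 3, 4)
y <- factor(c("loss", "loss", "win", "loss", "win", "win"),
            levels = c("loss", "win"))
fit <- glm(y ~ x, family = binomial)
predict(fit, type = "response")   # fitted Pr(y == "win")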
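And a bare-bones version of the kind of helper I keep wishing for ('typicalData' is a name I just made up): hold every predictor at its mean or modal value except the ones being varied.

typicalData <- function(data, vary) {
  ## typical value per column: mean for numerics, modal value otherwise
  base <- lapply(data, function(v)
    if (is.numeric(v)) mean(v, na.rm = TRUE)
    else names(which.max(table(v))))
  grid <- expand.grid(vary, stringsAsFactors = FALSE)
  for (nm in setdiff(names(base), names(grid)))
    grid[[nm]] <- base[[nm]]
  grid
}

## e.g., vary hp and hold everything else "typical":
## nd <- typicalData(mtcars, vary = list(hp = c(100, 150, 200)))
## predict(fit, newdata = nd)

Regards,

pj

--
Paul E. Johnson
Professor, Political Science
1541 Lilac Lane, Room 504
University of Kansas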