Liaw, Andy
2005-Oct-27 16:37 UTC
[R] Repost: Examples of "classwt", "strata", and "sampsize" in randomForest?
"classwt" in the current version of the randomForest package doesn't work too well. (It's what was in version 3.x of the original Fortran code by Breiman and Cutler, not the one in the new Fortran code.) I'd advise against using it.

"sampsize" and "strata" can be used in conjunction. If "strata" is not specified, the class labels will be used. Take the iris data as an example:

    randomForest(Species ~ ., iris, sampsize=c(10, 30, 10))

says to randomly draw 10, 30 and 10 from the three species (with replacement) to grow each tree. If you are unsure of the labels, use a named vector, e.g.,

    randomForest(Species ~ ., iris,
                 sampsize=c(setosa=10, versicolor=30, virginica=10))

Now, if you want the stratified sampling to be done using a different variable than the class labels; e.g., for multi-center clinical trial data, you want to draw the same number of patients per center to grow each tree (I'm just making things up, not that that necessarily makes any sense), you can do something like:

    randomForest(..., strata=center,
                 sampsize=rep(min(table(center)), nlevels(center)))

which draws the same number of patients (the minimum at any center) from each center to grow each tree.

Hope that's clear. Eventually all such things will be in the yet-to-be-written package vignette...

Andy

> From: David L. Van Brunt, Ph.D.
>
> I have read both the help files and that article... the article very nicely
> evaluates the value of dealing with unbalanced data, and the help files show
> that you can, but offer no guidance in terms of how the syntax should be
> specified. The "strata" and "classwt" clearly can be specified, but it's not
> shown how to specify the values...
>
> The examples do not include specifications of those terms, and every guess
> I've made has generated an error....
>
> On 10/27/05, Gabor Grothendieck <ggrothendieck at gmail.com> wrote:
> >
> > See
> > http://finzi.psych.upenn.edu/R/Rhelp02a/archive/40898.html
> >
> > On 10/27/05, David L. Van Brunt, Ph.D. <dlvanbrunt at gmail.com> wrote:
> > > Sorry for the repost, but I've really been looking, and can't find any
> > > syntax direction on this issue...
> > >
> > > Just browsing the documentation, and searching the list came up short... I
> > > have some unbalanced data and was wondering if, in a "0" v "1"
> > > classification forest, some combo of these options might yield better
> > > predictions when the proportion of one class is low (less than 10% in a
> > > sample of 2,000 observations).
> > >
> > > Not sure how to specify these terms... from the docs, we have:
> > >
> > > classwt: Priors of the classes. Need not add up to one. Ignored for
> > > regression.
> > >
> > > So is this something like "... classwt=c(.90,.10)" ? I didn't see the syntax
> > > demonstrated. Similar for "strata" and "sampsize" though there is a default
> > > for sampsize that makes sense... not sure how you would make "a vector of
> > > the length the number of strata", however....
> > >
> > > Pointers?
> > >
> > > --
> > > ---------------------------------------
> > > David L. Van Brunt, Ph.D.
> > > mailto:dlvanbrunt at gmail.com
> > >
> > > [[alternative HTML version deleted]]
> > >
> > > ______________________________________________
> > > R-help at stat.math.ethz.ch mailing list
> > > https://stat.ethz.ch/mailman/listinfo/r-help
> > > PLEASE do read the posting guide!
> > > http://www.R-project.org/posting-guide.html
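For anyone following along, here is a minimal base-R sketch of how the "sampsize" vector in the clinical-trial example above is built. The `center` factor and its patient counts are made up for illustration; nothing here calls randomForest itself, so it should run in a plain R session.

```r
# Hypothetical data: three trial centers with unequal patient counts (made up).
center <- factor(rep(c("A", "B", "C"), times = c(40, 25, 60)))

# Number of patients available at each center.
per_center <- table(center)

# One entry per stratum, each equal to the smallest center's count.
sampsize <- rep(min(per_center), nlevels(center))

# Optional: name the entries by stratum level to make the mapping explicit.
names(sampsize) <- levels(center)

print(sampsize)
```

With these made-up counts the smallest center has 25 patients, so each of the three strata would contribute 25 draws per tree. The resulting vector is what would be passed as `sampsize=` alongside `strata=center`.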
David L. Van Brunt, Ph.D.
2005-Oct-27 17:37 UTC
[R] Repost: Examples of "classwt", "strata", and "sampsize" in randomForest?
Perfect! More useful than I was even hoping for. Great help, many thanks!
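As a rough sketch of how the "sampsize" advice might be applied to the unbalanced "0" v "1" case from the original question: the data below are simulated stand-ins (2,000 observations, roughly 8% in class "1"), the predictors `x1` and `x2` are invented, and drawing equal numbers from each class is only one possible choice, not something recommended in the thread. The fit is skipped if the randomForest package is not installed.

```r
# Simulated stand-in for the data described in the question (assumption):
# 2,000 observations, roughly 8% in class "1".
set.seed(1)
n <- 2000
y <- factor(rbinom(n, size = 1, prob = 0.08), levels = c(0, 1))
x <- data.frame(x1 = rnorm(n) + as.integer(y == "1"), x2 = rnorm(n))

# Draw the same number of cases from each class for every tree,
# using the size of the rarer class as the per-class draw.
n_small <- min(table(y))
sampsize <- c("0" = n_small, "1" = n_small)

# Fit only if the randomForest package is available.
if (requireNamespace("randomForest", quietly = TRUE)) {
  # strata is left unspecified, so the class labels in y are used.
  fit <- randomForest::randomForest(x, y, sampsize = sampsize)
  print(fit$confusion)
}
```

Comparing the printed confusion matrix against a fit without `sampsize` is one way to see whether the stratified draw helps on the rare class for a given data set.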