Forum,

I'm a grad student in Civil Eng, took some Stats classes that required students to learn R, and I have since taken to R and use it for as much as I can. Back in my lab/office, many of my fellow grad students still use proprietary software at the behest of advisers who are familiar with the recommended software (Statistica, @Risk (Excel Add-on), etc). I have spent a lot of time learning R and am confident it can generally out-process, out-graph, or more simply stated, out-perform most of these other software packages. However, one area where my view has been humbled is distribution fitting.

I started by reading through http://cran.r-project.org/doc/contrib/Ricci-distributions-en.pdf After that I started digging around on this forum and found posts like this one http://r.789695.n4.nabble.com/Fitting-usual-distributions-td800000.html#a800000 that are close to what I'm after. That is, given a dataset of observations, I would like to call a function that cycles through numerous distributions (common or not) and then ranks them for me based on Chi-Square, Kolmogorov-Smirnov and/or Anderson-Darling, for example.

This question was asked back in 2004: http://finzi.psych.upenn.edu/R/Rhelp02a/archive/37053.html but the response was that this kind of thing wasn't in R nor in proprietary software to the best of the responding author's memory. In 2010, however, this is no longer true as @Risk's (http://www.palisade.com/risk/?gclid=CKvblPSM7KICFZQz5wodDRI2fg) "Distribution Fitting" function does this very thing. And it is here that my R pride has taken a hit. Based on the first response to the question posed here http://r.789695.n4.nabble.com/Which-distribution-best-fits-the-data-td859448.html#a859448 is it fair to say that the R community (I realize this is only 1 view) would take exception to this kind of "data mining"?

Unless I've missed a discussion of a package that does this very thing, it seems as though I would need to code something up using fitdistr() and do all the ranking myself. Undoubtedly that would be a good exercise for me, but it's hard for me to believe R would be a runner-up to something like distribution fitting in @Risk.

Eric

--
View this message in context: http://r.789695.n4.nabble.com/R-s-Data-Dredging-Philosophy-for-Distribution-Fitting-tp2289508p2289508.html
Sent from the R help mailing list archive at Nabble.com.

Ben Bolker
2010-Jul-15 01:25 UTC
[R] R's Data Dredging Philosophy for Distribution Fitting
emorway <emorway <at> engr.colostate.edu> writes:

> Unless I've missed a discussion of a package that does this very thing, it
> seems as though I would need to code something up using fitdistr() and do
> all the ranking myself. Undoubtedly that would be a good exercise for me,
> but it's hard for me to believe R would be a runner-up to something like
> distribution fitting in @Risk.

I was one of the respondents in some of the threads you list above, and I still question why you're doing this in the first place: it's not *necessarily* a silly thing to do, but that would be my default position.

It's not hard to hack up something that tries all the distributions fitdistr() knows about and compares their AIC values (completely ignoring sensible considerations like whether the distribution is discrete or not ...) See below ...

It's hard to see how you could have a mechanistic (rather than phenomenological) model in mind if you just want to try a whole variety of families (not 1 or 2). Perhaps some flexible family like Johnson distributions <http://finzi.psych.upenn.edu/R/library/SuppDists/html/Johnson.html> would be appropriate, or log-spline densities <http://cran.r-project.org/web/packages/logspline/logspline.pdf> ...

===========

library(MASS)  ## provides fitdistr()

## every family name that fitdistr() accepts as a character string
distlist <- c("beta","cauchy","chi-squared","exponential",
              "f","gamma","geometric","log-normal","logistic",
              "negative binomial","normal","poisson","t","weibull")

x <- runif(1000)  ## example data

## wrapper: return a "try-error" object instead of stopping when a fit fails
dd <- function(...) {
  try(fitdistr(...),silent=TRUE)
}

s <- lapply(as.list(distlist),dd,x=x)
names(s) <- distlist

## AIC for each successful fit, NA where the fit failed
sapply(s,function(z) if (inherits(z,"try-error")) NA else AIC(z))

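As a hedged follow-up to the code above: a minimal sketch of the "ranking" step, restricted to continuous families that make sense for positive data. The simulated gamma sample and the choice of families are assumptions for illustration only, not anything from the original post, and AIC is used here in place of the Chi-Square, Kolmogorov-Smirnov, or Anderson-Darling statistics the question mentions.

```r
library(MASS)  # provides fitdistr()

set.seed(1)
# Simulated positive-valued data (an assumption, for illustration only)
x <- rgamma(500, shape = 2, rate = 1)

# Continuous families that fitdistr() can fit without user-supplied start values
families <- c("exponential", "gamma", "log-normal", "normal", "weibull")

# Fit each family; a failed fit becomes a "try-error" object instead of stopping
fits <- lapply(families, function(f) try(fitdistr(x, f), silent = TRUE))
names(fits) <- families

# AIC per family, NA where the fit failed
aic <- sapply(fits, function(z) if (inherits(z, "try-error")) NA else AIC(z))

# Smallest AIC first; sort() drops NA entries by default
ranking <- sort(aic)
print(ranking)
```

The ranking only says which of the candidate families fits this sample best in the AIC sense; it does not address the objections raised in this thread about searching over many families.
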
Frank E Harrell Jr
2010-Jul-15 01:31 UTC
[R] R's Data Dredging Philosophy for Distribution Fitting
On 07/14/2010 06:22 PM, emorway wrote:

> Unless I've missed a discussion of a package that does this very thing, it
> seems as though I would need to code something up using fitdistr() and do
> all the ranking myself. Undoubtedly that would be a good exercise for me,
> but it's hard for me to believe R would be a runner-up to something like
> distribution fitting in @Risk.
>
> Eric

Eric,

I didn't read the links you provided, but the approach you have advocated (and you are not alone) is futile. If you entertain more than about 2 distributions, the variance of the final fits is no better than the variance of the empirical cumulative distribution function (once you properly adjust variances for model uncertainty). So just go empirical.

In general, if your touchstone is the observed data (as in checking goodness of fit of various parametric distributions), your final estimators will have the variance of empirical estimators.

Frank

--
Frank E Harrell Jr
Professor and Chairman
School of Medicine
Department of Biostatistics
Vanderbilt University
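
One way to read "just go empirical" in R is sketched below, using only base functions (ecdf(), quantile(), sample()). The simulated data, the threshold of 3, and the bootstrap size B are assumptions chosen for illustration; this is not code from the reply above.

```r
set.seed(1)
# Simulated data (an assumption, for illustration only)
x <- rgamma(500, shape = 2, rate = 1)

# Empirical cumulative distribution function, returned as a step function
Fhat <- ecdf(x)

# Estimated P(X <= 3), read directly from the ECDF
p_le_3 <- Fhat(3)

# Empirical quantiles, with no parametric family assumed
q <- quantile(x, probs = c(0.05, 0.5, 0.95))

# Nonparametric bootstrap standard error of the empirical 95th percentile
B <- 1000
boot_q95 <- replicate(B, quantile(sample(x, replace = TRUE), 0.95, names = FALSE))
se_q95 <- sd(boot_q95)

print(p_le_3)
print(q)
print(se_q95)
```

The bootstrap standard error gives a direct view of the variance of the empirical estimator, which is the benchmark the reply above says a multi-family search will not beat.
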