thr3ads.net - R help - [R] R's Data Dredging Philosophy for Distribution Fitting [Jul 2010]

emorway

2010-Jul-14 23:22 UTC

[R] R's Data Dredging Philosophy for Distribution Fitting

Forum, 

I'm a grad student in Civil Eng who took some Stats classes that required
students to learn R, and I have since taken to R and use it for as much as I
can.  Back in my lab/office, many of my fellow grad students still use
proprietary software at the behest of advisers who are familiar with the
recommended software (Statistica, @Risk (Excel Add-on), etc).  I have spent
a lot of time learning R and am confident it can generally out-process,
out-graph, or more simply stated, out-perform most of these other software
packages.  However, one area where my view has been humbled is distribution
fitting.

I started by reading through
http://cran.r-project.org/doc/contrib/Ricci-distributions-en.pdf and then
started digging around on this forum, where I found posts like this one
http://r.789695.n4.nabble.com/Fitting-usual-distributions-td800000.html#a800000
that are close to what I'm after.  That is, given a dataset of observations, I
would like to call a function that cycles through numerous distributions
(common or not) and then ranks them for me based on Chi-Square,
Kolmogorov-Smirnov and/or Anderson-Darling, for example.

This question was asked back in 2004:
http://finzi.psych.upenn.edu/R/Rhelp02a/archive/37053.html but the response
was that this kind of thing wasn't in R nor in proprietary software to the
best of the responding author's memory.  In 2010, however, this is no longer
true as @Risk's
(http://www.palisade.com/risk/?gclid=CKvblPSM7KICFZQz5wodDRI2fg)
"Distribution Fitting" function does this very thing.  And it is here that
my R pride has taken a hit.  Based on the first response to the question
posed here
http://r.789695.n4.nabble.com/Which-distribution-best-fits-the-data-td859448.html#a859448
is it fair to say that the R community (I realize this is only 1 view) would
take exception to this kind of "data mining"?  

Unless I've missed a discussion of a package that does this very thing, it
seems as though I would need to code something up using fitdistr() and do
all the ranking myself.  Undoubtedly that would be a good exercise for me,
but it's hard for me to believe R would be a runner-up to something like
distribution fitting in @Risk.

Eric

Ben Bolker

2010-Jul-15 01:25 UTC

emorway <emorway <at> engr.colostate.edu> writes:
> Unless I've missed a discussion of a package that does this very thing, it
> seems as though I would need to code something up using fitdistr() and do
> all the ranking myself.  Undoubtedly that would be a good exercise for me,
> but it's hard for me to believe R would be a runner-up to something like
> distribution fitting in @Risk.
> 
   I was one of the respondents in some of the threads you list above,
and I still question why you're doing this in the first place: it's not
*necessarily* a silly thing to do, but that would be my default position.

  It's not hard to hack up something that tries all the distributions
fitdistr() knows about and compares their AIC values (completely ignoring
sensible considerations like whether the distribution is discrete
or not ...)  See below ...

  It's hard to see how you could have a mechanistic (rather
than phenomenological) model in mind if you just want to try
a whole variety of families (not 1 or 2).  Perhaps some flexible
family like Johnson distributions 
<http://finzi.psych.upenn.edu/R/library/SuppDists/html/Johnson.html>
would be appropriate, or log-spline densities
<http://cran.r-project.org/web/packages/logspline/logspline.pdf> ...

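===========
library(MASS)  # provides fitdistr()

## distribution names accepted by fitdistr()
distlist <- c("beta","cauchy","chi-squared","exponential",
              "f","gamma","geometric","log-normal","logistic",
              "negative binomial","normal","poisson","t","weibull")

x <- runif(1000)

## wrapper: return a "try-error" object instead of stopping when a fit fails
dd <- function(...) {
  try(fitdistr(...),silent=TRUE)
}

## fit every family in distlist to x
s <- lapply(as.list(distlist),dd,x=x)
names(s) <- distlist

## AIC for each successful fit, NA where the fit failed
sapply(s,function(z) if (inherits(z,"try-error")) NA else AIC(z))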
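
A minimal sketch of how AIC values like those above could be turned into a
ranked table.  The simulated gamma data, the shortlist of continuous
families in cand, and the names fits, rank_tab and dAIC are assumptions made
for illustration; they are not part of the code above.

```r
library(MASS)  # provides fitdistr()

set.seed(1)
x <- rgamma(500, shape = 2, rate = 1)   # hypothetical positive-valued data

## candidate continuous families (an arbitrary shortlist for illustration)
cand <- c("exponential", "gamma", "log-normal", "normal", "weibull")

## fit each family; a failed fit becomes a "try-error" object
fits <- lapply(cand, function(d) try(fitdistr(x, d), silent = TRUE))
names(fits) <- cand

## keep only the fits that succeeded
ok <- !sapply(fits, inherits, "try-error")
aic <- sapply(fits[ok], AIC)

## table sorted from lowest (best) to highest AIC
rank_tab <- data.frame(distribution = names(aic), AIC = unname(aic))
rank_tab <- rank_tab[order(rank_tab$AIC), ]
rank_tab$dAIC <- rank_tab$AIC - min(rank_tab$AIC)   # difference from the best
print(rank_tab, row.names = FALSE)
```

A dAIC of 0 marks the lowest-AIC family.  AIC only compares the candidates
with each other; it does not say that any of them fits the data well.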

Frank E Harrell Jr

2010-Jul-15 01:31 UTC

On 07/14/2010 06:22 PM, emorway wrote:
> Unless I've missed a discussion of a package that does this very thing, it
> seems as though I would need to code something up using fitdistr() and do
> all the ranking myself.  Undoubtedly that would be a good exercise for me,
> but it's hard for me to believe R would be a runner-up to something like
> distribution fitting in @Risk.
>
> Eric
Eric,

I didn't read the links you provided but the approach you have advocated 
(and you are not alone) is futile.  If you entertain more than about 2 
distributions, the variance of the final fits is no better than the 
variance of the empirical cumulative distribution function (once you 
properly adjust variances for model uncertainty).  So just go empirical. 
  In general if your touchstone is the observed data (as in checking 
goodness of fit of various parametric distributions), your final 
estimators will have the variance of empirical estimators.
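
To illustrate what going empirical might look like in base R, here is a
minimal sketch.  The simulated data, the evaluation point 3, the grid, and
the pointwise band (a normal approximation using the binomial variance
Fn(1-Fn)/n) are assumptions of this sketch, not something prescribed above.

```r
set.seed(1)
x <- rgamma(500, shape = 2, rate = 1)   # hypothetical observations

Fn <- ecdf(x)                     # empirical CDF, returned as a step function
Fn(3)                             # estimated P(X <= 3)
quantile(x, c(0.05, 0.5, 0.95))   # empirical quantiles

## approximate pointwise 95% band for the ECDF on a grid of values
n    <- length(x)
grid <- seq(min(x), max(x), length.out = 200)
p    <- Fn(grid)
se   <- sqrt(p * (1 - p) / n)
band <- data.frame(x     = grid,
                   lower = pmax(p - 1.96 * se, 0),
                   est   = p,
                   upper = pmin(p + 1.96 * se, 1))

## simulate new values by resampling the observed data with replacement
sim <- sample(x, 1000, replace = TRUE)
```

Resampling the data directly, as in the last line, avoids committing to any
parametric family at all.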

Frank
-- 
Frank E Harrell Jr   Professor and Chairman        School of Medicine
                      Department of Biostatistics   Vanderbilt University
