thr3ads.net - R help - [R] identify the distribution of the data [Feb 2023]

Bert Gunter

2023-Feb-08 16:59 UTC

[R] identify the distribution of the data

1. This is a statistical question, which usually is inappropriate here:
this list is about R language (including packages) programming.

2. IMO (so others may disagree), your question indicates a profound
misunderstanding of basic statistical issues. Maybe you phrased it
poorly or I misunderstand, but "identify the type of distribution" is
basically a meaningless query. Explaining why this is so and what may be
more meaningful would require a deep dive into statistics. You might try
consulting a basic statistical text and/or online tutorials. Try searching
on "Goodness of fit", "statistical modeling" or the like.

Cheers,
Bert

On Wed, Feb 8, 2023 at 8:35 AM Bogdan Tanasa <tanasa at gmail.com> wrote:
> Dear all,
>
> I have data frames with numerical values such as 1, 9, 20, 51, 100, etc.
>
> Which approach do you recommend for identifying the type of
> distribution of the data (normal, Poisson, Bernoulli, exponential,
> log-normal, etc.)?
>
> Thanks so much,
>
> Bogdan
>
>         [[alternative HTML version deleted]]
>
> ______________________________________________
> R-help at r-project.org mailing list -- To UNSUBSCRIBE and more, see
> https://stat.ethz.ch/mailman/listinfo/r-help
> PLEASE do read the posting guide
> http://www.R-project.org/posting-guide.html
> and provide commented, minimal, self-contained, reproducible code.
>

Ebert,Timothy Aaron

2023-Feb-08 18:06 UTC

[R] identify the distribution of the data

IMO, the best approach is to develop a good understanding of the individual
processes that produced the observed values. The blend of those processes
determines the distribution of the observed values. This is seldom done,
and is often not possible. The alternatives depend on why you are doing this.

0) Sometimes the nature of the data suggests a distribution. You list integer
values. If all observations are integers (counts, for example), then Poisson may
be appropriate. If there are only two possible values, then maybe the binomial
distribution. Continuous data might be normally distributed (Gaussian
distribution). If I roll one six-sided die many times I will have a uniform
distribution (assuming a fair die). I could then try the same task but roll 2
dice and add the result. I still have discrete values, but the shape is closer
to Gaussian. The distribution looks more and more Gaussian as I add more dice
together in each roll.
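
The dice example can be tried directly in R. This is only a minimal sketch: the
helper name roll_sum, the number of rolls, and the choice of 1, 2, and 10 dice
are assumptions made for illustration, not part of the original post.

```r
set.seed(1)  # arbitrary seed, only so the sketch is reproducible

# Hypothetical helper: roll `n_dice` fair six-sided dice `n_rolls` times
# and return the sum of each roll.
roll_sum <- function(n_dice, n_rolls = 10000) {
  rolls <- matrix(sample(1:6, n_dice * n_rolls, replace = TRUE),
                  nrow = n_rolls)
  rowSums(rolls)
}

one <- roll_sum(1)   # single die: roughly flat (uniform)
two <- roll_sum(2)   # sum of two dice: triangular
ten <- roll_sum(10)  # sum of ten dice: close to bell-shaped

# Bar plots of the observed frequencies, side by side.
op <- par(mfrow = c(1, 3))
barplot(table(one), main = "1 die")
barplot(table(two), main = "Sum of 2 dice")
barplot(table(ten), main = "Sum of 10 dice")
par(op)
```

The values stay discrete in all three panels; only the shape changes as more
dice are added to each roll.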

1) Try a simulation. Draw 5 values from a normal distribution and make a
histogram. Then do it again. Is it easy to see that both samples are from the
same distribution? For me, the answer is no. So increase the sample size until
you are comfortable deciding that any two draws are from the same distribution.
In my view, at 1 million most people would not be able to detect any difference
between the two histograms. This helps calibrate your own judgment. How does
your sample size compare to your choice in this exercise?
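
One way to run this exercise is sketched below. The sample sizes other than 5,
the helper name two_draws, and the plot layout are assumptions chosen for
illustration; adjust them until the two rows of histograms look alike to you.

```r
set.seed(42)  # arbitrary seed, only so the sketch is reproducible

# Hypothetical helper: two independent samples of size n from the same
# standard normal distribution.
two_draws <- function(n) {
  list(first = rnorm(n), second = rnorm(n))
}

sizes <- c(5, 100, 10000)  # assumed sizes; the post starts at 5
draws <- lapply(sizes, two_draws)

# Top row: first draw at each size. Bottom row: second draw.
op <- par(mfrow = c(2, length(sizes)))
for (which_draw in c("first", "second")) {
  for (i in seq_along(sizes)) {
    hist(draws[[i]][[which_draw]],
         main = paste("n =", sizes[i], "-", which_draw, "draw"),
         xlab = "value")
  }
}
par(op)
```

Comparing each column top to bottom shows how much two samples from the same
distribution can differ at a given sample size.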

2) Given that you have sufficient data (see above), can you see the distribution
in your data? Is that good enough?

3) Are you doing this to check the assumptions of a statistical model? In
tests for normality, we tend to assume that a failure to reject the null
hypothesis is sufficient proof that the null hypothesis is true. However, in
most other cases we are told that a failure to reject the null hypothesis is not
sufficient to prove the null hypothesis. You need to work this out for yourself,
but there is a large body of literature on the importance, consequences, and
alternatives of testing model assumptions, with (sometimes) widely divergent
viewpoints.
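
The point about failing to reject can be illustrated with a small simulation.
The post names no specific test; this sketch assumes the Shapiro-Wilk test
(shapiro.test in base R) and uses exponential data, which are clearly not
normal. The helper name rejection_rate and the sample sizes are hypothetical.

```r
set.seed(123)  # arbitrary seed, only so the sketch is reproducible

# Hypothetical helper: fraction of simulated exponential samples of size n
# for which the Shapiro-Wilk test rejects normality at level alpha.
rejection_rate <- function(n, n_sim = 2000, alpha = 0.05) {
  p_values <- replicate(n_sim, shapiro.test(rexp(n))$p.value)
  mean(p_values < alpha)
}

rate_small <- rejection_rate(5)    # very small samples
rate_large <- rejection_rate(200)  # moderately large samples

c(n_5 = rate_small, n_200 = rate_large)
```

With very small samples the test usually fails to reject even though the data
are not normal, so a non-significant result on its own says little about
whether the normality assumption holds.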

4) There are hundreds of distributions (see
https://cran.r-project.org/web/views/Distributions.html), but the common
distributions are listed on sites like this one:
https://www.stat.umn.edu/geyer/old/5101/rlook.html
Given so many choices, you can probably find one that will fit your data
reasonably well. How many data points you have will determine the reliability
of that answer. Is that really informative to the problem you are trying to
solve? Answering "what distribution do these data follow?" is not usually the
goal.
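
If you do want to compare a few candidate distributions, one common approach
(not one the post prescribes) is to fit each by maximum likelihood and compare
AIC values. The sketch below uses only base R and simulated log-normal data;
the data, the three candidates, and the helper name mle_sd are assumptions for
illustration.

```r
set.seed(2023)  # arbitrary seed, only so the sketch is reproducible

# Simulated positive data standing in for a real sample (assumption).
x <- rlnorm(500, meanlog = 3, sdlog = 1)

# Maximum-likelihood standard deviation (divides by n, not n - 1).
mle_sd <- function(v) sqrt(mean((v - mean(v))^2))

# Log-likelihood of each candidate at its closed-form ML estimates.
loglik <- c(
  normal      = sum(dnorm(x, mean = mean(x), sd = mle_sd(x), log = TRUE)),
  exponential = sum(dexp(x, rate = 1 / mean(x), log = TRUE)),
  lognormal   = sum(dlnorm(x, meanlog = mean(log(x)),
                           sdlog = mle_sd(log(x)), log = TRUE))
)

# Number of estimated parameters per candidate.
n_par <- c(normal = 2, exponential = 1, lognormal = 2)

# AIC = 2k - 2 * log-likelihood; lower means a better trade-off of fit
# against complexity among the candidates tried.
aic <- 2 * n_par - 2 * loglik
sort(aic)
```

A low AIC only ranks the candidates you tried against each other. It does not
show that the data follow that distribution, and with few data points the
ranking can easily change from sample to sample.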

Regards,
Tim
