Hello,

I have a bunch of files containing 300 data points each, with values from 0 to 1 which also sum to 1 (I don't think the last element is relevant though). In addition, each data point is annotated as an "a" or a "b".

I would like to know in which files (if any) the data is uniformly distributed.

I used Google and found out that a Kolmogorov-Smirnov or a Chi-square goodness-of-fit test could be used. Then I looked up ?kolmogorov and found "ks.test", but the example there is for the normal distribution and I am not sure how to adapt it for the uniform distribution. I did ?runif and read about the uniform distribution, but it doesn't say what the "cumulative distribution" is. Is it "punif", like "pnorm"? I thought of that because I found a message on this list where someone was told to use "pnorm" instead of "dnorm". But the help page on the uniform distribution says punif is the "distribution function". Are the "cumulative distribution" and the "distribution function" the same thing? Having several names for the same thing has always confused me very much in statistics.

Also, I am not sure whether I need to specify any parameters for the distribution, and which ones. I thought maybe I should specify "min=0" and "max=1", but those appear to be the defaults. Do I need to specify q, the vector of quantiles?

So is

ks.test(x, punif)

correct or not for what I am attempting to do?

After this I will also need to find out whether the a's and b's are distributed randomly in each file. I would be grateful for any pointers, although I have not researched this issue yet.

Kairavi.
Yes, punif is the function to use. However, the KS test (and the others) are based on an assumption of independence, and if you know that your data points sum to 1, then they are not independent (and not uniform if there are more than 2). Also note that these tests only rule out distributions (with a given type I error rate); they cannot confirm that the data come from a given distribution (just that either they do, or there is not enough power to distinguish between the actual and the test distributions).

What is your ultimate question/goal? Why do you care if the data is uniform or not?

-- 
Gregory (Greg) L. Snow Ph.D.
Statistical Data Center
Intermountain Healthcare
greg.snow at imail.org
801.408.8111

> -----Original Message-----
> From: r-help-bounces at r-project.org [mailto:r-help-bounces at r-project.org] On Behalf Of Kairavi Bhakta
> Sent: Friday, June 10, 2011 11:24 AM
> To: r-help at r-project.org
> Subject: [R] Test if data uniformly distributed (newbie)
>
> ______________________________________________
> R-help at r-project.org mailing list
> https://stat.ethz.ch/mailman/listinfo/r-help
> PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
> and provide commented, minimal, self-contained, reproducible code.
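The two points in the reply above can be illustrated with a small sketch. It uses simulated data, not the poster's files; the seed and the variable names (u, w, p) are assumptions made for illustration only. It shows that punif is the cumulative distribution function ks.test expects (with defaults min = 0 and max = 1), and that 300 values rescaled to sum to 1 cluster near 1/300, so they cannot look like draws from Uniform(0, 1).

```r
set.seed(1)                      # arbitrary seed, for reproducibility only

# 300 independent draws from Uniform(0, 1): the case the KS test is meant for
u <- runif(300)
res_u <- ks.test(u, "punif")     # same as ks.test(u, punif, min = 0, max = 1)

# 300 positive values rescaled to sum to 1 (assumed to mimic the files described)
w <- runif(300)
p <- w / sum(w)
res_p <- ks.test(p, "punif")

print(res_u$p.value)             # p-value for genuinely uniform data
print(res_p$p.value)             # essentially zero: all values sit near 1/300
print(range(p))
```

A rejection in the second case says nothing about whether the normalized vector is "flat"; it only reflects that the values are nowhere near spread over the interval from 0 to 1.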
Thanks for your answer. The reason I want the data to be uniform: it's the first step in a machine learning project I am working on. If I know the data isn't uniformly distributed, then this means there is probably something wrong, and the following steps will be biased by the non-uniform input data. I'm not checking an assumption for another statistical test.

Actually, the data has been normalized because it is supposed to represent a probability distribution. That's why it sums to 1. My assumption is that, for a vector of 5, the data at that point should look like 0.20 0.20 0.20 0.20 0.20, but of course there is variation, and I would like to test whether the data comes close enough or not.

At the moment I am only testing whether there are more a's than b's in the top and bottom portion of each file (with a Wilcoxon test; I have 8 reps of the model I am trying to build). But that sort of felt like a very ad hoc solution, and I figured maybe testing for uniformity would be better, or at least an important addition.

I've also been looking into testing for the randomness of the sequence of a's and b's instead of the Wilcoxon test, although that may or may not involve R.

Kairavi.
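One common way to check whether a two-label sequence is in random order is a Wald-Wolfowitz runs test. The thread does not name a specific method, so the following is only a sketch of one possible approach, written in base R. The function name runs_test and the simulated label vectors are hypothetical, and the p-value relies on a normal approximation.

```r
# Wald-Wolfowitz runs test for a sequence of "a"/"b" labels (base R only).
# 'labels' is a hypothetical character vector of annotations in file order.
runs_test <- function(labels) {
  labels <- as.character(labels)
  n1 <- sum(labels == "a")
  n2 <- sum(labels == "b")
  stopifnot(n1 > 0, n2 > 0, n1 + n2 == length(labels))
  n <- n1 + n2
  runs <- length(rle(labels)$lengths)          # observed number of runs
  mu <- 2 * n1 * n2 / n + 1                    # expected runs under random order
  sigma2 <- 2 * n1 * n2 * (2 * n1 * n2 - n) / (n^2 * (n - 1))
  z <- (runs - mu) / sqrt(sigma2)
  list(runs = runs, expected = mu, z = z,
       p.value = 2 * pnorm(-abs(z)))           # two-sided normal approximation
}

set.seed(2)                                    # arbitrary seed
random_labels <- sample(c("a", "b"), 300, replace = TRUE)
sorted_labels <- sort(random_labels)           # all a's first, then all b's

print(runs_test(random_labels)$p.value)
print(runs_test(sorted_labels)$p.value)        # tiny: only two runs
```

Too few runs suggests the labels are clumped; too many suggests they alternate more than chance would produce.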
OK, that is not the correct format for the KS test (which is expecting data ranging from 0 to 1 with a fairly flat histogram). You could possibly test this with a Chi-squared test. Can you tell us more about how the numbers you are looking at are generated?

The Chi-squared test could be used on counts of 1-5 and compared to the assumption that each is equally likely, but there still is the question of power and how close to uniform is uniform enough. You would need huge samples to find a difference if the true distribution is only slightly non-uniform.

-- 
Gregory (Greg) L. Snow Ph.D.
Statistical Data Center
Intermountain Healthcare
greg.snow@imail.org
801.408.8111

From: kairavibhakta@googlemail.com [mailto:kairavibhakta@googlemail.com] On Behalf Of Kairavi Bhakta
Sent: Friday, June 10, 2011 2:16 PM
To: Greg Snow; r-help@r-project.org
Subject: RE: [R] Test if data uniformly distributed (newbie)
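The Chi-squared suggestion and the remark about power can be sketched as follows. All numbers here are made up for illustration: the counts vector, the slightly non-uniform probabilities in true_p, and the two sample sizes are assumptions, not data from the thread.

```r
# Hypothetical counts of how often each of 5 categories was observed
counts <- c(18, 22, 20, 25, 15)

# With a single vector, chisq.test compares against equal probabilities by default
res <- chisq.test(counts)
print(res)

# Power illustration: an assumed small departure from 0.20 per category
true_p <- c(0.21, 0.20, 0.20, 0.20, 0.19)
set.seed(3)                                   # arbitrary seed
small <- as.vector(rmultinom(1, size = 100, prob = true_p))
large <- as.vector(rmultinom(1, size = 1e6, prob = true_p))

print(chisq.test(small)$p.value)              # typically unremarkable at n = 100
print(chisq.test(large)$p.value)              # tiny at n = 1,000,000
```

The same departure from uniformity goes from practically undetectable to overwhelmingly significant purely through sample size, which is why "how close to uniform is uniform enough" has to be decided separately from the test.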