Serena De Stefani
2019-Aug-23 21:52 UTC
[R] A goodness of fit test for two discrete distributions with unequal variance?
I have a computer simulation in which a virtual agent ends up in different areas of a layout based on several factors. There are 18 conditions in total. If I collapse the data points into bins, where each bin is one of the areas, the data look like this:

    x0 <- c(3,3,5,5,2) # computer simulation

Now I would like to validate this model by having human subjects go through the same conditions, but I run into two sets of issues:

1. The first issue is that the dataset is discrete and small (there may be fewer than 5 counts in a bin, which is a problem for a Chi-Square Goodness of Fit test), and there may also be ties. After some online digging I found two options:
   - a permutation test
   - a Cramer-von Mises test of goodness-of-fit (see this paper: https://journal.r-project.org/archive/2011/RJ-2011-016/RJ-2011-016.pdf)

I thought the Cramer-von Mises goodness-of-fit test could work, so I ran it with made-up data for *one human subject* and got the following result:

    x0 <- c(3,3,5,5,2) # computer simulation
    x1 <- c(4,2,5,4,3) # subject 1

    library(goftest)

    cvm.test(x0, ecdf(x1))

    Cramer-von Mises test of goodness-of-fit
    Null hypothesis: distribution 'ecdf(x1)'
    data: x0
    omega2 = 0.14667, p-value = 0.4106

So far so good. But now let's say I would like to have more than one human subject, say four of them. These are the results from the additional subjects:

    x2 <- c(3,3,5,2,5) # subject 2
    x3 <- c(2,2,5,6,3) # subject 3
    x4 <- c(3,2,5,6,2) # subject 4

Now I run into the second set of issues:

2. On one side I have a single computer simulation; on the other side I have data from four subjects. Should I take the mean of the results for the human subjects? Would my data then still be "discrete"? Or should I run my simulation four times? But I would always get the same results, so the variance of the two datasets would be different.

Any ideas? Maybe I should change the design and have more levels for my factors, so that I have more trials and the bins get bigger?
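One possible way to set up the comparison described above, offered only as a rough sketch and not as the answer to the design question: x0 to x4 hold bin counts (each sums to 18), whereas cvm.test() expects a vector of raw observations, so the call in the post compares five numbers against the ECDF of five other numbers, which may not be the intended comparison. The base-R sketch below instead treats the simulation's bin proportions as a fixed null distribution and uses a Monte Carlo p-value from chisq.test(), since the expected counts are below 5. It assumes that trials are independent and that pooling the four subjects is acceptable; neither assumption comes from the post.

```r
# Bin counts from the post: 5 areas, 18 trials per vector.
x0 <- c(3, 3, 5, 5, 2)  # computer simulation
x1 <- c(4, 2, 5, 4, 3)  # subject 1
x2 <- c(3, 3, 5, 2, 5)  # subject 2
x3 <- c(2, 2, 5, 6, 3)  # subject 3
x4 <- c(3, 2, 5, 6, 2)  # subject 4

set.seed(1)  # the simulated p-value depends on the random seed

# Null proportions taken from the (deterministic) simulation.
# Assumption: every bin has a nonzero simulated count.
p0 <- x0 / sum(x0)

# (a) One subject against the simulated proportions.
# Expected counts are below 5, so request a Monte Carlo p-value
# instead of relying on the chi-square approximation.
gof_one <- chisq.test(x1, p = p0, simulate.p.value = TRUE, B = 10000)

# (b) All four subjects pooled into one vector of counts.
# Assumption: subjects are exchangeable and trials are independent.
pooled <- x1 + x2 + x3 + x4
gof_pooled <- chisq.test(pooled, p = p0, simulate.p.value = TRUE, B = 10000)

print(gof_one)
print(gof_pooled)
```

Pooling keeps the data as integer counts, so the "still discrete?" worry that comes with averaging does not arise, but it ignores between-subject variability. Whether that is appropriate depends on what the validation is meant to demonstrate.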
David Winsemius
2019-Aug-23 22:03 UTC
[R] A goodness of fit test for two discrete distributions with unequal variance?
On 8/23/19 2:52 PM, Serena De Stefani wrote:
> [...]
> Any ideas? Maybe I should change the design and have more levels for my
> factors, so that I have more trials and the bins get bigger?

Statistics questions, especially those from people who have failed to heed the advice of the Posting Guide to post in plain text, are off-topic on R-help and should be posted to a forum where statistics questions are welcomed. (My suspicion is that this question will be greeted with further requests for clarification of goals, since asking what you "should" do requires a careful explanation of what your standards of evidence are and what you are attempting to demonstrate.)

-- 
David.