Dear R experts,

I want to simulate unbalanced clustered data. There are 20 clusters, with an average of 30 observations per cluster. I would like to start by generating 10% more observations per cluster than specified (i.e., 33 rather than 30), and then randomly exclude the appropriate number of observations (i.e., 60) to arrive at the specified average of 30 observations per cluster. The probability of excluding an observation should not be uniform across clusters (i.e., some clusters have no cases removed and others have more excluded). In the end I should therefore still have 600 observations in total. How can I do this in R?

Thank you for your help!

Best,
Liu
Jeff Newmiller
2020-Dec-16 13:50 UTC
[R] Help with simulation of unbalanced clustered data
This is R-help, not R-do-my-work-for-me. It is also not a homework help line. The Posting Guide is required reading.

Assuming this is not homework, since each step in your problem definition can be mapped to a fairly basic operation in R (the sample function and indexing being key tools), you should be showing your work with a reproducible example that illustrates where you are stuck or why the result you are getting does not exhibit the desired properties.

On December 15, 2020 6:48:12 PM PST, Chao Liu <psychaoliu at gmail.com> wrote:
> ______________________________________________
> R-help at r-project.org mailing list -- To UNSUBSCRIBE and more, see
> https://stat.ethz.ch/mailman/listinfo/r-help
> PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
> and provide commented, minimal, self-contained, reproducible code.
Hi Chao Liu,

I'm having difficulty following your question and examples. I also don't see the motivation for increasing, then decreasing, the sample sizes. Intuitively, one would compute the correct sample sizes the first time round...

But I thought I'd add some comments, just in case they're useful.

If the problem relates to memberships (in clusters), then it can be simplified. All one needs is an integer vector, where each value is the index of the cluster. To compute random memberships of 600 observations in 20 clusters, one could run:

    m <- sample (1:20, 600, TRUE)

To compute the number of observations per cluster, one could then run:

    table (m)

In the above code, the probability of an observation being assigned to each cluster is uniform. Non-uniform sampling can be achieved by supplying a 4th argument (prob) to the sample function, which is a numeric vector of weights.
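For what it's worth, the oversample-then-remove procedure described in the original question can be sketched with the same tools (sample, its prob argument, and indexing). This is only one possible reading of the question, not a definitive implementation: the removal weights w, the choice to protect five clusters with a weight of 0, the placeholder outcome y, and the seed are all assumptions made for illustration.

```r
set.seed(1)  # arbitrary seed, only so the sketch is reproducible

n_clusters <- 20
n_over     <- 33  # 10% more than the target of 30 per cluster
n_remove   <- 60  # 20 * 33 - 20 * 30

# Balanced starting data: 660 rows, 33 per cluster
dat <- data.frame(cluster = rep(seq_len(n_clusters), each = n_over))
dat$y <- rnorm(nrow(dat))  # placeholder outcome (assumption)

# Hypothetical cluster-level removal weights; a weight of 0 means
# no observation from that cluster can be removed
w <- runif(n_clusters)
w[sample(n_clusters, 5)] <- 0

# Draw 60 rows to drop, without replacement, with probability
# proportional to the weight of each row's cluster
drop <- sample(nrow(dat), n_remove, replace = FALSE, prob = w[dat$cluster])
dat2 <- dat[-drop, ]

# Cluster sizes after removal (tabulate keeps empty clusters as 0)
sizes <- tabulate(dat2$cluster, nbins = n_clusters)
nrow(dat2)   # total number of rows remaining
mean(sizes)  # average cluster size
sizes
```

The total is fixed by construction (660 rows minus 60 removed), so the average cluster size is 30, while the individual cluster sizes vary with the weights. Clusters given a weight of 0 keep all 33 observations.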