AbouEl-Makarim Aboueissa
2021-Sep-04 21:12 UTC
[R] Splitting a data column randomly into 3 groups
Dear Thomas:

Thank you very much for your input in this matter.

The core part of this R code (please see below) was written by *Richard O'Keefe*. I had three examples with different sample sizes.

*First sample of size n1 = 204* divided randomly into three groups of size 68 each. *No problems with this one*.

*The second sample of size n2 = 112* divided randomly into three groups of sizes 37, 37, and 38. BUT this R code generated three groups of equal sizes (37, 37, and 37). *How to fix the code to make sure that the output will be three groups of sizes 37, 37, and 38*.

*The third sample of size n3 = 284* divided randomly into three groups of sizes 94, 95, and 95. BUT this R code generated three groups of equal sizes (94, 94, and 94). *Again*, *how to fix the code to make sure that the output will be three groups of sizes 94, 95, and 95*.

With many thanks
abou

########### ------------------------ #############

N1 <- 485
population1.IDs <- seq(1, N1, by = 1)
#### population1.IDs

n1 <- 204   ##### in this case the size of each of the three groups = 68
sample1.IDs <- sample(population1.IDs, n1)
#### sample1.IDs

#### n1 <- length(sample1.IDs)

m1 <- n1 %/% 3
s1 <- sample(1:n1, n1)
group1.IDs <- sample1.IDs[s1[1:m1]]
group2.IDs <- sample1.IDs[s1[(m1+1):(2*m1)]]
group3.IDs <- sample1.IDs[s1[(m1*2+1):(3*m1)]]

groups.IDs <- cbind(group1.IDs, group2.IDs, group3.IDs)

groups.IDs

####### --------------------------

N2 <- 266
population2.IDs <- seq(1, N2, by = 1)
#### population2.IDs

n2 <- 112   ##### in this case the sizes of the three groups are (37, 37, and 38)
            ##### BUT this code generates three groups of equal sizes (37, 37, and 37)
sample2.IDs <- sample(population2.IDs, n2)
#### sample2.IDs

#### n2 <- length(sample2.IDs)

m2 <- n2 %/% 3
s2 <- sample(1:n2, n2)
group1.IDs <- sample2.IDs[s2[1:m2]]
group2.IDs <- sample2.IDs[s2[(m2+1):(2*m2)]]
group3.IDs <- sample2.IDs[s2[(m2*2+1):(3*m2)]]

groups.IDs <- cbind(group1.IDs, group2.IDs, group3.IDs)

groups.IDs

####### --------------------------

N3 <- 674
population3.IDs <- seq(1, N3, by = 1)
#### population3.IDs

n3 <- 284   ##### in this case the sizes of the three groups are (94, 95, and 95)
            ##### BUT this code generates three groups of equal sizes (94, 94, and 94)
sample3.IDs <- sample(population3.IDs, n3)
#### sample3.IDs

#### n3 <- length(sample3.IDs)

m3 <- n3 %/% 3
s3 <- sample(1:n3, n3)
group1.IDs <- sample3.IDs[s3[1:m3]]
group2.IDs <- sample3.IDs[s3[(m3+1):(2*m3)]]
group3.IDs <- sample3.IDs[s3[(m3*2+1):(3*m3)]]

groups.IDs <- cbind(group1.IDs, group2.IDs, group3.IDs)

groups.IDs

______________________

*AbouEl-Makarim Aboueissa, PhD*

*Professor, Statistics and Data Science*
*Graduate Coordinator*

*Department of Mathematics and Statistics*
*University of Southern Maine*

On Sat, Sep 4, 2021 at 11:54 AM Thomas Subia <tgs77m at yahoo.com> wrote:

> Abou,
>
> I've been following your question on how to split a data column randomly
> into 3 groups using R.
>
> My method may not be amenable for a large set of data but it is surely worth
> considering since it makes sense intuitively.
>
> mydata <- LETTERS[1:11]
>
> > mydata
> [1] "A" "B" "C" "D" "E" "F" "G" "H" "I" "J" "K"
>
> # Let's choose a random sample of size 4 from mydata
>
> > random_grp1
> [1] "J" "H" "D" "A"
>
> Now my next random selection of data is defined by
>
> data_wo_random <- setdiff(mydata,random_grp1)
>
> # this makes sense because I need to choose random data from a set which
> is defined by the difference of the sets mydata and random_grp1
>
> > data_wo_random
> [1] "B" "C" "E" "F" "G" "I" "K"
>
> This is great! So now I can randomly select data of any size from this set.
>
> Repeating this process can easily generate subgroups of your original
> dataset of any size you want.
>
> Surely this method could be improved so that this could be done
> automatically.
>
> Nevertheless, this is an intuitive method which I believe is easier to
> understand than some of the other methods posted.
>
> Hope this helps!
>
> Thomas Subia
> Statistician
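Thomas's remark that his setdiff() approach "could be done automatically" might look roughly like the sketch below. The helper name `draw_groups` and its arguments are made up for illustration and appear in none of the posts; the sketch also assumes the items are unique, because setdiff() silently drops duplicates.

```r
# Hypothetical helper (not from the thread): draw groups of the requested
# sizes one at a time, removing each drawn group from the remaining pool.
draw_groups <- function(items, sizes) {
  # Assumption: items are unique, since setdiff() discards duplicates.
  stopifnot(sum(sizes) <= length(items), !anyDuplicated(items))
  pool <- items
  groups <- vector("list", length(sizes))
  for (i in seq_along(sizes)) {
    # Sample positions, not values, so a pool holding a single number
    # is never interpreted by sample() as the range 1:pool.
    groups[[i]] <- pool[sample.int(length(pool), sizes[i])]
    pool <- setdiff(pool, groups[[i]])
  }
  groups
}

set.seed(1)  # arbitrary seed, only so the sketch is reproducible
grps <- draw_groups(LETTERS[1:11], c(4, 4, 3))
lengths(grps)  # 4 4 3
```

The group sizes still have to be supplied by the caller, so this automates the repeated drawing but not the choice of sizes.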
I have a more general problem for you.

Given n items and 2 <= g << n, how do you divide the n items into g groups that are as "equal as possible"?

First, operationally define "as equal as possible."
Second, define the algorithm to carry out the definition. Hint: Note that sum{m[i]} for i <= g must equal n, where m[i] is the number of items in the ith group.
Third, write R code for the algorithm.

Exercise for the reader.

I may be wrong, but I think numerical analysts might also have a little fun here.

Randomization, of course, is trivial.

Cheers,
Bert

Bert Gunter

"The trouble with having an open mind is that people keep coming along and sticking things into it."
-- Opus (aka Berkeley Breathed in his "Bloom County" comic strip)

______________________________________________
R-help at r-project.org mailing list -- To UNSUBSCRIBE and more, see
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.
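One possible reading of Bert's exercise, offered as a sketch and not as his intended answer: take "as equal as possible" to mean that group sizes differ by at most one. Then every group gets n %/% g items and n %% g of the groups get one more. The function names `balanced_sizes` and `random_split` are hypothetical, and placing the larger groups last is an arbitrary choice made to match the sizes Abou asked for.

```r
# Hypothetical helper: g group sizes that differ by at most one and sum to n.
balanced_sizes <- function(n, g) {
  stopifnot(g >= 1, n >= g)
  base  <- n %/% g                  # every group gets at least this many
  extra <- n %% g                   # this many groups get one more item
  base + (seq_len(g) > g - extra)   # arbitrary choice: larger groups last
}

# Hypothetical helper: random split of `items` into g balanced groups.
random_split <- function(items, g) {
  sizes  <- balanced_sizes(length(items), g)
  labels <- sample(rep(seq_len(g), times = sizes))  # shuffled group labels
  unname(split(items, labels))
}

balanced_sizes(204, 3)  # 68 68 68
balanced_sizes(112, 3)  # 37 37 38
balanced_sizes(284, 3)  # 94 95 95
```

Shuffling the labels and not the items keeps the randomization to a single sample() call, and split() then returns the groups as a list, which tolerates unequal lengths.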
Abou,

I believe I addressed this issue in a private message the other day. As a general rule, truncating can leave a remainder.

If
M = length(whatever)/3
then M is not necessarily an integer. Its fractional part can be .333... or .666... as well as 0.

Now R may silently truncate something like 100/3, which you seem to use, when it appears as an index and treat it as if you had typed 33. Same for 2*M.

In your code, you used integer division and that is a truncation too!

m1 <- n1 %/% 3
s1 <- sample(1:n1, n1)
group1.IDs <- sample1.IDs[s1[1:m1]]
group2.IDs <- sample1.IDs[s1[(m1+1):(2*m1)]]
group3.IDs <- sample1.IDs[s1[(m1*2+1):(3*m1)]]

A proper solution accounts for any leftover items. One method is to leave all extra items till the end and have:

MAX <- length(sample1.IDs)
group3.IDs <- sample1.IDs[s1[(m1*2+1):MAX]]

The last group then might have one or two extra items. Another is to make a second sweep, take any leftover items, and move one each into whichever groups you wish for some balance.

Or, as discussed, there are packages available that let you specify the percentages you want and handle these edge cases too.
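The two manual fixes described in the reply above could be sketched as follows for the n2 = 112 example. This is an illustration under assumptions and not the poster's own code: the seed is arbitrary, and the names `by_tail`, `by_sweep`, `leftover` and `receivers` are invented here.

```r
set.seed(42)  # arbitrary seed, only so the sketch is reproducible
sample2.IDs <- sample(seq_len(266), 112)
n2 <- length(sample2.IDs)
m2 <- n2 %/% 3          # 37, so 3 * m2 = 111 and one ID is left over
s2 <- sample.int(n2)    # random permutation of 1..n2

# Method 1: let the last group run to n2 so it absorbs the remainder.
by_tail <- list(
  sample2.IDs[s2[1:m2]],
  sample2.IDs[s2[(m2 + 1):(2 * m2)]],
  sample2.IDs[s2[(2 * m2 + 1):n2]]
)
lengths(by_tail)  # 37 37 38

# Method 2: form three groups of m2, then hand each leftover ID
# to a different, randomly chosen group (the "second sweep").
by_sweep <- list(
  sample2.IDs[s2[1:m2]],
  sample2.IDs[s2[(m2 + 1):(2 * m2)]],
  sample2.IDs[s2[(2 * m2 + 1):(3 * m2)]]
)
leftover  <- sample2.IDs[s2[seq_len(n2 - 3 * m2) + 3 * m2]]
receivers <- sample.int(3, length(leftover))
for (k in seq_along(leftover)) {
  by_sweep[[receivers[k]]] <- c(by_sweep[[receivers[k]]], leftover[k])
}
sort(lengths(by_sweep))  # 37 37 38
```

The groups are kept in a list here because cbind() on vectors of unequal length recycles the shorter ones and issues a warning, which would make the combined matrix misleading.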