Thomas Subia
2021-Sep-04 16:25 UTC
[R] Re: Splitting a data column randomly into 3 groups
I was wondering if this is a good alternative method to split a data column into distinct groups.

Let's say I want my first group to have 4 elements selected randomly:

mydata <- LETTERS[1:11]
random_grp <- sample(mydata, 4, replace = FALSE)

Now random_grp is:

> random_grp
[1] "H" "E" "A" "D"   # How's that for a random selection!

My choices for another group of random data now become:

data_wo_random <- setdiff(mydata, random_grp)

> data_wo_random
[1] "B" "C" "F" "G" "I" "J" "K"

From this reduced dataset, I can generate another random selection of any size I choose. One problem is that this becomes cumbersome when one's original dataset is large, or when one wants to subgroup the original dataset into many different subgroup sizes. Nevertheless, it's an intuitive method which is relatively easy to understand.

Hope this helps!

Thomas Subia
Statistician
Thomas,

There are many approaches, tried over the years, to do partitioning along the lines you mention, and others. R already has many of them built in or available in packages, including some that are quite optimized, so anyone doing serious work can often avoid doing this the hard way and build on earlier work. Obviously, some people who are learning may take on such challenges, or have them assigned as homework. And if you want to tweak the efficiency, you can do things like verify that the conditions needed by sample() are met and then call sample.int() directly, and so on.

But fundamentally, a large subset of these kinds of sampling problems can be solved by just playing with indices. It does not matter whether your data is in the form of a list or another kind of vector, or a data.frame or matrix: anything you can subset with integers will do. An algorithm can determine how many subsets of the indices you want, calculate how many go into each bucket, and proceed fairly simply. One approach is to scramble the indices in some form (a vector of them, or something more like an unordered set), take the first however-many for the first partition, the next ones for each additional partition, and finally use the selected indices to subset the original data into multiple smaller collections (see the first sketch below).

Obviously you can instead work in stages, if you prefer, and your algorithm seems to be along those lines: start with the full data and pull out what you want for the first partition; then, with what is left, repeat for the second partition, and so on, until what remains forms the final partition (the second sketch below generalizes this). Arguably this may in some ways be more work, especially for larger amounts of data.

I do note that many statistical processes, such as bootstrapping, allow various kinds of overlap, where the same item may be used repeatedly, sometimes even within a single random sample. In those cases the algorithm has to sample with replacement, which is a somewhat different discussion (third sketch below).

What I am finding here is that many problems posed are not explained in the way they turn out to be needed in the end, so answering a question before we know what it is can be a tad premature and waste time all around. In reality, some people new to R, or to computing in general, may be stuck precisely on understanding the question they are trying to solve, or may not realize that their use of language (especially when English is not one of their stronger languages) can be a problem, as their readers assume they mean something else.

Your suggestion of how to do some of this is reasonable, and you note a concern about larger amounts of data. Many languages, Python for example, have higher-level constructs arguably better designed for some of these problems. R was built with vectorization in mind, and most things in it are ultimately vectors in some sense. For some purposes it would be nice to have primitives along the lines of sets, bags, hashes/dictionaries and so on. People have added some things along these lines, but if you look at the implementation of the setdiff() you used, it just calls unique() on two arguments coerced into vector form (fourth sketch below)!

So consider what happens if you instead start in a language with a native set construct, where sets are implemented efficiently. To make N distinct groupings, you might begin by adding all the indices, or even the complete items, into a set. You can then ask for a random element from the set (perhaps with deletion) until you have your N items, then ask for the next group of N', then N'', until you have what you need and perhaps the set is empty. An implementation that uses some form of hashing to store and access set items can make lookups take about the same time no matter the size: there is no endless copying of parts of the data, and there may be no need for a setdiff() step at all, or a more efficient way to do it. And, of course, if your data can contain redundant elements or rows, some kind of bag primitive may be useful, letting you scramble and partition without using explicit indices (the fifth sketch below approximates the set idea in R).
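Here is a minimal sketch of that index-scrambling idea in base R; the sizes and object names are merely illustrative:

mydata <- LETTERS[1:11]
sizes  <- c(4, 4, 3)                    # desired bucket sizes, summing to length(mydata)
idx    <- sample(length(mydata))        # a scrambled permutation of the indices
grp    <- rep(seq_along(sizes), sizes)  # 1 1 1 1 2 2 2 2 3 3 3
parts  <- split(mydata[idx], grp)       # a list holding the three partitions

Everything here subsets by integer index, so the same idea works for the rows of a data.frame or matrix via mydata[idx, ].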
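And a sketch of the staged approach generalized to any vector of group sizes; partition_staged is a made-up name, and the setdiff() step assumes the elements are unique:

partition_staged <- function(x, sizes) {
  out  <- vector("list", length(sizes))
  left <- x
  for (i in seq_along(sizes)) {
    out[[i]] <- sample(left, sizes[i], replace = FALSE)  # draw one partition
    left     <- setdiff(left, out[[i]])                  # shrink the remaining pool
  }
  out
}
partition_staged(LETTERS[1:11], c(4, 4, 3))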
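For the with-replacement case, base R needs only a flag change; a trivial sketch:

mydata <- LETTERS[1:11]
boot_sample <- sample(mydata, length(mydata), replace = TRUE)  # duplicates now possible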
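For reference, base R's setdiff() is essentially the following (print setdiff at your own prompt to see the exact version shipped with your R):

> setdiff
function (x, y)
{
    x <- as.vector(x)
    y <- as.vector(y)
    unique(if (length(x) || length(y)) x[match(x, y, 0L) == 0L] else x)
}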
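R has no native set type, but an environment, which is hashed underneath, can approximate the draw-with-deletion behaviour described above. A rough sketch, with all names invented; it mimics the interface, though not the constant-time efficiency, of a true hashed set, since ls() enumerates every key:

s <- new.env(hash = TRUE)
for (el in LETTERS[1:11]) assign(el, TRUE, envir = s)

draw_one <- function(set) {
  k <- sample(ls(set), 1L)   # pick one remaining key at random
  rm(list = k, envir = set)  # and delete it from the 'set'
  k
}

grp1 <- replicate(4, draw_one(s))  # a first partition of size 4
ls(s)                              # the seven items still available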
R has been used extensively for a long time, and most people use what is already there; some create new object types or packages, of course. I have sometimes taken a hybrid approach, and one interesting variant is a hybrid environment: I have, for example, done things like the above by writing a program that uses a package letting an R and a Python interpreter work together on data they more or less share and at times interconvert (a sketch of one such setup closes this message). Where you have statistical routines you trust in R but want some of the preparation, arrangement, or further analysis done with functionality you have in Python, you can combine the best of both worlds in a single program. Heck, the same environment may also be building a document in a markup language, with code in both languages embedded and results, including graphics, selectively output, yielding something like a PDF or another document form.

I know the current discussion was, more or less, about dividing a set of data into three groups, and efficiency may not be a major consideration, especially as whatever is done with the groups may be the dominant user of resources. But many newer techniques, such as random forests, involve taking the same data, repeatedly partitioning it, running an analysis, and repeating many thousands of times before combining the results in some way. Some break the data down over multiple stages until they have just a few items. Some even allow strange things, like small groups that could in theory consist of the same original item repeated several times; that makes no sense in the real world, but it may improve the results. So the partitioning step may well be worth doing well.
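Finally, as promised, a minimal sketch of the hybrid R/Python idea. I will use the reticulate package here, which is an assumption on my part, since no particular package was named above:

library(reticulate)  # bridges an R session and a Python interpreter

py_run_string("
import random
s = set(range(11))                  # a native Python set
grp1 = random.sample(sorted(s), 4)  # draw 4 without replacement
s.difference_update(grp1)           # delete the drawn items from the set
")

py$grp1  # the Python result, now available to R code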