I have data similar to this: Location Surveyor Result A 1 83 A 2 76 A 3 45 B 1 71 B 4 67 C 2 23 C 5 12 D 3 34 E 4 75 F 4 46 G 5 90 etc (5 million records in total) I need to divide the data to many subsets then randomly select 5 different locations and 5 different surveyors (one at each of the 5 randomly selected locations) for each subset. The function I have written basically picks five locations and then 1 surveyor in each location, checks that there are five different surveyors and if there isn't tries again. The problem is that for some subsets this doesn't work. Some subsets don't have enough locations/surveyors or both, but this can be checked for easily. The problem subsets do have enoughs locations and surveyors but still cannot produce 5 locations each with a different surveyor. The matrix below demonstrates such a subset: locations A B C D E 1 1 0 0 0 0 Surveyors 2 1 0 0 0 0 3 1 0 0 0 0 4 1 0 0 0 0 5 1 1 1 1 1 I cannot think of a way to check for such a situation and therefore I have simply programmed the function to give up after 100 attempts if it can't find a solution. This is not very satisfactory however as the analysis takes a very long time to run and it would also be very useful useful for me to know how many suitable solution there are. I reckon some of you clever folk out there must be able to think of a better solution. Any advice appreciated, Ben
Ben Holt <BHolt <at> bio.ku.dk> writes:> I have data similar to this: > > Location Surveyor Result > A 1 83 > A 2 76 > A 3 45 > B 1 71 > B 4 67 > C 2 23 > C 5 12 > D 3 34 > E 4 75 > F 4 46 > G 5 90 > etc (5 million records in total) > > I need to divide the data to many subsets then randomly select 5 different > locations and 5 different surveyors (one at each of the 5 randomly selected > locations) for each subset. > > The function I have written basically picks five locations and then 1 > surveyor in each location, checks that there are five different surveyors > and if there isn't tries again. The problem is that for some subsets this > doesn't work. > > Some subsets don't have enough locations/surveyors or both, but this can be > checked for easily. The problem subsets do have enoughs locations and > surveyors but still cannot produce 5 locations each with a different > surveyor. The matrix below demonstrates such a subset: > > locations > A B C D E > 1 1 0 0 0 0 > Surveyors 2 1 0 0 0 0 > 3 1 0 0 0 0 > 4 1 0 0 0 0 > 5 1 1 1 1 1 > > I cannot think of a way to check for such a situation and therefore I have > simply programmed the function to give up after 100 attempts if it can't > find a solution. This is not very satisfactory however as the analysis takes > a very long time to run and it would also be very useful useful for me to > know how many suitable solution there are.If I understand you correctly: The task of finding a maximal number of 'independent' pairs (of surveyors and locations) is called the "generalized (or: maximum) linear assignment" problem, see http://en.wikipedia.org/wiki/Generalized_assignment_problem and there appears not to exist an efficient algorithm. One can think of procedures to find at least 5 such pairs (or to report infeasibility), but I think with millions of records in one of those subsets any approach (using loops) will be intolerable slow. Perhaps you can rethink your original intention or try to provide additional information during the process of generating the data. Of course, I would be glad to hear more positive news as this kind of problem does show up in many discrete optimization tasks. Hans Werner> I reckon some of you clever folk out there must be able to think of a better > solution. > > Any advice appreciated, > > Ben >
Look for a vectorized solution such as ?table or ?aggregate to obtain a list of combination counts, followed by logical indexing or ?subset to get a set of valid combinations, and then use ?sample to get your random selections of locations/surveyors and then process only those combinations from your original data set. Keep in mind that constraints that you apply late in the selection process that can be applied earlier or on less data will usually become more efficient. If you provide actual sample statements that you have already tried, you may get a less abstract response. See the posting guide mentioned at the end of every posting for more guidance. "Ben Holt" <BHolt at bio.ku.dk> wrote:>I have data similar to this: > >Location Surveyor Result >A 1 83 >A 2 76 >A 3 45 >B 1 71 >B 4 67 >C 2 23 >C 5 12 >D 3 34 >E 4 75 >F 4 46 >G 5 90 >etc (5 million records in total) > >I need to divide the data to many subsets then randomly select 5 different locations and 5 different surveyors (one at each of the 5 randomly selected locations) for each subset. > >The function I have written basically picks five locations and then 1 surveyor in each location, checks that there are five different surveyors and if there isn't tries again. The problem is that for some subsets this doesn't work. > >Some subsets don't have enough locations/surveyors or both, but this can be checked for easily. The problem subsets do have enoughs locations and surveyors but still cannot produce 5 locations each with a different surveyor. The matrix below demonstrates such a subset: > > locations > A B C D E > 1 1 0 0 0 0 >Surveyors 2 1 0 0 0 0 > 3 1 0 0 0 0 > 4 1 0 0 0 0 > 5 1 1 1 1 1 > >I cannot think of a way to check for such a situation and therefore I have simply programmed the function to give up after 100 attempts if it can't find a solution. This is not very satisfactory however as the analysis takes a very long time to run and it would also be very useful useful for me to know how many suitable solution there are. > >I reckon some of you clever folk out there must be able to think of a better solution. > >Any advice appreciated, > >Ben > >______________________________________________ >R-help at r-project.org mailing list >https://stat.ethz.ch/mailman/listinfo/r-help >PLEASE do read the posting guide http://www.R-project.org/posting-guide.html >and provide commented, minimal, self-contained, reproducible code.--------------------------------------------------------------------------- Jeff Newmiller The ..... ..... Go Live... DCN:<jdnewmil at dcn.davis.ca.us> Basics: ##.#. ##.#. Live Go... Live: OO#.. Dead: OO#.. Playing Research Engineer (Solar/Batteries O.O#. #.O#. with /Software/Embedded Controllers) .OO#. .OO#. rocks...1k --------------------------------------------------------------------------- Sent from my phone. Please excuse my brevity.