Mark
2006-Mar-01 20:40 UTC
[R] Optimization problem: selecting independent rows to maximize the mean
Dear R community, I have a dataframe with 500,000 rows and 102 columns. The rows represent spatial polygons, some of which overlap others (i.e., not all rows are independent of each other). Given a particular row, the first column contains a unique "RowID". The second column contains the "Variable" of interest. The remaining 100 columns ("Overlap1" ... "Overlap100") each contain a row ID that overlaps this row (but if this row overlaps fewer than 100 other rows then the remainder of the columns "OL1...OL100" contain NA). Here's the problem: I need to select the subset of 500 independent rows that maximizes the mean and minimizes the stdev of "Variable". Clearly this requires iterative selection and comparison of rows, because each newly-selected row must be compared to rows already selected to ensure it does not overlap them. At each step, a row already selected might be removed from the subset if it can be replaced with another that increases the mean and/or reduces the stdev. The above description is a simplification of my problem, but it's a start. As I am new to R (and programming in general) I'm not sure how to start thinking about this, or even where to look. I'd appreciate any ideas that might help. Thank you, Mark
Berton Gunter
2006-Mar-01 21:07 UTC
[R] Optimization problem: selecting independent rows to maximizethe mean
This sounds either easy via a greedy algorithm or NP-hard. Moreover, it is not clear to me that 1) A subset of 500 indpendent rows exists, where I presume "independent" means pairwise nonoverlapping; 2) That the mean and sd can be simultaneously optimized as you describe-- what if the subset with maximum mean also has bigger than minimal sd? -- Bert Gunter Genentech Non-Clinical Statistics South San Francisco, CA "The business of the statistician is to catalyze the scientific learning process." - George E. P. Box> -----Original Message----- > From: r-help-bounces at stat.math.ethz.ch > [mailto:r-help-bounces at stat.math.ethz.ch] On Behalf Of Mark > Sent: Wednesday, March 01, 2006 12:40 PM > To: r-help at stat.math.ethz.ch > Subject: [R] Optimization problem: selecting independent rows > to maximizethe mean > > Dear R community, > > I have a dataframe with 500,000 rows and 102 columns. The rows > represent spatial polygons, some of which overlap others (i.e., not > all rows are independent of each other). > > Given a particular row, the first column contains a unique "RowID". > The second column contains the "Variable" of interest. The remaining > 100 columns ("Overlap1" ... "Overlap100") each contain a row ID that > overlaps this row (but if this row overlaps fewer than 100 other rows > then the remainder of the columns "OL1...OL100" contain NA). > > Here's the problem: I need to select the subset of 500 independent > rows that maximizes the mean and minimizes the stdev of "Variable". > > Clearly this requires iterative selection and comparison of rows, > because each newly-selected row must be compared to rows already > selected to ensure it does not overlap them. At each step, a row > already selected might be removed from the subset if it can be > replaced with another that increases the mean and/or reduces the > stdev. > > The above description is a simplification of my problem, but > it's a start. > > As I am new to R (and programming in general) I'm not sure how to > start thinking about this, or even where to look. I'd appreciate any > ideas that might help. > > Thank you, Mark > > ______________________________________________ > R-help at stat.math.ethz.ch mailing list > https://stat.ethz.ch/mailman/listinfo/r-help > PLEASE do read the posting guide! > http://www.R-project.org/posting-guide.html >
Gabor Grothendieck
2006-Mar-01 21:32 UTC
[R] Optimization problem: selecting independent rows to maximize the mean
Package lpSolve might help. On 3/1/06, Mark <mtb954 at gmail.com> wrote:> Dear R community, > > I have a dataframe with 500,000 rows and 102 columns. The rows > represent spatial polygons, some of which overlap others (i.e., not > all rows are independent of each other). > > Given a particular row, the first column contains a unique "RowID". > The second column contains the "Variable" of interest. The remaining > 100 columns ("Overlap1" ... "Overlap100") each contain a row ID that > overlaps this row (but if this row overlaps fewer than 100 other rows > then the remainder of the columns "OL1...OL100" contain NA). > > Here's the problem: I need to select the subset of 500 independent > rows that maximizes the mean and minimizes the stdev of "Variable". > > Clearly this requires iterative selection and comparison of rows, > because each newly-selected row must be compared to rows already > selected to ensure it does not overlap them. At each step, a row > already selected might be removed from the subset if it can be > replaced with another that increases the mean and/or reduces the > stdev. > > The above description is a simplification of my problem, but it's a start. > > As I am new to R (and programming in general) I'm not sure how to > start thinking about this, or even where to look. I'd appreciate any > ideas that might help. > > Thank you, Mark > > ______________________________________________ > R-help at stat.math.ethz.ch mailing list > https://stat.ethz.ch/mailman/listinfo/r-help > PLEASE do read the posting guide! http://www.R-project.org/posting-guide.html >
nojhan
2006-Mar-02 08:28 UTC
[R] Optimization problem: selecting independent rows to maximizethe mean
Le Wed, 01 Mar 2006 13:07:07 -0800, Berton Gunter a ?crit?:> 2) That the mean and sd can be simultaneously optimized as you describe-- > what if the subset with maximum mean also has bigger than minimal sd?Then you have two choices : 1) balance the two objectives with weights, according to the importance you give to each one 2) get a list of non-dominated solutions (a "Pareto front") Does R have packages for such multi-objectives optimization problems ? Moreover, does it have a package for "difficult" (i.e. NP-hard) problems ? -- nojhan
Reasonably Related Threads
- Assignments inside lapply
- Length as fun.aggregate in cast function of reshape package: unexpected error
- R ggplot2 legend text left justify
- How can we compare corresponding values of x and y (first value of x exacly matches with the first value of y)?
- How to use `[` without evaluating the arguments.