Hello,

I am stuck with selecting the right rows from a data frame. I think the problem is rather how to select them than how to implement the R code.

Consider the following data frame:

df <- data.frame(ID = c(1,2,3,4,5,6,7,8,9,10),
                 value = c(34,12,23,25,34,42,48,29,30,27))

What I want to achieve is to select 7 rows (values) so that the mean value of those rows is closest to 35, and the remaining 3 rows (values) have a mean closest to 45. However, each value is only allowed to be sampled once!

Any ideas how to achieve that?

Cheers

--
View this message in context: http://r.789695.n4.nabble.com/select-rows-by-criteria-tp4434812p4434812.html
Sent from the R help mailing list archive at Nabble.com.
Hello,

syrvn wrote:
> Consider the following data frame:
>
> df <- data.frame(ID = c(1,2,3,4,5,6,7,8,9,10),
>                  value = c(34,12,23,25,34,42,48,29,30,27))
>
> What I want to achieve is to select 7 rows (values) so that the mean value
> of those rows is closest to 35 and the remaining 3 rows (values) are
> closest to 45. However, each value is only allowed to be sampled once!

See ?combn. It gives all possible combinations as a matrix (the default) or a list. Then, 'apply'.

#---------------------------
# Name changed to 'DF':
# 'df' is the R function for the F distribution density
# (and a frequent choice for example data in R-help!)
#
DF <- data.frame(ID = c(1,2,3,4,5,6,7,8,9,10),
                 value = c(34,12,23,25,34,42,48,29,30,27))

f <- function(j, v, const) abs(mean(v[j]) - const)

inxmat <- with(DF, combn(ID, 7))
meansDist1 <- apply(inxmat, 2, function(jnx) f(jnx, DF$value, 35))
(i1 <- which(meansDist1 == min(meansDist1)))

inxmat <- with(DF, combn(ID, 3))
meansDist2 <- apply(inxmat, 2, function(jnx) f(jnx, DF$value, 45))
(i2 <- which(meansDist2 == min(meansDist2)))

meansDist3 <- meansDist1 + meansDist2  # Compromise of both criteria?
(i3 <- which(meansDist3 == min(meansDist3)))

Maybe it's combn(1:10, 3)[, 101] you want, or maybe there's another way to compromise the two criteria.

Hope this helps,

Rui Barradas
Sorry, correction:

The second index matrix should be the matrix of elements not in the first, not another combination, this time 3 out of 10.

Change this in my first post:

> inxmat <- with(DF, combn(ID, 3))
> meansDist2 <- apply(inxmat, 2, function(jnx) f(jnx, DF$value, 45))
> (i2 <- which(meansDist2 == min(meansDist2)))

to this:

inxmat2 <- with(DF, apply(inxmat, 2, function(x) setdiff(ID, x)))
meansDist2 <- apply(inxmat2, 2, function(jnx) f(jnx, DF$value, 45))
(i2 <- which(meansDist2 == min(meansDist2)))

Rui Barradas
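Putting the first post together with this correction gives a complete, runnable sketch. The use of which.min and the structural checks at the end are additions for illustration, not from the original posts.

```r
# All 7-row subsets, each paired with its 3-row complement, scored
# against the two targets; the compromise minimizes the summed distances.
DF <- data.frame(ID = 1:10,
                 value = c(34, 12, 23, 25, 34, 42, 48, 29, 30, 27))

f <- function(j, v, const) abs(mean(v[j]) - const)

inxmat  <- combn(DF$ID, 7)                                  # 7-row subsets
inxmat2 <- apply(inxmat, 2, function(x) setdiff(DF$ID, x))  # 3-row complements

meansDist1 <- apply(inxmat,  2, function(jnx) f(jnx, DF$value, 35))
meansDist2 <- apply(inxmat2, 2, function(jnx) f(jnx, DF$value, 45))

i3 <- which.min(meansDist1 + meansDist2)  # compromise of both criteria

group7 <- inxmat[, i3]    # the 7 rows aimed at mean 35
group3 <- inxmat2[, i3]   # the remaining 3 rows aimed at mean 45
```

Each column of inxmat2 is the set complement of the corresponding column of inxmat, so the two groups always partition the 10 rows and no value is sampled twice.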
On Thu, Mar 01, 2012 at 04:27:45AM -0800, syrvn wrote:
> Consider the following data frame:
>
> df <- data.frame(ID = c(1,2,3,4,5,6,7,8,9,10),
>                  value = c(34,12,23,25,34,42,48,29,30,27))
>
> What I want to achieve is to select 7 rows (values) so that the mean value
> of those rows is closest to 35 and the remaining 3 rows (values) are
> closest to 45. However, each value is only allowed to be sampled once!

Hi.

If some 3 rows have mean close to 45, then they have sum close to 3*45, so the remaining 7 rows have sum close to

  sum(df$value) - 3*45  # [1] 169

and they have mean close to 169/7 = 24.14286. In other words, the two criteria cannot be optimized together.

For this reason, let me choose the criterion on 3 rows. The closest solution may be found as follows.

# generate all triples and compute their means
tripleMeans <- colMeans(combn(df$value, 3))

# select the index of the triple with mean closest to 35
indClosest <- which.min(abs(tripleMeans - 35))

# generate the indices which form the closest triple in df$value
tripleInd <- combn(1:length(df$value), 3)[, indClosest]
tripleInd  # [1] 1 3 7

# check the mean of the triple
mean(df$value[tripleInd])  # [1] 35

This code constructs all triples. If it is used for k-tuples for a larger k and for a set of n values, its complexity will be proportional to choose(n, k), so it will be large even for moderate n and k. It is hard to provide a significant speed up, since some variants of the "knapsack problem", which is NP-complete, may be reduced to your question. Consequently, it is, in general, NP-complete.

Hope this helps.

Petr Savicky.
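The arithmetic behind the incompatibility claim is easy to check directly. A minimal sketch, using the df from the question with the missing `=` supplied:

```r
df <- data.frame(ID = 1:10,
                 value = c(34, 12, 23, 25, 34, 42, 48, 29, 30, 27))

total <- sum(df$value)   # total over all 10 rows
total - 3 * 45           # sum forced onto the remaining 7 rows
(total - 3 * 45) / 7     # their implied mean, far below the target of 35
```

A 7-row mean of 35 would need a 7-row sum of 245, and a 3-row mean of 45 a 3-row sum of 135; 245 + 135 = 380 exceeds the grand total, so the two targets pull against each other.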
On Thu, Mar 01, 2012 at 05:42:48PM +0100, Petr Savicky wrote:
> [...]
>
> # select the index of the triple with mean closest to 35
> indClosest <- which.min(abs(tripleMeans - 35))

I am sorry. There should be 45 and not 35.

indClosest <- which.min(abs(tripleMeans - 45))

# generate the indices which form the closest triple in df$value
tripleInd <- combn(1:length(df$value), 3)[, indClosest]
tripleInd  # [1] 1 6 7

# check the mean of the triple
mean(df$value[tripleInd])  # [1] 41.33333

Petr Savicky.
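The trade-off described earlier can be seen by computing the mean of the complementary 7 rows alongside the chosen triple. The negative-indexing line is an addition for illustration, not from the thread:

```r
df <- data.frame(ID = 1:10,
                 value = c(34, 12, 23, 25, 34, 42, 48, 29, 30, 27))

tripleMeans <- colMeans(combn(df$value, 3))
indClosest  <- which.min(abs(tripleMeans - 45))
tripleInd   <- combn(1:length(df$value), 3)[, indClosest]

mean(df$value[tripleInd])    # mean of the chosen 3 rows
mean(df$value[-tripleInd])   # mean of the remaining 7 rows, well below 35
```

The best achievable 3-row mean is 41.33 (sum 124, the three largest values), which leaves the other 7 rows at a mean of about 25.71.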
On Thu, Mar 01, 2012 at 05:42:48PM +0100, Petr Savicky wrote:
> [...]
>
> It is hard to provide a significant speed up, since some variants of the
> "knapsack problem", which is NP-complete, may be reduced to your question.
> Consequently, it is, in general, NP-complete.

Hi.

Also this statement requires a correction.
It applies to the search for an exact optimum when the numbers in df$value are large. There are efficient algorithms that find an approximate solution. Also, if the numbers in df$value are integers (or may be rounded to integers after an appropriate scaling), then there is an algorithm whose complexity is O(k*n*max(df$value)). This may be significantly less than choose(n, k).

The CRAN task view "Optimization and Mathematical Programming"

  http://cran.at.r-project.org/web/views/Optimization.html

may also suggest other solutions.

Petr Savicky.
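The integer-value algorithm alluded to above can be sketched as a dynamic program over achievable subset sums. The function below (the name closest_k_subset_sum and this particular formulation are my own illustration, not from the thread) is a simple O(n * k * sum(values)) version, a looser bound than the one quoted, and assumes positive integer values:

```r
# Track which sums are achievable with exactly i elements,
# using each value at most once.
closest_k_subset_sum <- function(values, k, target) {
  maxsum <- sum(values)
  # reach[i + 1, s + 1] is TRUE if some i-subset of 'values' sums to s
  reach <- matrix(FALSE, nrow = k + 1, ncol = maxsum + 1)
  reach[1, 1] <- TRUE                      # 0 elements, sum 0
  for (v in values) {
    for (i in k:1) {                       # descending: use each value once
      shifted <- c(rep(FALSE, v), reach[i, 1:(maxsum + 1 - v)])
      reach[i + 1, ] <- reach[i + 1, ] | shifted
    }
  }
  sums <- which(reach[k + 1, ]) - 1        # all achievable k-subset sums
  sums[which.min(abs(sums - target))]      # the one closest to the target
}

vals <- c(34, 12, 23, 25, 34, 42, 48, 29, 30, 27)
best3 <- closest_k_subset_sum(vals, 3, 3 * 45)  # closest 3-subset sum to 135
```

This recovers the answer found by exhaustive search over triples, but its running time grows with n and the total sum rather than with choose(n, k).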