thr3ads.net - R help - [R] The 'subset matching' challenge [Oct 2009]

If this information is useful, please help other people find it:
Share via:

Yvonnick Noël

2009-Oct-29 14:47 UTC

[R] The 'subset matching' challenge

Dear all,

The following problem just has been submitted to me by an accountant.

In his new job, he has to close some old accounts. He has yearly 
amounts, and a list of products that have been bought over the years, at 
certain prices for which he has an exhaustive record. The problem is: He 
does not know what product was bought this or that year (don't ask). He 
does not want to find back the real story, but just write realistic 
accounts, for which the sum of a subset of product prices will give the 
exact yearly amount.

Here is a real example from his data:

# A list of 64 product prices
products = 
c(30500,30500,30500,30500,42000,42000,42000,42000,42000,42000,42000,42000,42000,42000,71040,90900,76950,35100,71190,
53730,456000,70740,70740,533600,83800,59500,27465,28000,28000,28000,28000,28000,26140,49600,77000,123289,27000,27000,27000,
27000,27000,27000,80000,33000,33000,55000,77382,48048,51186,40000,35000,21716,63051,15025,15025,15025,15025,800000,1110000,
59700,25908,829350,1198000,1031655)

# Global amount
amount = 4748652

Now he wants to find all subsets of the 'product' vector which sums to 
'amount'.

I wrote the following code, which is clearly not optimal:

# Create a matrix of subsets of size r among the integer set 1:n
subsets <- function(n, r, v = 1:n) {
  if(r <= 0) vector(mode(v), 0)
  else if(r >= n) v[1:n]
  else rbind(cbind(v[1], Recall(n-1, r-1, v[-1])),Recall(n-1, r, v[-1]))
}

# Main function
find.amount = function(amount,products) {
 
  if(sum(products)<amount) {
    cat("There is no solution.\n")
    return()
  }
 
  l = length(products)
  cat("\nThere are",l,"product prices\n\n")
  names(products) = paste("Product",1:l,sep="")
  products = sort(products)

  for(i in 2:l) {

    # If the sum of the i smallest prices is greater than amount, then stop
    if(sum(products[1:i])>amount) break

    # Look for matching subsets only in the case when the sum of i 
largest prices is greater than amount
    if(sum(rev(products)[1:i])>=amount) {
      # Generates all subsets of i indicies in 1:l
      subs = subsets(l,i)
      nl = nrow(subs)
      nc = ncol(subs)

      # Compute sums of corresponding price subsets
      sums = rowSums(matrix(products[subs],nl,nc))

      # Which ones match the global amount ?
      w = which(sums == amount)
      how.many = length(w)
      if(how.many>0) {
        cat("\n-->> There are",how.many,"solutions
with",nc,"products :\n")
        for(j in 1:how.many) {
          print(products[subs[w[j],]])
        }
      }
      else cat("\n-->> There is no solution
with",nc,"products.\n")
    }
    else cat("\n-->> There is no solution
with",i,"products.\n")
  }
}


Then I can use these functions on a smaller example:

 > find.amount(4,c(1,1,1,1,2,2))

and a number of matching subsets are provided. The problem is: This 
approach creates a whole matrix of subsets of r integers among 1:n, 
which rapidly gives huge matrices, and this is clearly not optimal for 
the real data provided above.

Would anyone have a suggestion as to an alternative and more efficient 
strategy?

Good luck,

Yvonnick Noel
University of Brittany, Rennes 2
France

Ben Bolker

2009-Oct-29 15:02 UTC

head link

[R] The 'subset matching' challenge

yvonnick noel wrote:> 
> Dear all,
> 
> The following problem just has been submitted to me by an accountant.
> 
> In his new job, he has to close some old accounts. He has yearly 
> amounts, and a list of products that have been bought over the years, at 
> certain prices for which he has an exhaustive record. The problem is: He 
> does not know what product was bought this or that year (don't ask). He
> does not want to find back the real story, but just write realistic 
> accounts, for which the sum of a subset of product prices will give the 
> exact yearly amount.
> 
> Here is a real example from his data:
> 
> # A list of 64 product prices
> products = 
>
c(30500,30500,30500,30500,42000,42000,42000,42000,42000,42000,42000,42000,42000,42000,71040,90900,76950,35100,71190,
>
53730,456000,70740,70740,533600,83800,59500,27465,28000,28000,28000,28000,28000,26140,49600,77000,123289,27000,27000,27000,
>
27000,27000,27000,80000,33000,33000,55000,77382,48048,51186,40000,35000,21716,63051,15025,15025,15025,15025,800000,1110000,
> 59700,25908,829350,1198000,1031655)
> 
> # Global amount
> amount = 4748652
> 
> Now he wants to find all subsets of the 'product' vector which sums
to
> 'amount'.
> 
> I wrote the following code, which is clearly not optimal:
> 
> # Create a matrix of subsets of size r among the integer set 1:n
> subsets <- function(n, r, v = 1:n) {
>   if(r <= 0) vector(mode(v), 0)
>   else if(r >= n) v[1:n]
>   else rbind(cbind(v[1], Recall(n-1, r-1, v[-1])),Recall(n-1, r, v[-1]))
> }
> 
> # Main function
> find.amount = function(amount,products) {
>  
>   if(sum(products)<amount) {
>     cat("There is no solution.\n")
>     return()
>   }
>  
>   l = length(products)
>   cat("\nThere are",l,"product prices\n\n")
>   names(products) = paste("Product",1:l,sep="")
>   products = sort(products)
> 
>   for(i in 2:l) {
> 
>     # If the sum of the i smallest prices is greater than amount, then
> stop
>     if(sum(products[1:i])>amount) break
> 
>     # Look for matching subsets only in the case when the sum of i 
> largest prices is greater than amount
>     if(sum(rev(products)[1:i])>=amount) {
>       # Generates all subsets of i indicies in 1:l
>       subs = subsets(l,i)
>       nl = nrow(subs)
>       nc = ncol(subs)
> 
>       # Compute sums of corresponding price subsets
>       sums = rowSums(matrix(products[subs],nl,nc))
> 
>       # Which ones match the global amount ?
>       w = which(sums == amount)
>       how.many = length(w)
>       if(how.many>0) {
>         cat("\n-->> There are",how.many,"solutions
with",nc,"products
> :\n")
>         for(j in 1:how.many) {
>           print(products[subs[w[j],]])
>         }
>       }
>       else cat("\n-->> There is no solution
with",nc,"products.\n")
>     }
>     else cat("\n-->> There is no solution
with",i,"products.\n")
>   }
> }
> 
> 
> Then I can use these functions on a smaller example:
> 
>  > find.amount(4,c(1,1,1,1,2,2))
> 
> and a number of matching subsets are provided. The problem is: This 
> approach creates a whole matrix of subsets of r integers among 1:n, 
> which rapidly gives huge matrices, and this is clearly not optimal for 
> the real data provided above.
> 
> Would anyone have a suggestion as to an alternative and more efficient 
> strategy?
> 
> Good luck,
> 
> Yvonnick Noel
> University of Brittany, Rennes 2
> France
> 
> 
  I believe this is the "knapsack problem".
RSiteSearch("knapsack") might help a little bit.

http://xkcd.com/287/

-- 
View this message in context:
http://www.nabble.com/The-%27subset-matching%27-challenge-tp26114520p26114758.html
Sent from the R help mailing list archive at Nabble.com.

Detlef Steuer

2009-Oct-29 15:53 UTC

head link

[R] The 'subset matching' challenge

If you look for "subset sum problem", you will find relevant 
information.

A starter:
http://en.wikipedia.org/wiki/Subset_sum_problem

Detlef

On Thu, 29 Oct 2009 15:47:22 +0100
Yvonnick No?l <yvonnick.noel at uhb.fr> wrote:
> Dear all,
> 
> The following problem just has been submitted to me by an accountant.
> 
> In his new job, he has to close some old accounts. He has yearly 
> amounts, and a list of products that have been bought over the years, at 
> certain prices for which he has an exhaustive record. The problem is: He 
> does not know what product was bought this or that year (don't ask). He
> does not want to find back the real story, but just write realistic 
> accounts, for which the sum of a subset of product prices will give the 
> exact yearly amount.
> 
> Here is a real example from his data:
> 
> # A list of 64 product prices
> products = 
>
c(30500,30500,30500,30500,42000,42000,42000,42000,42000,42000,42000,42000,42000,42000,71040,90900,76950,35100,71190,
>
53730,456000,70740,70740,533600,83800,59500,27465,28000,28000,28000,28000,28000,26140,49600,77000,123289,27000,27000,27000,
>
27000,27000,27000,80000,33000,33000,55000,77382,48048,51186,40000,35000,21716,63051,15025,15025,15025,15025,800000,1110000,
> 59700,25908,829350,1198000,1031655)
> 
> # Global amount
> amount = 4748652
> 
> Now he wants to find all subsets of the 'product' vector which sums
to
> 'amount'.
> 
> I wrote the following code, which is clearly not optimal:
> 
> # Create a matrix of subsets of size r among the integer set 1:n
> subsets <- function(n, r, v = 1:n) {
>   if(r <= 0) vector(mode(v), 0)
>   else if(r >= n) v[1:n]
>   else rbind(cbind(v[1], Recall(n-1, r-1, v[-1])),Recall(n-1, r, v[-1]))
> }
> 
> # Main function
> find.amount = function(amount,products) {
>  
>   if(sum(products)<amount) {
>     cat("There is no solution.\n")
>     return()
>   }
>  
>   l = length(products)
>   cat("\nThere are",l,"product prices\n\n")
>   names(products) = paste("Product",1:l,sep="")
>   products = sort(products)
> 
>   for(i in 2:l) {
> 
>     # If the sum of the i smallest prices is greater than amount, then stop
>     if(sum(products[1:i])>amount) break
> 
>     # Look for matching subsets only in the case when the sum of i 
> largest prices is greater than amount
>     if(sum(rev(products)[1:i])>=amount) {
>       # Generates all subsets of i indicies in 1:l
>       subs = subsets(l,i)
>       nl = nrow(subs)
>       nc = ncol(subs)
> 
>       # Compute sums of corresponding price subsets
>       sums = rowSums(matrix(products[subs],nl,nc))
> 
>       # Which ones match the global amount ?
>       w = which(sums == amount)
>       how.many = length(w)
>       if(how.many>0) {
>         cat("\n-->> There are",how.many,"solutions
with",nc,"products :\n")
>         for(j in 1:how.many) {
>           print(products[subs[w[j],]])
>         }
>       }
>       else cat("\n-->> There is no solution
with",nc,"products.\n")
>     }
>     else cat("\n-->> There is no solution
with",i,"products.\n")
>   }
> }
> 
> 
> Then I can use these functions on a smaller example:
> 
>  > find.amount(4,c(1,1,1,1,2,2))
> 
> and a number of matching subsets are provided. The problem is: This 
> approach creates a whole matrix of subsets of r integers among 1:n, 
> which rapidly gives huge matrices, and this is clearly not optimal for 
> the real data provided above.
> 
> Would anyone have a suggestion as to an alternative and more efficient 
> strategy?
> 
> Good luck,
> 
> Yvonnick Noel
> University of Brittany, Rennes 2
> France
> 
> ______________________________________________
> R-help at r-project.org mailing list
> https://stat.ethz.ch/mailman/listinfo/r-help
> PLEASE do read the posting guide
http://www.R-project.org/posting-guide.html
> and provide commented, minimal, self-contained, reproducible code.

Possibly Parallel Threads

Search for more possibly parallel threads

R help - Oct 2009 - The 'subset matching' challenge

[R] The 'subset matching' challenge

[R] The 'subset matching' challenge

[R] The 'subset matching' challenge

Possibly Parallel Threads