thr3ads.net - R help - [R] Drawing a sample based on certain condition [Apr 2025]

If this information is useful, please help other people find it:
Share via:

Jeff Newmiller

2025-Apr-15 07:15 UTC

[R] Drawing a sample based on certain condition

Tim, do you know what you are talking about? If so, please do share sample code
so we can follow along. If you pick nine uniformly distributed variables (the
tenth is constrained) and reject sets that do not add to the desired sum then
only (1/10)^9 or so of the draws will not be rejected. This is a vast amount of
computation for a tiny fraction of valid draws. If you progressively pick from
yet-unallocated values that don't exceed the total then the distributions of
the variables will be very small for later variables considered (unequal
distributions). There may indeed be slick algorithms for surmounting these
problems but I cannot understand if you are outlining such a solution from your
description.

On April 14, 2025 2:49:57 PM PDT, "Ebert,Timothy Aaron" <tebert at
ufl.edu> wrote:>
>1) select only denied individuals from the original data. This is S.
>2) There is a fixed sample size of exactly t
>3) There is a fixed target sum T such that sum(t values from S) = T
>
>You can reduce the problem. All large values where the max(S) + (t-1
smallest values) > T can be eliminated from S. Likewise if (t-1 max S) +
min(S) < T then the smallest values can be eliminated.
>
>You can estimate the number of options by asking how many ways can I draw t
objects from the size of S with replacement. This is the size of S raised to the
t power. If Joe has a score of 6 and Kim has a score of 6 these are different
outcomes. If not, then S can be reduced to the number of unique values.
>
>Could you randomly sample S, filter for the sum=T and then use the random
sample? Do you need every case when there are millions of cases (or more)?
>
>Is a long execution time important, or what do you consider a long execution
time? Are you planning on looking at all t and T within a range?
>
>Tim
>
>-----Original Message-----
>From: R-help <r-help-bounces at r-project.org> On Behalf Of Bert
Gunter
>Sent: Monday, April 14, 2025 4:16 PM
>To: Duncan Murdoch <murdoch.duncan at gmail.com>
>Cc: r-help at r-project.org
>Subject: Re: [R] Drawing a sample based on certain condition
>
>[External Email]
>
>Just a comment:
>
>You wish to draw subsets of size n with or without replacement -- and I
suspect without replacement is simpler than with -- from a set of positive
integers that sum to a fixed value T. This sounds related to the so-called
subset sum problem in computational complexity theory: Given a set S of positive
integers and positive total, T, is there *any* subset of S (i.e. a sample
without replacement) whose sum is T? This is known to be NP-complete. So this
means that listing all such subsets must be NP-complete. I don't know
whether specifying that, as you have, the size of the subsets must be a fixed
number (obviously?) simplifies the problem sufficiently to make it
computationally tractable; computational wizards or suitable web searches might
tell you this. But for these reasons,  your query about how to sample the set of
all such subsets -- both those drawn with or without replacement -- seems
computationally difficult to me.
>
>
>Cheers,
>Bert
>
>
>
>"An educated person is one who can entertain new ideas, entertain
others, and entertain herself."
>
>
>
>On Mon, Apr 14, 2025 at 8:37?AM Duncan Murdoch <murdoch.duncan at
gmail.com>
>wrote:
>
>> On 2025-04-14 7:26 a.m., Brian Smith wrote:
>> > Hi,
>> >
>> > For my analytical work, I need to draw a sample of certain sample
>> > size from a denied population, where population members are marked
>> > by non-negative integers, such that sum of sample members if
fixed.
>> > For example,
>> >
>> > Population = 0:100
>> > Sample_size = 10
>> > Sample_Sum = 20
>> >
>> > Under this setup if my sample members are X1, X2, ..., X10 then I
>> > should have X1+X2+...+X10 = 20
>> >
>> > Sample drawing scheme may be with/without replacement
>> >
>> > Is there any R function to achieve this? One possibility is to
>> > employ naive trial-error approach, but this doesnt seem to be
>> > practical as it would take long time to get the final sample with
desired properties.
>> >
>> > Any pointer would be greatly appreciated.
>>
>> One general way to think of this problem is that you are defining a
>> distribution on the space of all possible samples of size 10, such
>> that the probability of a sample is X if the sum is 20, and zero
>> otherwise, and you want to sample from this distribution.
>>
>> There's probably a slick method to do that for your example, but if
>> you've got a general population instead of that special one, I
doubt it.
>>
>> What I would do is the following:
>>
>> Define another distribution on samples that has probabilities that
>> depend on the sum of the sample, with the highest probabilities
>> attached to ones with the correct sum, and probabilities for other
>> sums declining with distance from the sum.  For example, maybe
>>
>>   P(sum) = Y/(1 + abs(sum - 20))
>>
>> for some constant Y.
>>
>> You can use MCMC to sample from that distribution and then only keep
>> the samples where the sum is exactly equal to the target sum.  If you
>> do that, you don't need to care about the value of Y. but you do
need
>> to think about how proposed moves are made, and you probably need to
>> use a different function than the example above for acceptable
efficiency.
>>
>> Duncan Murdoch
>>
>> ______________________________________________
>> R-help at r-project.org mailing list -- To UNSUBSCRIBE and more, see
>> https://stat/
>>
.ethz.ch%2Fmailman%2Flistinfo%2Fr-help&data=05%7C02%7Ctebert%40ufl.edu
>> %7C722a00e77d1343db507d08dd7b914147%7C0d4da0f84a314d76ace60a62331e1b84
>> %7C0%7C0%7C638802585985433701%7CUnknown%7CTWFpbGZsb3d8eyJFbXB0eU1hcGki
>> OnRydWUsIlYiOiIwLjAuMDAwMCIsIlAiOiJXaW4zMiIsIkFOIjoiTWFpbCIsIldUIjoyfQ
>>
%3D%3D%7C0%7C%7C%7C&sdata=KNgaZ7e92Zfs51%2F%2B%2B9%2BQkGI5hiSj9OzhDtFz
>> iWXx2D0%3D&reserved=0
>> PLEASE do read the posting guide
>> https://www/.
>>
r-project.org%2Fposting-guide.html&data=05%7C02%7Ctebert%40ufl.edu%7C7
>> 22a00e77d1343db507d08dd7b914147%7C0d4da0f84a314d76ace60a62331e1b84%7C0
>> %7C0%7C638802585985456014%7CUnknown%7CTWFpbGZsb3d8eyJFbXB0eU1hcGkiOnRy
>> dWUsIlYiOiIwLjAuMDAwMCIsIlAiOiJXaW4zMiIsIkFOIjoiTWFpbCIsIldUIjoyfQ%3D%
>>
3D%7C0%7C%7C%7C&sdata=Vhraq88ySxlq4IUEqvoeRvNdhjvliYdz7VDDvmQzmkE%3D&r
>> eserved=0 and provide commented, minimal, self-contained, reproducible
>> code.
>>
>
>        [[alternative HTML version deleted]]
>
>______________________________________________
>R-help at r-project.org mailing list -- To UNSUBSCRIBE and more, see
>https://stat.ethz.ch/mailman/listinfo/r-help
>PLEASE do read the posting guide
https://www.r-project.org/posting-guide.html
>and provide commented, minimal, self-contained, reproducible code.
>______________________________________________
>R-help at r-project.org mailing list -- To UNSUBSCRIBE and more, see
>https://stat.ethz.ch/mailman/listinfo/r-help
>PLEASE do read the posting guide
https://www.R-project.org/posting-guide.html
>and provide commented, minimal, self-contained, reproducible code.
-- 
Sent from my phone. Please excuse my brevity.

Ebert,Timothy Aaron

2025-Apr-15 21:09 UTC

head link

[R] Drawing a sample based on certain condition

It is not clear if the problem is a generic example or if the requester is only
interested in the case where sample size is ten and the sum of the sample is 20.
If the true population size is 100 (as in the example) then problem is trivial.
However, there was described a "denied population, where population members
are marked by non-negative integers" that is missing from population of
100. My assumption is that this is a dummy example. The inefficiency in looking
at all possible combinations to only find a few successes is spending hours
programming to save yourself a few milliseconds in one run of the program. Take
the short simple (but inefficient) path as in the first program. It will work.
If there is more data, or if the example was not exactly like the real problem,
then the last program may help.

1) Poster stated that he wants to look at the "denied" subpopulation
within his data. Start by extracting these as the rest of the data are not
relevant.
2) Poster stated that the denied population consisted of non-negative integers
(values >= 0). The distribution of these integers was not specified.
3) The smaller the problem the more likely it can be solved. Since the sample
size seems fixed, and the target sum is probably fixed the only other approach
is to reduce the quantity of data to process. The smaller the set of data, the
fewer resources it will take to get an answer.
        A) Unless it is important to know the ratio of accepted to rejected
samples, discard all values that cannot produce a valid result. If the target
sum is 20, reject all values in the data that are greater than this.
        B) Look for other opportunities to reject data. The case in 3A is just
an obvious example.
        C) How large is your population of values that have the potential to
result in a valid set?
4) Are there other options? We do not know how this will be used?
        A) can you use a subsample of all possible results?

set.seed(2257)
values <- rnorm(10000, 0, 1000)  # My population
values <- values[values >= 0]    # Only non-negative values
values <- values[values < 16]    # Only values that might possibly work
values <- floor(values)          # Convert to integers
k <- 5            # Number of values to sum
target <- 15      # Desired total

combos <- combn(values, k)
valid_combos <- combos[, colSums(combos) == target]
print(valid_combos)

# From 100,000 values one run had 66 cases that could satisfy conditions.
# These 61 candidates generated a matrix using 257.5 MB.
# Of these there were 36985 valid combinations.
# This is without replacement.

In contrast this version took 4 minutes to run, and a few tries at increasing k
resulted in exceeding memory.
set.seed(2257)
k <- 10            # Number of values to sum
target <- 8      # Desired total
values <- rnorm(10000, 0, 1000)  # My population
values <- values[values >= 0]    # Only non-negative values
values <- values[values < target+1]    # Only values that might possibly
work
values <- floor(values)          # Convert to integers

combos <- combn(values, k)
valid_combos <- combos[, colSums(combos) == target]
print(valid_combos)


Here is a recursive approach that includes testing for cases where the
combination cannot possibly result in a valid answer. It is much faster.
However, increasing the size of values much beyond this level results in a
program that took over 2 hours without finishing. Clear your environment between
runs. This version returns 16,445 valid combinations.

set.seed(2257)
k <- 10            # Number of values to sum
target <- 20      # Desired total
values <- floor(rnorm(50000, mean = 0, sd = 1000))
values <- values[!is.na(values) & values >= 0 & values <=
target]
#Data from this source cannot have NA, but the real data might

find_combos <- function(values, k, target, start = 1, current = integer(0),
current_sum = 0, results = list()) {
  if (length(values) == 0 || any(is.na(values))) return(results)

  # Base case
  if (length(current) == k) {
    if (current_sum == target) {
      results[[length(results) + 1]] <- current
    }
    return(results)
  }

  n <- length(values)
  for (i in start:n) {
    v <- values[i]
    new_sum <- current_sum + v
    remaining_slots <- k - length(current) - 1

    if (is.na(new_sum)) next  # Skip if v or current_sum is NA

    if (new_sum > target) break
    if (remaining_slots > 0) {
      if (new_sum + remaining_slots * min(values) > target) next
      if (new_sum + remaining_slots * max(values) < target) next
    }

    results <- find_combos(values, k, target, i + 1, c(current, v), new_sum,
results)
  }

  return(results)
}

valid_combos <- find_combos(values, k, target)
print(valid_combos)

This could be refined further, but I have no idea if this is even close to what
was requested. I also have no idea if the values I have chosen are realistic.
Likely the variance I used in rnorm() is entirely wrong. However, this should
give a better idea of the sorts of things to consider when attempting what at
first looks like an insurmountable problem. You can try jumping over the
mountain but also consider reducing the size of the mountain to make the jump
easier. Also consider redefining the mountain. Do not jump the entire mountain
if you only need a small part. You could add another test such that the program
stops once it has found "this many" valid solutions.

I am terrible at recursive programming. I understand how it works, but my brain
does not work that way natively. The code is from ChatGPT, but a few simple
cases were checked using the first code. ChatGPT made a few errors in initial
tries. There could be more constraints that were not considered. I only worked
with the vector. If the vector is part of a data frame where I need to know that
"this zero" came from Bob, and the other zero came from Samantha, then
the code will not work as written. My solution would be to ask ChatGPT more
questions and keep checking to see that each improvement gave a correct answer.
ChatGPT is more likely to give junk answers as program complexity increases, or
maybe I have more difficulty in asking the right questions as complexity
increases.

Tim


-----Original Message-----
From: Jeff Newmiller <jdnewmil at dcn.davis.ca.us>
Sent: Tuesday, April 15, 2025 3:16 AM
To: r-help at r-project.org; Ebert,Timothy Aaron <tebert at ufl.edu>; Bert
Gunter <bgunter.4567 at gmail.com>; Duncan Murdoch <murdoch.duncan at
gmail.com>
Cc: r-help at r-project.org
Subject: Re: [R] Drawing a sample based on certain condition

[External Email]

Tim, do you know what you are talking about? If so, please do share sample code
so we can follow along. If you pick nine uniformly distributed variables (the
tenth is constrained) and reject sets that do not add to the desired sum then
only (1/10)^9 or so of the draws will not be rejected. This is a vast amount of
computation for a tiny fraction of valid draws. If you progressively pick from
yet-unallocated values that don't exceed the total then the distributions of
the variables will be very small for later variables considered (unequal
distributions). There may indeed be slick algorithms for surmounting these
problems but I cannot understand if you are outlining such a solution from your
description.

On April 14, 2025 2:49:57 PM PDT, "Ebert,Timothy Aaron" <tebert at
ufl.edu> wrote:>
>1) select only denied individuals from the original data. This is S.
>2) There is a fixed sample size of exactly t
>3) There is a fixed target sum T such that sum(t values from S) = T
>
>You can reduce the problem. All large values where the max(S) + (t-1
smallest values) > T can be eliminated from S. Likewise if (t-1 max S) +
min(S) < T then the smallest values can be eliminated.
>
>You can estimate the number of options by asking how many ways can I draw t
objects from the size of S with replacement. This is the size of S raised to the
t power. If Joe has a score of 6 and Kim has a score of 6 these are different
outcomes. If not, then S can be reduced to the number of unique values.
>
>Could you randomly sample S, filter for the sum=T and then use the random
sample? Do you need every case when there are millions of cases (or more)?
>
>Is a long execution time important, or what do you consider a long execution
time? Are you planning on looking at all t and T within a range?
>
>Tim
>
>-----Original Message-----
>From: R-help <r-help-bounces at r-project.org> On Behalf Of Bert
Gunter
>Sent: Monday, April 14, 2025 4:16 PM
>To: Duncan Murdoch <murdoch.duncan at gmail.com>
>Cc: r-help at r-project.org
>Subject: Re: [R] Drawing a sample based on certain condition
>
>[External Email]
>
>Just a comment:
>
>You wish to draw subsets of size n with or without replacement -- and I
suspect without replacement is simpler than with -- from a set of positive
integers that sum to a fixed value T. This sounds related to the so-called
subset sum problem in computational complexity theory: Given a set S of positive
integers and positive total, T, is there *any* subset of S (i.e. a sample
without replacement) whose sum is T? This is known to be NP-complete. So this
means that listing all such subsets must be NP-complete. I don't know
whether specifying that, as you have, the size of the subsets must be a fixed
number (obviously?) simplifies the problem sufficiently to make it
computationally tractable; computational wizards or suitable web searches might
tell you this. But for these reasons,  your query about how to sample the set of
all such subsets -- both those drawn with or without replacement -- seems
computationally difficult to me.
>
>
>Cheers,
>Bert
>
>
>
>"An educated person is one who can entertain new ideas, entertain
others, and entertain herself."
>
>
>
>On Mon, Apr 14, 2025 at 8:37?AM Duncan Murdoch
><murdoch.duncan at gmail.com>
>wrote:
>
>> On 2025-04-14 7:26 a.m., Brian Smith wrote:
>> > Hi,
>> >
>> > For my analytical work, I need to draw a sample of certain sample
>> > size from a denied population, where population members are marked
>> > by non-negative integers, such that sum of sample members if
fixed.
>> > For example,
>> >
>> > Population = 0:100
>> > Sample_size = 10
>> > Sample_Sum = 20
>> >
>> > Under this setup if my sample members are X1, X2, ..., X10 then I
>> > should have X1+X2+...+X10 = 20
>> >
>> > Sample drawing scheme may be with/without replacement
>> >
>> > Is there any R function to achieve this? One possibility is to
>> > employ naive trial-error approach, but this doesnt seem to be
>> > practical as it would take long time to get the final sample with
desired properties.
>> >
>> > Any pointer would be greatly appreciated.
>>
>> One general way to think of this problem is that you are defining a
>> distribution on the space of all possible samples of size 10, such
>> that the probability of a sample is X if the sum is 20, and zero
>> otherwise, and you want to sample from this distribution.
>>
>> There's probably a slick method to do that for your example, but if
>> you've got a general population instead of that special one, I
doubt it.
>>
>> What I would do is the following:
>>
>> Define another distribution on samples that has probabilities that
>> depend on the sum of the sample, with the highest probabilities
>> attached to ones with the correct sum, and probabilities for other
>> sums declining with distance from the sum.  For example, maybe
>>
>>   P(sum) = Y/(1 + abs(sum - 20))
>>
>> for some constant Y.
>>
>> You can use MCMC to sample from that distribution and then only keep
>> the samples where the sum is exactly equal to the target sum.  If you
>> do that, you don't need to care about the value of Y. but you do
need
>> to think about how proposed moves are made, and you probably need to
>> use a different function than the example above for acceptable
efficiency.
>>
>> Duncan Murdoch
>>
>> ______________________________________________
>> R-help at r-project.org mailing list -- To UNSUBSCRIBE and more, see
>> https://stat/
>>
.ethz.ch%2Fmailman%2Flistinfo%2Fr-help&data=05%7C02%7Ctebert%40ufl.ed
>> u
>> %7C722a00e77d1343db507d08dd7b914147%7C0d4da0f84a314d76ace60a62331e1b8
>> 4
>> %7C0%7C0%7C638802585985433701%7CUnknown%7CTWFpbGZsb3d8eyJFbXB0eU1hcGk
>> i
>> OnRydWUsIlYiOiIwLjAuMDAwMCIsIlAiOiJXaW4zMiIsIkFOIjoiTWFpbCIsIldUIjoyf
>> Q
>>
%3D%3D%7C0%7C%7C%7C&sdata=KNgaZ7e92Zfs51%2F%2B%2B9%2BQkGI5hiSj9OzhDtF
>> z
>> iWXx2D0%3D&reserved=0
>> PLEASE do read the posting guide
>> https://www/.
>>
r-project.org%2Fposting-guide.html&data=05%7C02%7Ctebert%40ufl.edu%7C
>> 7
>> 22a00e77d1343db507d08dd7b914147%7C0d4da0f84a314d76ace60a62331e1b84%7C
>> 0
>> %7C0%7C638802585985456014%7CUnknown%7CTWFpbGZsb3d8eyJFbXB0eU1hcGkiOnR
>> y
>> dWUsIlYiOiIwLjAuMDAwMCIsIlAiOiJXaW4zMiIsIkFOIjoiTWFpbCIsIldUIjoyfQ%3D
>> %
>>
3D%7C0%7C%7C%7C&sdata=Vhraq88ySxlq4IUEqvoeRvNdhjvliYdz7VDDvmQzmkE%3D&
>> r
>> eserved=0 and provide commented, minimal, self-contained,
>> reproducible code.
>>
>
>        [[alternative HTML version deleted]]
>
>______________________________________________
>R-help at r-project.org mailing list -- To UNSUBSCRIBE and more, see
>https://stat/.
>ethz.ch%2Fmailman%2Flistinfo%2Fr-help&data=05%7C02%7Ctebert%40ufl.edu%7
>Cc623a7b217674da8b1ee08dd7bed5bfe%7C0d4da0f84a314d76ace60a62331e1b84%7C
>0%7C0%7C638802981570144984%7CUnknown%7CTWFpbGZsb3d8eyJFbXB0eU1hcGkiOnRy
>dWUsIlYiOiIwLjAuMDAwMCIsIlAiOiJXaW4zMiIsIkFOIjoiTWFpbCIsIldUIjoyfQ%3D%3
>D%7C0%7C%7C%7C&sdata=FESBnk%2Bs0euSW8a9WXBB4pW7W3ChgGGzujmZoD23zDE%3D&r
>eserved=0 PLEASE do read the posting guide
>https://www.r/
>-project.org%2Fposting-guide.html&data=05%7C02%7Ctebert%40ufl.edu%7Cc62
>3a7b217674da8b1ee08dd7bed5bfe%7C0d4da0f84a314d76ace60a62331e1b84%7C0%7C
>0%7C638802981570159758%7CUnknown%7CTWFpbGZsb3d8eyJFbXB0eU1hcGkiOnRydWUs
>IlYiOiIwLjAuMDAwMCIsIlAiOiJXaW4zMiIsIkFOIjoiTWFpbCIsIldUIjoyfQ%3D%3D%7C
>0%7C%7C%7C&sdata=u32xfqbftfIm%2FzZxlH9NvD1SXw6lRZulE0FoPBUJOvA%3D&reser
>ved=0 and provide commented, minimal, self-contained, reproducible
>code.
>______________________________________________
>R-help at r-project.org mailing list -- To UNSUBSCRIBE and more, see
>https://stat/.
>ethz.ch%2Fmailman%2Flistinfo%2Fr-help&data=05%7C02%7Ctebert%40ufl.edu%7
>Cc623a7b217674da8b1ee08dd7bed5bfe%7C0d4da0f84a314d76ace60a62331e1b84%7C
>0%7C0%7C638802981570168291%7CUnknown%7CTWFpbGZsb3d8eyJFbXB0eU1hcGkiOnRy
>dWUsIlYiOiIwLjAuMDAwMCIsIlAiOiJXaW4zMiIsIkFOIjoiTWFpbCIsIldUIjoyfQ%3D%3
>D%7C0%7C%7C%7C&sdata=TOtcwGxtU%2ByIYZ1GEjneNEAUVqudwX9%2BA6FU3AkYPh4%3D
>&reserved=0 PLEASE do read the posting guide
>https://www.r/
>-project.org%2Fposting-guide.html&data=05%7C02%7Ctebert%40ufl.edu%7Cc62
>3a7b217674da8b1ee08dd7bed5bfe%7C0d4da0f84a314d76ace60a62331e1b84%7C0%7C
>0%7C638802981570176634%7CUnknown%7CTWFpbGZsb3d8eyJFbXB0eU1hcGkiOnRydWUs
>IlYiOiIwLjAuMDAwMCIsIlAiOiJXaW4zMiIsIkFOIjoiTWFpbCIsIldUIjoyfQ%3D%3D%7C
>0%7C%7C%7C&sdata=gPSpdirwgJOb0AoECt3Vd0DGdPLCq9dZcXoc8lYbxB4%3D&reserve
>d=0 and provide commented, minimal, self-contained, reproducible code.
--
Sent from my phone. Please excuse my brevity.

R help - Apr 2025 - Drawing a sample based on certain condition

[R] Drawing a sample based on certain condition

[R] Drawing a sample based on certain condition