Dan Nordlund wrote: "It would be necessary to see the code for your 'brief test' before anyone could meaningfully comment on your results. But your results for a single test could have been a valid 'random' result."

I've re-created what I did below. The problem appears to be with the weighting process: the unweighted sample mean came out much closer to its expected value than the weighted sample mean did (off by more than 1%). Comments?
Jim

> x
 [1]  1  2  3  4  5  6  7  8  9 10 11 12
> weights
 [1] 1 1 1 1 1 1 2 2 2 2 2 2

> a = mean(replicate(1000000, sample(x, 3, prob = weights))); a
# (1 million samples from x, of size 3, weighted by "weights"; the mean should be 7.50)
[1] 7.406977
> 7.406977/7.5
[1] 0.987597

> b = mean(replicate(1000000, sample(x, 3))); b
# (1 million samples from x, of size 3, not weighted this time; the mean should be 6.50)
[1] 6.501477
> 6.501477/6.5
[1] 1.000227

Jim Bouldin, PhD
Research Ecologist
Department of Plant Sciences, UC Davis
Davis CA, 95616
530-554-1740
Try adding replace=TRUE to your call to sample; then you will get numbers closer to what you are expecting.

--
Gregory (Greg) L. Snow Ph.D.
Statistical Data Center
Intermountain Healthcare
greg.snow at imail.org
801.408.8111
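A minimal sketch of that fix, reusing the x and weights objects from the original post (with fewer replicates, just to keep it quick):

x <- 1:12
weights <- rep(c(1, 2), each = 6)

# With replace = TRUE every draw uses the full weight vector, so the expected
# value of each sampled element is sum(x * weights) / sum(weights) = 7.5
set.seed(1)
mean(replicate(100000, sample(x, 3, replace = TRUE, prob = weights)))
# should come out close to 7.5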
Thanks Greg, that most definitely was it. So apparently the default is sampling without replacement. Fine, but this brings up a question I've had for a while now: how do you know what the default settings are for the arguments of any given function? The HTML help files don't seem to indicate them in many (most) cases. Thanks.

Jim
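For what it's worth, the defaults for sample() do appear in the Usage line of ?sample, and they can also be queried directly from the console, for example:

args(sample)
# function (x, size, replace = FALSE, prob = NULL)

formals(sample)$replace
# [1] FALSE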
To obtain the result you expected, you would need to specify "replace=TRUE" explicitly, since the default for "replace" is FALSE (though probably what you really intended was sampling without replacement).

Read carefully what is said about "prob" in '?sample' -- when replace=FALSE, the probability of inclusion of an element is not proportional to its weight in 'prob'.

The reason is that elements with higher weights are more likely to be chosen early on. This then knocks that higher weight out of the contest, making it more likely that elements with smaller weights will be sampled subsequently. Hence the mean of the sample will be biased slightly downwards, relative to the weighted mean of the values in x.

table(replicate(1000000, sample(x, 3)))
#      1      2      3      4      5      6
# 250235 250743 249603 250561 249828 249777
#      7      8      9     10     11     12
# 249780 250478 249591 249182 249625 250597

(so all nice equal frequencies)

table(replicate(1000000, sample(x, 3, prob = weights)))
#      1      2      3      4      5      6
# 174873 175398 174196 174445 173240 174110
#      7      8      9     10     11     12
# 325820 326140 325289 325098 325475 325916

Note that the frequencies of the values with weight=2 are a bit less than twice the frequencies of the values with weight=1:

(325820+326140+325289+325098+325475+325916) /
(174873+175398+174196+174445+173240+174110)
# [1] 1.867351

In fact this ratio is fairly easily calculated. The possible combinations (in order of sampling) of the two weights, with their probabilities and the resulting counts of weight-1 and weight-2 items, are:

  sequence   probability                   #1s  #2s
  ---------------------------------------------------
  1 1 1      P =  6/18 *  5/17 *  4/16      3    0
  1 1 2      P =  6/18 *  5/17 * 12/16      2    1
  1 2 1      P =  6/18 * 12/17 *  5/15      2    1
  1 2 2      P =  6/18 * 12/17 * 10/15      1    2
  2 1 1      P = 12/18 *  6/16 *  5/15      2    1
  2 1 2      P = 12/18 *  6/16 * 10/15      1    2
  2 2 1      P = 12/18 * 10/16 *  6/14      1    2
  2 2 2      P = 12/18 * 10/16 *  8/14      0    3

So the expected number of weight-1 items in the sample is

  3*(6/18 *  5/17 *  4/16) + 2*(6/18 *  5/17 * 12/16) +
  2*(6/18 * 12/17 *  5/15) + 1*(6/18 * 12/17 * 10/15) +
  2*(12/18 * 6/16 *  5/15) + 1*(12/18 * 6/16 * 10/15) +
  1*(12/18 * 10/16 * 6/14) + 0
  = 1.046218

Hence the expected number of weight-2 items in the sample is

  3 - 1.046218 = 1.953782

and their ratio is 1.953782/1.046218 = 1.867471.

Compare this with the value 1.867351 (above) obtained by simulation!

Ted.
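Ted's hand calculation can also be reproduced by brute force, enumerating every ordered draw of three distinct items and accumulating its probability under the sequential weighted scheme (a sketch, with the weights vector rebuilt here as w):

x <- 1:12
w <- rep(c(1, 2), each = 6)

# All ordered triples of distinct indices
idx <- expand.grid(i = 1:12, j = 1:12, k = 1:12)
idx <- idx[idx$i != idx$j & idx$i != idx$k & idx$j != idx$k, ]

# Probability of drawing item i, then j, then k under sequential weighting
p <- w[idx$i] / sum(w) *
     w[idx$j] / (sum(w) - w[idx$i]) *
     w[idx$k] / (sum(w) - w[idx$i] - w[idx$j])

n1 <- (idx$i <= 6) + (idx$j <= 6) + (idx$k <= 6)   # weight-1 items in the draw

sum(p)                             # 1 (sanity check)
sum(p * n1)                        # 1.046218, the expected number of weight-1 items
(3 - sum(p * n1)) / sum(p * n1)    # 1.867471, as in the hand calculation

# The same probabilities give the exact expectation of the weighted sample mean:
sum(p * (x[idx$i] + x[idx$j] + x[idx$k]) / 3)   # about 7.4076, close to the simulated 7.406977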
You are absolutely correct Ted. When no weights are applied it doesn't matter whether you sample with or without replacement: every value is equally likely to be chosen, so the expected sample mean is the same either way. But when the values are weighted unequally, that is no longer the case.

It is also interesting to note that if the problem is set up slightly differently, say by defining the vector x2 as
x2 = c(1,2,3,4,5,6,7,7,8,8,9,9,10,10,11,11,12,12),
which effectively gives the 12 integers the same relative selection weights as before, the same problem does not arise, or at least not as severely:

> x2
 [1]  1  2  3  4  5  6  7  8  9 10 11 12  7  8  9 10 11 12

> d = mean(replicate(1000000, sample(x2, 3))); d
# (1 million samples from x2, of size 3; the mean should be 7.50)
[1] 7.499233

> e = mean(replicate(1000000, sample(x2, 3, replace = TRUE))); e
# (1 million samples from x2, of size 3, with replacement this time; the mean should still be 7.50)
[1] 7.502085

> d/e
[1] 0.9996198

Jim
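The x2 result can also be checked exactly rather than by simulation: with no weights, every 3-element subset of the 18 slots is equally likely under sampling without replacement, so averaging over all choose(18, 3) subsets gives the expectation of the sample mean directly (a small sketch):

x2 <- c(1, 2, 3, 4, 5, 6, 7, 7, 8, 8, 9, 9, 10, 10, 11, 11, 12, 12)

# Mean of the sample mean over all 816 equally likely 3-element subsets
mean(combn(x2, 3, FUN = mean))   # exactly 7.5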
Perfectly explained Ted. One might, at first reflection, think that simply repeating the values 7 through 12 and sampling (without replacement) from among the 18 resulting values would be equivalent to doubling the selection weights for 7 through 12 and then sampling. That is clearly not true, though.

Jim

> Whereas, if you replace x = c(1,2,3,4,5,6,7,8,9,10,11,12)
> with the "weighted" equivalent, doubling up 7-12 as in your
> x2 = c(1,2,3,4,5,6,7,7,8,8,9,9,10,10,11,11,12,12),
> each of the 18 items now has the same weight as the others,
> and the "unweighted" sampling
>   mean(replicate(1000000, sample(x2, 3)))
> now gives the mean of the 18 values (7.5); whereas -- as my
> calculation showed -- the effect of the "sequential" weighting is
> to bias the result slightly downwards (in your example; generally,
> in favour of the items with lower weights), since the way weighting
> works in sample() is not equivalent to replicating each item "weight"
> times.
>
> The general problem of sampling without replacement in such a way
> that, for each item, the probability that it is included in the sample
> is proportional to a pre-assigned weight ("sampling with probability
> proportional to size") is quite tricky and, for certain choices
> of weights, impossible. For a glimpse of what's inside the can of
> worms, have a look at the reference manual for the 'sampfling'
> package, in particular the function samprop():
>
> http://www.stats.bris.ac.uk/R/web/packages/sampfling/sampfling.pdf
>
> Ted.
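A quick numerical illustration of that distinction, comparing how often the heavier items (7 through 12) turn up under the two schemes (a rough sketch, with fewer replicates than the runs above):

x  <- 1:12
w  <- rep(c(1, 2), each = 6)
x2 <- c(1:12, 7:12)
set.seed(42)

# Ratio of how often values 7-12 are drawn relative to values 1-6
heavy_vs_light <- function(tab)
    sum(tab[as.character(7:12)]) / sum(tab[as.character(1:6)])

# Duplicating the items gives the heavier values exactly twice the
# inclusion frequency of the lighter ones...
heavy_vs_light(table(replicate(200000, sample(x2, 3))))          # about 2.00

# ...whereas prob-weighting without replacement falls short of 2:1
heavy_vs_light(table(replicate(200000, sample(x, 3, prob = w)))) # about 1.87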