Issac Trotts
2007-Sep-04 03:48 UTC
[R] Efficient sampling from a discrete distribution in R
Hello r-help, As far as I've seen, there is no function in R dedicated to sampling from a discrete distribution with a specified mass function. The standard library doesn't come with anything called rdiscrete or rpmf, and I can't find any such thing on the cheat sheet or in the Probability Distributions chapter of _An Introduction to R_. Googling also didn't bring back anything. So, here's my first attempt at a solution. I'm hoping someone here knows of a more efficient way. # Sample from a discrete distribution with given probability mass function rdiscrete = function(size, pmf) { stopifnot(length(pmf) > 1) cmf = cumsum(pmf) icmf = function(p) { min(which(p < cmf)) } ps = runif(size) sapply(ps, icmf) } test.rdiscrete = function(N = 10000) { err.tol = 6.0 / sqrt(N) xs = rdiscrete(N, c(0.5, 0.5)) err = abs(sum(xs == 1) / N - 0.5) stopifnot(err < err.tol) list(e = err, xs = xs) } Thanks, Issac
Prof Brian Ripley
2007-Sep-04 05:18 UTC
[R] Efficient sampling from a discrete distribution in R
On Mon, 3 Sep 2007, Issac Trotts wrote:> Hello r-help, > > As far as I've seen, there is no function in R dedicated to sampling > from a discrete distribution with a specified mass function. The > standard library doesn't come with anything called rdiscrete or rpmf, > and I can't find any such thing on the cheat sheet or in the > Probability Distributions chapter of _An Introduction to R_. Googling > also didn't bring back anything. So, here's my first attempt at a > solution. I'm hoping someone here knows of a more efficient way.It's called sample(). There are much more efficient algorithms than the one you used, and sample() sometimes uses one of them (Walker's alias method): see any good book on simulation (including my 'Stochastic Simulation, 1987).> # Sample from a discrete distribution with given probability mass function > rdiscrete = function(size, pmf) { > stopifnot(length(pmf) > 1) > cmf = cumsum(pmf) > icmf = function(p) { > min(which(p < cmf)) > } > ps = runif(size) > sapply(ps, icmf) > } > > test.rdiscrete = function(N = 10000) { > err.tol = 6.0 / sqrt(N) > xs = rdiscrete(N, c(0.5, 0.5)) > err = abs(sum(xs == 1) / N - 0.5) > stopifnot(err < err.tol) > list(e = err, xs = xs) > } > > Thanks, > Issac > > ______________________________________________ > R-help at stat.math.ethz.ch mailing list > https://stat.ethz.ch/mailman/listinfo/r-help > PLEASE do read the posting guide http://www.R-project.org/posting-guide.html > and provide commented, minimal, self-contained, reproducible code. >-- Brian D. Ripley, ripley at stats.ox.ac.uk Professor of Applied Statistics, http://www.stats.ox.ac.uk/~ripley/ University of Oxford, Tel: +44 1865 272861 (self) 1 South Parks Road, +44 1865 272866 (PA) Oxford OX1 3TG, UK Fax: +44 1865 272595
Issac Trotts
2007-Sep-04 07:40 UTC
[R] Efficient sampling from a discrete distribution in R
On 9/3/07, Prof Brian Ripley <ripley at stats.ox.ac.uk> wrote:> Yes, vectorize, and don't grow vectors. See 'S Programming' (details in > the R FAQ).It's easy to avoid growing the samples vector but it doesn't seem to make much difference. Vectorizing would be probably help, but in this algorithm each step depends on the one before it. Do you know of an alternative to the Chinese restaurant process that is vectorizable?> We've no idea about you (the R posting guide does ask for a signature > block), but if this is homework some people are going to be annoyed.No, it's for my job at Google, so maybe I should use my work email address to avoid confusion. Sorry if the questions are so basic that they sound like homework, but you and others on the list know things that aren't easy to find in the documentation. -- Issac Trotts Web Operations Chair Silicon Valley Web Builder http://svwebbuilder.com