Dear list,

A while ago, Vadim asked for opinions on improving the efficiency of sample() with prob, i.e. sampling with replacement with weights (https://stat.ethz.ch/pipermail/r-devel/2004-September/030844.html). He did not post what he ended up doing about this problem, though.

I am having exactly the same problem. I need to sample with replacement from a population of size 10 million, with a fitness value for each individual. sample() is too slow for this purpose.

I implemented a bisection search algorithm. It is about 30% faster than the linear search one, but still not fast enough for me. (I can post the function if needed.) Does anybody have some good ideas? The only other idea I have is to use a faster (but worse) random number generator just for this application.

Thanks.
Bo
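[For reference, a minimal sketch of the bisection idea in plain R; this illustrates the general technique, not Bo's actual code, and the name sample_bisect is made up. The cumulative weights are built once, and each draw is located by binary search, here delegated to findInterval(), whose C code performs the bisection:]

## Sketch: weighted sampling with replacement via bisection over the
## cumulative weights. 'prob' need not sum to one.
sample_bisect <- function(n, prob) {
  cum <- cumsum(prob)                    # O(k) setup, k = length(prob)
  u <- runif(n, 0, cum[length(cum)])     # uniforms on [0, total weight)
  ## findInterval() does the binary search in C: O(n log k) overall.
  ## pmin() guards the (measure-zero) case u == total weight.
  pmin(findInterval(u, cum) + 1L, length(prob))
}

## Example: ten draws from three items with weights 1, 2 and 7
sample_bisect(10, c(1, 2, 7))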
In his "Introduction to Probability Models" Sheldon Ross describes (sec 11.4.1, 8th edition) the alias method for such weighted sampling. It is based on some decomposition of the original distribution (the weights) into a mixture of two-point distributions. I don't know the run-time complexity of the decomposition step. But once the decomposition is found the complexity of generating a sample is linear in the size of the sample. I haven't coded this because I had found an easier way to deal with the original problem I had at that time. HTH, Vadim> -----Original Message----- > From: r-devel-bounces at r-project.org > [mailto:r-devel-bounces at r-project.org] On Behalf Of Bo Peng > Sent: Tuesday, June 21, 2005 9:24 AM > To: r-devel at r-project.org > Subject: [Rd] efficiency of sample() with prob. > > Dear list, > > A while ago, Vadim asked opinions on improving efficiency of > sample() with prob, e.g. sample with replacement with weight. > ( > https://stat.ethz.ch/pipermail/r-devel/2004-September/030844.html ) He did not post what he ended up with this problem though.> > I am having exactly the same problem. I need to sample with > replacement from a population of size 10 million with fitness > values for each individual. sample() is too slow for this purpose. > > I implement a bisection search algorithm. It is about 30% > faster than the linear search one but still not good enough > to me. (I can post the function if needed). Does anybody have > some good ideas? The only other idea I have is using a faster > (but worse) random number generator just for this application. > > Thanks. > Bo > > ______________________________________________ > R-devel at r-project.org mailing list > https://stat.ethz.ch/mailman/listinfo/r-devel >
> You chose to report just one extremely favourable example,
<ignore>
> I do think you are being `careless with the truth'.

I chose to report whatever I got and whatever I felt the results showed. It was not a scientific report, and it was up to you (the R team) to validate my results and make further investigations. When I was asked to do more thorough research in this field (and thus take responsibility for my results), I said no, since I did not have enough expertise or time. After all, I could have chosen not to report anything at all. If this mailing list only accepts serious contributions from field professionals, it is time for me to quit.

> Hmm. A sample every 100ns and generating samples several times faster
> than tabulating them is something to be impressed by.
<ignore>
> Not in our tests. Did you try my 5 out of 10,000 non-zero probabilities
> distribution, for example?

No, I did not try it. Intuitively, Walker's method may be slow there because of the memory allocation and table setup; it should be used with a large sample and a small prob vector. The bisection method should be quicker when the prob vector is large (so linear search is slow) and the sample size is relatively small. I thought bisection would be uniformly quicker than the linear search method, since there is no additional cost other than a few more variables.

If you have done enough experiments and know the fastest method in each case, it should be straightforward to implement a hybrid method where the best method is chosen based on the lengths of the sample and the prob vector (see the sketch below). I might have been reporting partial results, but the 80-times speed-up in some cases was not faked.

Cheers,
Bo
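[A sketch of that hybrid dispatch, reusing the sample_bisect() and alias_setup()/alias_draw() illustrations from earlier in the thread. The crossover rules below are placeholders to be tuned by benchmarking, not values anyone in the thread reported:]

## Sketch of the suggested hybrid: choose a method from the sample size n
## and the number of weights k. The n >= k rule and the constant 32 are
## assumptions for illustration, not measured crossover points.
sample_weighted <- function(n, prob) {
  k <- length(prob)
  if (n >= k) {
    ## Many draws per weight: the O(k) alias table setup amortizes well.
    alias_draw(n, alias_setup(prob))
  } else if (k > 32) {
    ## Few draws over many weights: binary search beats a linear scan.
    sample_bisect(n, prob)
  } else {
    ## Tiny prob vector: the stock linear-search sample() is fine.
    sample.int(k, n, replace = TRUE, prob = prob)
  }
}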