Pavel S. Ruzankin
2017-Oct-17 17:55 UTC
[Rd] uniform sampling without replacement algorithm
Let us consider the current uniform sampling without replacement algorithm. It resides in function do_sample in https://svn.r-project.org/R/trunk/src/main/random.c Its complexity is obviously O(n), where the sample is selected from 1...n, since the algorithm has to create a vector of length n. So when the sample size is much lesser than n, the algorithm is not effective. Algorithms with average complexity O(s log s), were s is the sample size, were described long ago. E.g. see https://www.degruyter.com/view/j/mcma.1999.5.issue-1/mcma.1999.5.1.39/mcma.1999.5.1.39.xml Here the Tree algorithm has complexity O(s log s). I suppose that there may be algorithms with complexity close to s. Is somebody planning to implement some more effective algorithm?
Pavel S. Ruzankin
2017-Oct-18 08:08 UTC
[Rd] uniform sampling without replacement algorithm
If somebody is interested I can write the code. But somebody else has to add the code for handling int / long int / double cases, since I do not have enough experience in that.
Pavel S. Ruzankin
2017-Oct-18 13:49 UTC
[Rd] uniform sampling without replacement algorithm
See also: P. Gupta, G. P. Bhattacharjee. (1984) An efficient algorithm for random sampling without replacement. International Journal of Computer Mathematics 16:4, pages 201-209. http://dx.doi.org/10.1080/00207168408803438 Teuhola, J. and Nevalainen, O. 1982. Two efficient algorithms for random sampling without replacement. /IJCM/, 11(2): 127?140. http://dx.doi.org/10.1080/00207168208803304 In the latter paper the authors claim that their algorithms have O(s) complexity. I doubt that this statement is correct. Is it? [[alternative HTML version deleted]]
Splus used a similar method for sampling from "bigdata" objects. One problem was that sample() is used both for creating a sample and for scrambling the order of a vector. Scrambling the order of a big vector wastes time. It would be nice to be able to tell sample() that we don't care about the order. Bill Dunlap TIBCO Software wdunlap tibco.com On Tue, Oct 17, 2017 at 10:55 AM, Pavel S. Ruzankin <ruzankin at math.nsc.ru> wrote:> Let us consider the current uniform sampling without replacement > algorithm. It resides in function do_sample in > https://svn.r-project.org/R/trunk/src/main/random.c > Its complexity is obviously O(n), where the sample is selected from 1...n, > since the algorithm has to create a vector of length n. So when the sample > size is much lesser than n, the algorithm is not effective. Algorithms with > average complexity O(s log s), were s is the sample size, were described > long ago. E.g. see > https://www.degruyter.com/view/j/mcma.1999.5.issue-1/mcma. > 1999.5.1.39/mcma.1999.5.1.39.xml > Here the Tree algorithm has complexity O(s log s). I suppose that there > may be algorithms with complexity close to s. Is somebody planning to > implement some more effective algorithm? > > ______________________________________________ > R-devel at r-project.org mailing list > https://stat.ethz.ch/mailman/listinfo/r-devel >[[alternative HTML version deleted]]
Pavel S. Ruzankin
2017-Oct-18 14:54 UTC
[Rd] uniform sampling without replacement algorithm
The binary tree algorithm does not need additional scrambling. I have written the R code for the algorithm in the last answer at: https://stackoverflow.com/questions/311703/algorithm-for-sampling-without-replacement/46807110#46807110 However, the algorithm will probably be outperformed by hash table algorithms for relatively large sample sizes.
Seemingly Similar Threads
- uniform sampling without replacement algorithm
- uniform sampling without replacement algorithm
- memory leak in 3.3.1 rebalance?
- "layout is NULL", "Failed to get node-uuid for [...] and other errors during rebalancing in 3.3.1
- Re: supermin-helper: ext2: parent directory not found: /mnt/bigdata/var: File not found by ext2_lookup