I was experiencing a strange pattern of slowdowns when using sample.int(): sampling from one population would sometimes take 1000x longer than taking the same number of samples from a slightly larger population. For my application, this resulted in a runtime of several hours rather than a few seconds. Looking into it, I saw that sample.int() is hardcoded to switch algorithms when the population is larger than 1e7, and I was straddling this boundary:

    sample.int <- function(n, size = n, replace = FALSE, prob = NULL)
    {
        if (!replace && is.null(prob) && n > 1e7 && size <= n/2)
            .Internal(sample2(n, size))
        else .Internal(sample(n, size, replace, prob))
    }

do_sample2() takes the approach of drawing a candidate value and then checking whether it duplicates one already drawn. As long as the population size is much larger than the number of samples, this is efficient, which explains the "size <= n/2" check. But I'm not sure why the "n > 1e7" check is needed.

I put together some sample code to show the difference in timing between letting sample.int() choose the cutoff point and manually specifying the use of do_sample2():

    ### compare times for sample.int() vs internal function sample2()
    compareSampleTimes = function(popSizeList = c(1e5, 1e6, 1e7, 1e8, 1e9),
                                  sampleSizeList = c(10, 100, 1000, 10000),
                                  numReplications = 1000) {
        for (sampleSize in sampleSizeList) {
            for (popSize in popSizeList) {
                elapsed1 = system.time(replicate(numReplications,
                               sample.int(popSize, sampleSize)))[["elapsed"]]
                elapsed2 = system.time(replicate(numReplications,
                               .Internal(sample2(popSize, sampleSize))))[["elapsed"]]
                cat(sprintf("Sample %d from %.0e: %f vs %f seconds\n",
                            sampleSize, popSize, elapsed1, elapsed2))
            }
            cat("\n")
        }
    }

    compareSampleTimes()

https://gist.github.com/nkurz/8fa6ff3772a054294531

And here's the output, showing the 1000x slowdowns at a population size of 1e7 under R-3.2.2:

    $ Rscript compareSampleTimes.R
    Sample 10 from 1e+05: 0.133000 vs 0.003000 seconds
    Sample 10 from 1e+06: 0.931000 vs 0.003000 seconds
    Sample 10 from 1e+07: 13.190000 vs 0.003000 seconds
    Sample 10 from 1e+08: 0.004000 vs 0.003000 seconds
    Sample 10 from 1e+09: 0.004000 vs 0.002000 seconds

    Sample 100 from 1e+05: 0.180000 vs 0.007000 seconds
    Sample 100 from 1e+06: 0.908000 vs 0.006000 seconds
    Sample 100 from 1e+07: 13.161000 vs 0.007000 seconds
    Sample 100 from 1e+08: 0.007000 vs 0.006000 seconds
    Sample 100 from 1e+09: 0.007000 vs 0.006000 seconds

    Sample 1000 from 1e+05: 0.194000 vs 0.057000 seconds
    Sample 1000 from 1e+06: 1.084000 vs 0.049000 seconds
    Sample 1000 from 1e+07: 13.226000 vs 0.049000 seconds
    Sample 1000 from 1e+08: 0.047000 vs 0.046000 seconds
    Sample 1000 from 1e+09: 0.048000 vs 0.047000 seconds

    Sample 10000 from 1e+05: 0.414000 vs 0.712000 seconds
    Sample 10000 from 1e+06: 1.100000 vs 0.453000 seconds
    Sample 10000 from 1e+07: 14.882000 vs 0.443000 seconds
    Sample 10000 from 1e+08: 0.448000 vs 0.443000 seconds
    Sample 10000 from 1e+09: 0.445000 vs 0.443000 seconds

Since my usage involves taking samples of 1K from populations of about 10M, the do_sample2() approach is the clear winner: 0.05 seconds vs 13 seconds. I tested on a couple of machines and got similar results on both.

Would there be a downside to having sample.int() use the faster do_sample2() approach for populations of all sizes whenever the ratio of population size to sample size is large?

--nate
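
P.S. For anyone hitting the same boundary, here's a rough sketch of a possible user-level workaround (the name fastSampleInt is purely illustrative): the same dispatch as sample.int(), but with the "n > 1e7" condition dropped, so the duplicate-checking sample2() path is taken whenever the sample is at most half the population. It calls .Internal() directly, so it's only appropriate for experimentation, not for package code.

    ## Illustrative sketch only: same dispatch as sample.int(), minus the
    ## "n > 1e7" condition, so sample2() is used whenever the sample is
    ## no more than half the population. Uses .Internal() directly, so
    ## this is for experimentation rather than package code.
    fastSampleInt <- function(n, size = n, replace = FALSE, prob = NULL) {
        if (!replace && is.null(prob) && size <= n/2)
            .Internal(sample2(n, size))
        else
            .Internal(sample(n, size, replace, prob))
    }

    ## Example: 1000 samples from a population of 10 million, the case
    ## that currently falls on the slow side of the 1e7 cutoff.
    x <- fastSampleInt(1e7, 1000)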