Martin Morgan
2008-Nov-19 03:52 UTC
[Rd] more efficient small subsets from moderate vectors?
This creates a named vector of length nx, then repeatedly draws a single sample from it. lkup <- function(nx, m=10000L) { tbl <- seq_len(nx) names(tbl) <- as.character(tbl) v <- sample(names(tbl), m, replace=TRUE) system.time(for(k in v) tbl[k], gcFirst=TRUE) } There is an abrupt performance degredation at nx=1000> lkup(1000)user system elapsed 0.180 0.000 0.179> lkup(1001)user system elapsed 2.444 0.016 2.462 This is because of the heuristic at stringSubscript.c:424, which switches from a 'naive' nx * ns algorithm (ns is the number of elements to be extracted, ns = 1 above) to a hash-based strategy when nx * ns > 1. It seems like the 'naive' algorithm takes nx * ns time, whereas the hash-based algorithm takes C(nx) + ns, where C(nx) is the cost of creating the hash. Guessing that the cost of building the hash is about linear in nx, the results above suggest a new heuristic for switching at ns of about 15. (I don't quite follow the last sentence of the comment above stringSubscript, so perhaps I misunderstand entirely). Martin> sessionInfo()R version 2.9.0 Under development (unstable) (2008-11-15 r46953) i686-pc-linux-gnu locale: LC_CTYPE=en_US.UTF-8;LC_NUMERIC=C;LC_TIME=en_US.UTF-8;LC_COLLATE=en_US.UTF-8;LC_MONETARY=C;LC_MESSAGES=en_US.UTF-8;LC_PAPER=en_US.UTF-8;LC_NAME=C;LC_ADDRESS=C;LC_TELEPHONE=C;LC_MEASUREMENT=en_US.UTF-8;LC_IDENTIFICATION=C attached base packages: [1] stats graphics grDevices utils datasets methods base -- Martin Morgan Computational Biology / Fred Hutchinson Cancer Research Center 1100 Fairview Ave. N. PO Box 19024 Seattle, WA 98109 Location: Arnold Building M2 B169 Phone: (206) 667-2793
Prof Brian Ripley
2008-Nov-19 06:43 UTC
[Rd] more efficient small subsets from moderate vectors?
Note that you are talking about very small times here. Yes, it probably switches early for ns=1, but is that a common usage? Do people really do lots of single lookups from long vectors -- if so they deserve what they get, and it would be better to use a hashed environment. (Indeed a strategy considered but not implemented was to attach a hash for future use.) On Tue, 18 Nov 2008, Martin Morgan wrote:> This creates a named vector of length nx, then repeatedly draws a > single sample from it. > > lkup <- function(nx, m=10000L) { > tbl <- seq_len(nx) > names(tbl) <- as.character(tbl) > v <- sample(names(tbl), m, replace=TRUE) > system.time(for(k in v) tbl[k], gcFirst=TRUE) > } > > There is an abrupt performance degredation at nx=1000 > >> lkup(1000) > user system elapsed > 0.180 0.000 0.179 >> lkup(1001) > user system elapsed > 2.444 0.016 2.462 > > This is because of the heuristic at stringSubscript.c:424, which > switches from a 'naive' nx * ns algorithm (ns is the number of > elements to be extracted, ns = 1 above) to a hash-based strategy when > nx * ns > 1. > > It seems like the 'naive' algorithm takes nx * ns time, whereas the > hash-based algorithm takes C(nx) + ns, where C(nx) is the cost of > creating the hash. Guessing that the cost of building the hash is > about linear in nx, the results above suggest a new heuristic for > switching at ns of about 15. > > (I don't quite follow the last sentence of the comment above > stringSubscript, so perhaps I misunderstand entirely). > > Martin > >> sessionInfo() > R version 2.9.0 Under development (unstable) (2008-11-15 r46953) > i686-pc-linux-gnu > > locale: > LC_CTYPE=en_US.UTF-8;LC_NUMERIC=C;LC_TIME=en_US.UTF-8;LC_COLLATE=en_US.UTF-8;LC_MONETARY=C;LC_MESSAGES=en_US.UTF-8;LC_PAPER=en_US.UTF-8;LC_NAME=C;LC_ADDRESS=C;LC_TELEPHONE=C;LC_MEASUREMENT=en_US.UTF-8;LC_IDENTIFICATION=C > > attached base packages: > [1] stats graphics grDevices utils datasets methods base > > -- > Martin Morgan > Computational Biology / Fred Hutchinson Cancer Research Center > 1100 Fairview Ave. N. > PO Box 19024 Seattle, WA 98109 > > Location: Arnold Building M2 B169 > Phone: (206) 667-2793 > > ______________________________________________ > R-devel at r-project.org mailing list > https://stat.ethz.ch/mailman/listinfo/r-devel >-- Brian D. Ripley, ripley at stats.ox.ac.uk Professor of Applied Statistics, http://www.stats.ox.ac.uk/~ripley/ University of Oxford, Tel: +44 1865 272861 (self) 1 South Parks Road, +44 1865 272866 (PA) Oxford OX1 3TG, UK Fax: +44 1865 272595