thr3ads.net - R devel - [Rd] more efficient small subsets from moderate vectors? [Nov 2008]

If this information is useful, please help other people find it:
Share via:

Martin Morgan

2008-Nov-19 03:52 UTC

[Rd] more efficient small subsets from moderate vectors?

This creates a named vector of length nx, then repeatedly draws a
single sample from it.

lkup <- function(nx, m=10000L) {
    tbl <- seq_len(nx)
    names(tbl) <- as.character(tbl)
    v <- sample(names(tbl), m, replace=TRUE)
    system.time(for(k in v) tbl[k], gcFirst=TRUE)
}

There is an abrupt performance degredation at nx=1000
> lkup(1000)   user  system elapsed
  0.180   0.000   0.179> lkup(1001)   user  system elapsed
  2.444   0.016   2.462
          
This is because of the heuristic at stringSubscript.c:424, which
switches from a 'naive' nx * ns algorithm (ns is the number of
elements to be extracted, ns = 1 above) to a hash-based strategy when
nx * ns > 1.

It seems like the 'naive' algorithm takes nx * ns time, whereas the
hash-based algorithm takes C(nx) + ns, where C(nx) is the cost of
creating the hash. Guessing that the cost of building the hash is
about linear in nx, the results above suggest a new heuristic for
switching at ns of about 15.

(I don't quite follow the last sentence of the comment above
stringSubscript, so perhaps I misunderstand entirely).

Martin
> sessionInfo()R version 2.9.0 Under development (unstable) (2008-11-15 r46953)
i686-pc-linux-gnu

locale:
LC_CTYPE=en_US.UTF-8;LC_NUMERIC=C;LC_TIME=en_US.UTF-8;LC_COLLATE=en_US.UTF-8;LC_MONETARY=C;LC_MESSAGES=en_US.UTF-8;LC_PAPER=en_US.UTF-8;LC_NAME=C;LC_ADDRESS=C;LC_TELEPHONE=C;LC_MEASUREMENT=en_US.UTF-8;LC_IDENTIFICATION=C

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base

-- 
Martin Morgan
Computational Biology / Fred Hutchinson Cancer Research Center
1100 Fairview Ave. N.
PO Box 19024 Seattle, WA 98109

Location: Arnold Building M2 B169
Phone: (206) 667-2793

Prof Brian Ripley

2008-Nov-19 06:43 UTC

head link

[Rd] more efficient small subsets from moderate vectors?

Note that you are talking about very small times here.

Yes, it probably switches early for ns=1, but is that a common usage?
Do people really do lots of single lookups from long vectors -- if so they 
deserve what they get, and it would be better to use a hashed environment.
(Indeed a strategy considered but not implemented was to attach a hash for 
future use.)

On Tue, 18 Nov 2008, Martin Morgan wrote:
> This creates a named vector of length nx, then repeatedly draws a
> single sample from it.
>
> lkup <- function(nx, m=10000L) {
>    tbl <- seq_len(nx)
>    names(tbl) <- as.character(tbl)
>    v <- sample(names(tbl), m, replace=TRUE)
>    system.time(for(k in v) tbl[k], gcFirst=TRUE)
> }
>
> There is an abrupt performance degredation at nx=1000
>
>> lkup(1000)
>   user  system elapsed
>  0.180   0.000   0.179
>> lkup(1001)
>   user  system elapsed
>  2.444   0.016   2.462
>
> This is because of the heuristic at stringSubscript.c:424, which
> switches from a 'naive' nx * ns algorithm (ns is the number of
> elements to be extracted, ns = 1 above) to a hash-based strategy when
> nx * ns > 1.
>
> It seems like the 'naive' algorithm takes nx * ns time, whereas the
> hash-based algorithm takes C(nx) + ns, where C(nx) is the cost of
> creating the hash. Guessing that the cost of building the hash is
> about linear in nx, the results above suggest a new heuristic for
> switching at ns of about 15.
>
> (I don't quite follow the last sentence of the comment above
> stringSubscript, so perhaps I misunderstand entirely).
>
> Martin
>
>> sessionInfo()
> R version 2.9.0 Under development (unstable) (2008-11-15 r46953)
> i686-pc-linux-gnu
>
> locale:
>
LC_CTYPE=en_US.UTF-8;LC_NUMERIC=C;LC_TIME=en_US.UTF-8;LC_COLLATE=en_US.UTF-8;LC_MONETARY=C;LC_MESSAGES=en_US.UTF-8;LC_PAPER=en_US.UTF-8;LC_NAME=C;LC_ADDRESS=C;LC_TELEPHONE=C;LC_MEASUREMENT=en_US.UTF-8;LC_IDENTIFICATION=C
>
> attached base packages:
> [1] stats     graphics  grDevices utils     datasets  methods   base
>
> -- 
> Martin Morgan
> Computational Biology / Fred Hutchinson Cancer Research Center
> 1100 Fairview Ave. N.
> PO Box 19024 Seattle, WA 98109
>
> Location: Arnold Building M2 B169
> Phone: (206) 667-2793
>
> ______________________________________________
> R-devel at r-project.org mailing list
> https://stat.ethz.ch/mailman/listinfo/r-devel
>
-- 
Brian D. Ripley,                  ripley at stats.ox.ac.uk
Professor of Applied Statistics,  http://www.stats.ox.ac.uk/~ripley/
University of Oxford,             Tel:  +44 1865 272861 (self)
1 South Parks Road,                     +44 1865 272866 (PA)
Oxford OX1 3TG, UK                Fax:  +44 1865 272595

Maybe Matching Threads

Search for more reasonably related threads

R devel - Nov 2008 - more efficient small subsets from moderate vectors?

[Rd] more efficient small subsets from moderate vectors?

[Rd] more efficient small subsets from moderate vectors?

Maybe Matching Threads