One more question about avoiding copies when modifying lists. I would
like to call a function (call it 'f') that does an operation on a
large array according to a given index. For example:
f = function(data, index) sum(data[index])
The idea is to repeatedly call f() with the same 'data' but different
'index' arguments. For reasons I won't get into, I need to call the
function via do.call, so I create a list that holds the arguments
and call the function repeatedly via do.call, as in this rather
trivial example:
> n = 2e8;
> set.seed(1);
> x = rnorm(n);
> gc();
            used   (Mb) gc trigger   (Mb)  max used   (Mb)
Ncells    182412    9.8     407500   21.8    350000   18.7
Vcells 200278475 1528.1  221144237 1687.2 200519577 1529.9
## x takes roughly 1.5GB (2e8 doubles at 8 bytes each), which makes sense
> args = list(data = x);
>
> gc();
            used   (Mb) gc trigger   (Mb)  max used   (Mb)
Ncells    182422    9.8     407500   21.8    350000   18.7
Vcells 400278489 3053.9  441644452 3369.5 400598513 3056.4
## Here x seems to have been copied since memory usage doubled
>
> system.time( {
+ for (i in 1:4)
+ {
+ args$index = i:(10+3*i)
+ do.call(f, args);
+ print(gc())
+ }
+ })
            used   (Mb) gc trigger   (Mb)  max used   (Mb)
Ncells    182900    9.8     407500   21.8    350000   18.7
Vcells 400279034 3053.9  487077007 3716.2 401240778 3061.3
            used   (Mb) gc trigger   (Mb)  max used   (Mb)
Ncells    182994    9.8     407500   21.8    350000   18.7
Vcells 400279163 3053.9  630538264 4810.7 600279205 4579.8
            used   (Mb) gc trigger   (Mb)  max used   (Mb)
Ncells    182994    9.8     407500   21.8    350000   18.7
Vcells 400279171 3053.9  630538264 4810.7 600279358 4579.8
            used   (Mb) gc trigger   (Mb)  max used   (Mb)
Ncells    182994    9.8     407500   21.8    350000   18.7
Vcells 400279171 3053.9  630538264 4810.7 600279376 4579.8
   user  system elapsed
  0.808   0.617   1.447
In the second iteration the interpreter apparently needed one more
(temporary) copy of x, since max used memory went up by another 1.5GB
(from 3061.3 Mb to 4579.8 Mb). Note also that the timing suggests that
much of the time was spent copying memory.
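A minimal sketch of how the copies could be checked more directly than by reading gc() output, assuming R was built with memory profiling so that tracemem() is available. The reduced n, the can_trace flag and the res variable are introduced here for illustration only. Whether and where tracemem lines appear depends on the R version, so no expected output is shown.

```r
# Small-scale version of the experiment; n is reduced so it runs quickly.
f <- function(data, index) sum(data[index])

set.seed(1)
x <- rnorm(1e6)

# tracemem() only works if R was compiled with memory profiling;
# capabilities("profmem") reports whether that is the case.
can_trace <- isTRUE(capabilities("profmem"))
if (can_trace) tracemem(x)   # from now on, each duplication of x prints a "tracemem[...]" line

args <- list(data = x)       # a copy made here would be reported
for (i in 1:4) {
  args$index <- i:(10 + 3 * i)   # a copy made while modifying the list would be reported
  res <- do.call(f, args)        # a copy made during the call would be reported
}

if (can_trace) untracemem(x)
```

Each reported line shows the old and new address of the duplicated vector, which makes it possible to tell which statement triggered the copy.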
This code can of course be written by calling f directly: start a new
session and use the code
> f = function(data, index) sum(data[index])
> n = 2e8;
> set.seed(1);
> x = rnorm(n);
> gc();
            used   (Mb) gc trigger   (Mb)  max used   (Mb)
Ncells    182412    9.8     407500   21.8    350000   18.7
Vcells 200278475 1528.1  221144237 1687.2 200519577 1529.9
>
> system.time( {
+ for (i in 1:4)
+ {
+ index = i:(10+3*i)
+ f(x, index)
+ print(gc())
+ }
+ })
            used   (Mb) gc trigger   (Mb)  max used   (Mb)
Ncells    183320    9.8     407500   21.8    350000   18.7
Vcells 200279810 1528.1  243975520 1861.4 201806004 1539.7
            used   (Mb) gc trigger   (Mb)  max used   (Mb)
Ncells    183414    9.8     407500   21.8    350000   18.7
Vcells 200279939 1528.1  256254296 1955.1 201806004 1539.7
            used   (Mb) gc trigger   (Mb)  max used   (Mb)
Ncells    183414    9.8     407500   21.8    350000   18.7
Vcells 200279947 1528.1  269147010 2053.5 201806004 1539.7
            used   (Mb) gc trigger   (Mb)  max used   (Mb)
Ncells    183414    9.8     407500   21.8    350000   18.7
Vcells 200279947 1528.1  282684360 2156.8 201806004 1539.7
   user  system elapsed
  0.059   0.000   0.060
Here x was not copied, and elapsed time dropped from 1.447 s to
0.060 s, roughly a factor of 24.
My question is, can the list operations be made more efficient or can
one use the do.call construct or something equivalent without having
all these extra copies and the memory and time overhead they incur?
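A minimal sketch of one do.call variant that might be worth measuring: the argument list stores the name of x via quote(x) instead of its value, so the list itself never holds the large vector. This is an assumption, not something measured on the setup above. It also relies on x being visible from the frame in which do.call() is invoked. The reduced n and the res and direct variables are introduced here for illustration only.

```r
f <- function(data, index) sum(data[index])

set.seed(1)
x <- rnorm(1e6)              # much smaller than 2e8 so the sketch runs quickly

# Store the name of x rather than its value; the list then holds only a symbol.
args <- list(data = quote(x))

res <- numeric(4)
for (i in 1:4) {
  args$index <- i:(10 + 3 * i)
  # do.call() builds the call f(data = x, index = ...) and evaluates it in the
  # calling frame, where the symbol x is looked up.
  res[i] <- do.call(f, args)
}

# Compare against calling f() directly with the same indices
direct <- sapply(1:4, function(i) f(x, i:(10 + 3 * i)))
stopifnot(isTRUE(all.equal(res, direct)))
```

Modifying args inside the loop then only touches a short list containing a symbol and a small integer vector, so there is nothing large for the list assignment to duplicate.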
Thanks,
Peter
> sessionInfo()
R version 3.0.1 Patched (2013-06-26 r63071)
Platform: x86_64-unknown-linux-gnu (64-bit)
locale:
[1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C
[3] LC_TIME=en_US.UTF-8 LC_COLLATE=en_US.UTF-8
[5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8
[7] LC_PAPER=C LC_NAME=C
[9] LC_ADDRESS=C LC_TELEPHONE=C
[11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
attached base packages:
[1] stats graphics grDevices utils datasets methods base