Hi there,
I would like your expert opinion on the following memory-related questions.
The output below was obtained with R-2.6.2, compiled with
--enable-memory-profiling, on Ubuntu Linux.
======================================================================>>>
Code and output 1:
> gc( )
         used (Mb) gc trigger (Mb) max used (Mb)
Ncells 131180  7.1     350000 18.7   350000 18.7
Vcells 136261  1.1     786432  6.0   573372  4.4
> nn <- 1000000
> ll <- list(xx = rnorm(nn), yy = rnorm(nn))
> tracemem(ll)
[1] "<0x1e32c38>"
> tracemem(ll$xx)
[1] "<0x2af22e144010>"
> tracemem(ll$yy)
[1] "<0x2af22e8e6010>"
> ll$xx <- seq_len(nn)
> untracemem(ll)
> untracemem(ll$xx)
> untracemem(ll$yy)
>
> tracemem(ll)
[1] "<0x1e32c38>"
> tracemem(ll$xx)
[1] "<0x2af22f088010>"
> tracemem(ll$yy)
[1] "<0x2af22e8e6010>"
> gc( )
          used (Mb) gc trigger (Mb) max used (Mb)
Ncells  131223  7.1     350000 18.7   350000 18.7
Vcells 1636589 12.5    3013550 23.0  2636590 20.2
>
>>> Observation:
Note that tracemem(ll) and tracemem(ll$yy) print the same memory locations
("<0x1e32c38>" and "<0x2af22e8e6010>", respectively) before and after the
    ll$xx <- seq_len(nn)
statement.
My hunch is:
the statement 'll$xx <- seq_len(nn)' invokes the C macro 'SET_VECTOR_ELT'
for VECSXP instead of a regular 'replacement' function '$<-.list' (which
does not exist), and so no automatic duplication of 'll' and 'll$yy' via
'*tmp*' takes place.
Am I right?
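The check above can be condensed into a short script that compares the
address tracemem() reports for ll$yy before and after the assignment. This is
only a sketch under stated assumptions: it needs an R build with memory
profiling, it uses a smaller nn than the transcript, and the helper addr_of()
is a made-up name for illustration. Newer R versions may duplicate lists
differently from R-2.6.2, so the result is printed rather than assumed.

```r
# Sketch: does replacing one list element move the other element?
# Assumes R was built with --enable-memory-profiling.
stopifnot(capabilities("profmem"))

# Hypothetical helper: address of an object as reported by tracemem(),
# with tracing switched off again straight away.
addr_of <- function(x) {
  a <- tracemem(x)
  untracemem(x)
  a
}

nn <- 100000   # smaller than in the transcript, to keep the run cheap
ll <- list(xx = rnorm(nn), yy = rnorm(nn))

yy_before <- addr_of(ll$yy)
ll$xx <- seq_len(nn)          # replace one element of the list
yy_after <- addr_of(ll$yy)

# TRUE means ll$yy was not duplicated by the assignment
yy_unmoved <- identical(yy_before, yy_after)
print(yy_unmoved)
```

Comparing the returned strings avoids having to read the addresses off the
console by eye.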
======================================================================>>>
Code and output 2:
> gc( )
         used (Mb) gc trigger (Mb) max used (Mb)
Ncells 131180  7.1     350000 18.7   350000 18.7
Vcells 136261  1.1     786432  6.0   573372  4.4
> nn <- 1000000
> dd <- data.frame(xx = rnorm(nn), yy = rnorm(nn))
> tracemem(dd)
[1] "<0x1e32d88>"
> tracemem(dd$xx)
[1] "<0x3d2f790>"
> tracemem(dd$yy)
[1] "<0x2ded330>"
> dd$xx <- seq_len(nn)
tracemem[0x1e32d88 -> 0x1e32f80]:
tracemem[0x1e32f80 -> 0x1e32c70]: $<-.data.frame $<-
tracemem[0x1e32c70 -> 0x1e32d50]: $<-.data.frame $<-
> untracemem(dd)
> untracemem(dd$xx)
> untracemem(dd$yy)
>
> tracemem(dd)
[1] "<0x1e32d50>"
> tracemem(dd$xx)
[1] "<0x6725bb0>"
> tracemem(dd$yy)
[1] "<0x5f84980>"
> gc( )
          used (Mb) gc trigger (Mb) max used (Mb)
Ncells  132690  7.1     350000 18.7   350000 18.7
Vcells 1636790 12.5    8328568 63.6  9637580 73.6
>
>>> Observation:
Note that tracemem(dd) and tracemem(dd$yy) print different memory locations
("<0x1e32d88>" --> "<0x1e32d50>" and "<0x2ded330>" --> "<0x5f84980>",
respectively) before and after the
    dd$xx <- seq_len(nn)
statement.
My hunch is:
the statement 'dd$xx <- seq_len(nn)' invokes the regular 'replacement'
function '$<-.data.frame', which results in automatic duplication of the
whole of 'dd' (including dd$yy) via '*tmp*'.
Am I right?
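The dispatch part of both hunches can be checked directly from R, without
tracemem(). The sketch below asks whether an R-level S3 method for `$<-` is
registered for each class. It assumes only base packages are attached, and
the variable names are chosen for illustration.

```r
# Sketch: is there an R-level method behind `$<-` for each class?
# optional = TRUE makes getS3method() return NULL instead of failing
# when no method is found.
df_method   <- getS3method("$<-", "data.frame", optional = TRUE)
list_method <- getS3method("$<-", "list", optional = TRUE)

print(is.function(df_method))   # is dd$xx <- value handled by an R function?
print(is.null(list_method))     # is there no such method for plain lists?

# `$<-` itself is a primitive, so without a method to dispatch to,
# the assignment on a plain list is handled by internal C code.
print(is.primitive(`$<-`))
```

This only shows where the assignment is handled. Whether and how deeply the
object is duplicated along the way still has to be observed with tracemem().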
======================================================================>>>
Observation.
Compare the gc( ) output at the end of the two code sections above. The
list-based code (i.e., ll) shows a lower gc trigger and a lower max used
than the data.frame-based code (i.e., dd):
          used (Mb) gc trigger (Mb) max used (Mb)
Vcells 1636589 12.5    3013550 23.0  2636590 20.2
vs.
Vcells 1636790 12.5    8328568 63.6  9637580 73.6
On the basis of these observations, to avoid unnecessary duplication and to
reduce peak memory usage, I was thinking of writing an S4 class (called
DataFrame) that stores the columns of the data internally as a list (and
redefines the '[[', '[[<-', '[', ... operations). Also, to avoid
unnecessary copying, I was thinking of keeping the internal store of data
(i.e., the list) inside an environment slot, since environments are not
copied in R.
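A minimal sketch of such a class might look as follows. The slot name
`store`, the binding name `columns`, and the constructor are assumptions made
for illustration, and only '[[' and '[[<-' are covered. The last lines show
one consequence of the environment slot: two variables that refer to the same
object share their columns.

```r
library(methods)

# Hypothetical class: the columns live in a list bound inside an environment.
setClass("DataFrame", representation(store = "environment"))

# Hypothetical constructor: one named argument per column.
DataFrame <- function(...) {
  store <- new.env(parent = emptyenv())
  store$columns <- list(...)
  new("DataFrame", store = store)
}

# Extract a column from the list held in the environment.
setMethod("[[", "DataFrame", function(x, i, j, ...) {
  x@store$columns[[i]]
})

# Replace a column by updating the binding in the environment.
# The S4 object itself is returned unchanged.
setReplaceMethod("[[", "DataFrame", function(x, i, j, ..., value) {
  store <- x@store
  store$columns[[i]] <- value
  x
})

d1 <- DataFrame(xx = rnorm(5), yy = rnorm(5))
d2 <- d1                  # copies the S4 object, but not the environment
d2[["xx"]] <- 1:5         # modifies the shared store

# d1 sees the change too, because both objects point at one environment
shared <- identical(d1[["xx"]], 1:5)
print(shared)
```

In other words, the class no longer has R's usual copy-on-modify behaviour,
so any code that relies on 'd2 <- d1' making an independent copy would need
an explicit copy method.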
Is this a good idea?
Has this been done before?
Am I missing something?
Thanks a lot for your help in advance.
Regards,
gopi.