On 2/19/20 3:55 AM, Stefan Schreiber wrote:
> I have posted this question on R-help where it was suggested to me
> that I might get a better response on R-devel. So far I have gotten no
> response. The post I am talking about is here:
> https://stat.ethz.ch/pipermail/r-help/2020-February/465700.html
>
> My apologies for cross-posting, which I am aware is impolite and I
> should have posted on R-devel in the first place - but I wasn't sure.
>
> Here is my question again:
>
> I am currently working through Advanced R by H. Wickham and came
> across the `lobstr::obj_size` function which appears to calculate the
> size of an object by taking into account whether the same object has
> been referenced multiple times, e.g.
>
> x <- runif(1e6)
> y <- list(x, x, x)
> lobstr::obj_size(y)
> # 8,000,128 B
>
> # versus:
> object.size(y)
> # 24000224 bytes
>
> Reading through `?object.size` in the "Details" it reads: [...] but
> does not detect if elements of a list are shared [...].
>
> My questions are:
>
> (1) is the result of `obj_size()` the "correct" one when it comes to
> actual size used in memory?
>
> (2) And if yes, why wouldn't `object.size()` be updated to reflect the
> more precise calculation of an object in question similar to
> `obj_size()`?
Please keep in mind that "actual size used in memory" is an elusive
concept, particularly in managed languages such as R. Even in native
languages you have on-demand paging: not all data is in physical memory,
some pages may exist only implicitly (all zeros), some may be swapped
out, some may be backed by files (code), etc. You also have internal and
external fragmentation caused by the C library memory allocator, plus
the overhead of object headers and allocator meta-data. On top of that
you have the managed heap: more internal and external fragmentation,
more headers.

Moreover, the memory representation may change invisibly and sometimes
in surprising ways: in R there is copy-on-write, hence the sharing, but
also compact objects via ALTREP (e.g. integer sequences). R has the
symbol table and the string cache (strings are interned, as in some
other language runtimes, so the price is paid only once for each
distinct string). In principle, managed runtimes could do much more:
say compression of objects with adaptive decompression; some systems
internally split the representation of large objects depending on their
size, with additional overheads; systems could have transparent
de-duplication (not only for strings); some choices could adapt to
memory pressure. Then in R, packages can maintain memory tied to
specific R objects, linked say via external pointers, and again there
may be no meaningful way to map that usage to individual objects.
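As a rough sketch of the compact-object point (the numbers below are
assumptions; they depend on the R version and platform, and
?object.size documents that compact representations may be
over-estimated):

x <- 1:1e6              # ALTREP compact sequence, stored as little more than start/end
lobstr::obj_size(x)     # a few hundred bytes: reflects the compact representation
object.size(x)          # roughly 4 MB: sized as if fully materialized
x[1] <- 2L              # an innocuous assignment materializes the full vector
lobstr::obj_size(x)     # now roughly 4 MB as well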
Not only that: even what counts as the size of an object tree is not
easy to define. Nor is that information very useful, because innocuous
changes may alter it in arbitrary ways outside the user's control:
there is no good intuition for how much the size will change under
intended application-level modifications of the tree. Users of the
system can hardly build a reliable mental model of the memory usage,
because it depends on the internal design of the virtual machine, which
in addition can change over time.
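To make that concrete, a sketch based on the example above (the figures
are assumptions, roughly what a 64-bit build would report):

x <- runif(1e6)
y <- list(x, x, x)
lobstr::obj_size(y)     # ~8 MB: all three elements share one allocation
y[[1]][1] <- 0          # trivial user-level change; copy-on-write duplicates element 1
lobstr::obj_size(y)     # ~16 MB: the sharing-aware size has doubled
object.size(y)          # ~24 MB before and after: conservative, but stable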
As the concept is elusive, the best advice is: don't ask for the object
size; find some other solution to your problem. In some cases it makes
sense to ask for the object size in an application-specific way, and
then implement object-size methods for the specific application classes
(e.g. structures holding strings would sum up the number of characters
in the strings, etc.). Such an application-specific measure may be
inspired by some particular (perhaps trivial) serialization format.
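A minimal sketch of such an application-specific measure (the generic
app_size and the class text_corpus are made-up names for illustration):

app_size <- function(x, ...) UseMethod("app_size")
app_size.character <- function(x, ...) sum(nchar(x, type = "bytes"))
app_size.text_corpus <- function(x, ...) {
  # assuming a text_corpus is a list of character vectors
  sum(vapply(unclass(x), app_size, numeric(1)))
}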
I've used object.size() myself only for profiling, to quickly separate
objects that are probably very large from objects of trivial size,
where these nuances did not matter; but for that I knew roughly what
the objects were (e.g. that they were not hiding things in
environments).
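For that kind of rough profiling, something along these lines is
usually enough (a sketch; it only looks at the global environment and
ignores all the nuances above):

sizes <- vapply(ls(envir = globalenv()),
                function(nm) as.numeric(object.size(get(nm, envir = globalenv()))),
                numeric(1))
head(sort(sizes, decreasing = TRUE), 10)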
Intuitively, the choices made by object.size() in R are conservative:
they provide an over-approximation that makes some intuitive sense at
user level, and they reduce surprises of significant size expansion due
to minimal updates. The choices and their limitations are documented. I
think this is at least no worse than, say, taking sharing into account,
looking at the current "size" of compact objects, etc. One could
provide more options to object.size(), but I don't think that would be
useful.
Best,
Tomas
>
> There are probably valid reasons for this and any insight would be
> greatly appreciated.
>