On Thu, Jan 15, 2015 at 3:55 PM, Michael Lawrence
<lawrence.michael at gene.com> wrote:
> Just wanted to start a discussion on whether R could ship with more
> appropriate GC parameters.
I've been doing a number of similar measurements, and have come to the
same conclusion.  R is currently very conservative about memory usage,
and this leads to unnecessarily poor performance on certain problems.
Changing the defaults to sizes that are more appropriate for modern
machines can often produce a 2x speedup.
On Sat, Jan 17, 2015 at 8:39 AM, <luke-tierney at uiowa.edu> wrote:
> Martin Morgan discussed this a year or so ago and as I recall bumped
> up these values to the current defaults. I don't recall details about
> why we didn't go higher -- maybe Martin does.
I just checked, and it doesn't seem that any of the relevant values
have been increased in the last ten years.  Do you have a link to the
discussion you recall so we can see why the changes weren't made?
> I suspect the main concern would be with small memory machines in student
> labs and less developed countries.
While a reasonable concern, I'm doubtful there are many machines for
which the current numbers are optimal.  The current minimum size
increases for node and vector heaps are 40KB and 80KB respectively.
This grows as the heap grows (min + .05 * heap), but still means that
we do many more expensive garbage collections while growing than we
need to.  Paradoxically, the SMALL_MEMORY compile option (which is
suggested for computers with up to 32MB of RAM) has slightly larger
minimums, at 50KB and 100KB.
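To put a rough number on "many more collections", here is a minimal sketch (not code from R) that counts how many growth steps the rule min + frac * heap needs to reach a target size.  The function name growth_steps and its parameters are my own, and it assumes every growth step costs at least one collection and ignores the occupancy test that triggers growth.

```c
#include <assert.h>

/* Count how many growth steps take a heap from `start` to at least
   `target` when each step adds (incr_min + incr_frac * size), truncated
   to an integer as in AdjustHeapSize.  Units are whatever the caller
   uses consistently.  Hypothetical helper, not part of R. */
static int growth_steps(unsigned long long start, unsigned long long target,
                        unsigned long long incr_min, double incr_frac)
{
    int steps = 0;
    unsigned long long size = start;

    while (size < target) {
        unsigned long long change =
            (unsigned long long)(incr_min + incr_frac * size);
        /* A zero increment would make the loop run forever. */
        assert(change > 0);
        size += change;
        steps++;
    }
    return steps;
}
```

Comparing growth_steps(350000, 35000000, 40000, 0.05) with the same call using a minimum increment of 4000000 shows how much of the growth path is spent taking small steps.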
I think we'd get significant benefit for most users by being less
conservative about memory consumption.    The exact sizes should be
discussed, but with RAM costing about $10/GB it doesn't seem
unreasonable to assume most machines running R have multiple GB
installed, and those that don't will quite likely be running an OS
that needs a custom compiled binary anyway.
I could be way off, but a 10MB start with 1MB minimum increments for
SMALL_MEMORY, a 100MB start with 10MB increments for NORMAL_MEMORY,
and a 1GB start with 100MB increments for LARGE_MEMORY might be a
reasonable spread.
Or one could go even larger, noting that on most systems,
overcommitted memory is not a problem until it is used.  Until we
write to it, it doesn't actually use physical RAM, just virtual
address space.  Or we could stay small, but make it possible to
programmatically increase the granularity from within R.
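As an illustration of the overcommit point, here is a small sketch (again, not R code) that reserves address space with mmap and then touches only a few pages.  It assumes a Linux-like system where MAP_ANONYMOUS and MAP_NORESERVE are available; the function names are hypothetical, and R's allocator does not work this way today.

```c
#define _DEFAULT_SOURCE   /* exposes MAP_ANONYMOUS / MAP_NORESERVE on glibc */
#include <assert.h>
#include <stddef.h>
#include <sys/mman.h>

/* Reserve `bytes` of virtual address space without touching it.
   Returns NULL on failure.  MAP_NORESERVE is not POSIX, so this is an
   assumption about the target platform. */
static void *reserve_address_space(size_t bytes)
{
    void *p = mmap(NULL, bytes, PROT_READ | PROT_WRITE,
                   MAP_PRIVATE | MAP_ANONYMOUS | MAP_NORESERVE, -1, 0);
    return p == MAP_FAILED ? NULL : p;
}

/* Write one byte per page in the first `bytes` of the region.  Only the
   pages written here need to be backed by physical memory. */
static void touch_pages(void *region, size_t bytes, size_t page_size)
{
    volatile unsigned char *c = region;

    assert(page_size > 0);
    for (size_t off = 0; off < bytes; off += page_size)
        c[off] = 1;
}
```

Reserving 1GB with reserve_address_space and touching 1MB of it with touch_pages should leave the resident set close to 1MB on such systems, although the exact accounting depends on the kernel's overcommit settings.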
For ease of reference, here are the relevant sections of code:
https://github.com/wch/r-source/blob/master/src/include/Defn.h#L217
(ripley last authored on Jan 26, 2000 / pd last authored on May 8, 1999)
217  #ifndef R_NSIZE
218  #define R_NSIZE 350000L
219  #endif
220  #ifndef R_VSIZE
221  #define R_VSIZE 6291456L
222  #endif
https://github.com/wch/r-source/blob/master/src/main/startup.c#L169
(ripley last authored on Jun 9, 2004)
157  Rp->vsize = R_VSIZE;
158  Rp->nsize = R_NSIZE;
166  #define Max_Nsize 50000000 /* about 1.4Gb 32-bit, 2.8Gb 64-bit */
167  #define Max_Vsize R_SIZE_T_MAX /* unlimited */
169  #define Min_Nsize 220000
170  #define Min_Vsize (1*Mega)
https://github.com/wch/r-source/blob/master/src/main/memory.c#L335
(luke last authored on Nov 1, 2000)
335  #ifdef SMALL_MEMORY
336  /* On machines with only 32M of memory (or on a classic Mac OS port)
337      it might be a good idea to use settings like these that are more
338      aggressive at keeping memory usage down. */
339  static double R_NGrowIncrFrac = 0.0, R_NShrinkIncrFrac = 0.2;
340  static int R_NGrowIncrMin = 50000, R_NShrinkIncrMin = 0;
341  static double R_VGrowIncrFrac = 0.0, R_VShrinkIncrFrac = 0.2;
342  static int R_VGrowIncrMin = 100000, R_VShrinkIncrMin = 0;
343  #else
344  static double R_NGrowIncrFrac = 0.05, R_NShrinkIncrFrac = 0.2;
345  static int R_NGrowIncrMin = 40000, R_NShrinkIncrMin = 0;
346  static double R_VGrowIncrFrac = 0.05, R_VShrinkIncrFrac = 0.2;
347  static int R_VGrowIncrMin = 80000, R_VShrinkIncrMin = 0;
348  #endif
static void AdjustHeapSize(R_size_t size_needed)
{
    R_size_t R_MinNFree = (R_size_t)(orig_R_NSize * R_MinFreeFrac);
    R_size_t R_MinVFree = (R_size_t)(orig_R_VSize * R_MinFreeFrac);
    R_size_t NNeeded = R_NodesInUse + R_MinNFree;
    R_size_t VNeeded = R_SmallVallocSize + R_LargeVallocSize +
        size_needed + R_MinVFree;
    double node_occup = ((double) NNeeded) / R_NSize;
    double vect_occup = ((double) VNeeded) / R_VSize;
    if (node_occup > R_NGrowFrac) {
        R_size_t change = (R_size_t)(R_NGrowIncrMin +
                                     R_NGrowIncrFrac * R_NSize);
        if (R_MaxNSize >= R_NSize + change)
           R_NSize += change;
    }
    else if (node_occup < R_NShrinkFrac) {
        R_NSize -= (R_NShrinkIncrMin + R_NShrinkIncrFrac * R_NSize);
        if (R_NSize < NNeeded)
             R_NSize = (NNeeded < R_MaxNSize) ? NNeeded: R_MaxNSize;
        if (R_NSize < orig_R_NSize)
             R_NSize = orig_R_NSize;
     }
    if (vect_occup > 1.0 && VNeeded < R_MaxVSize)
        R_VSize = VNeeded;
    if (vect_occup > R_VGrowFrac) {
        R_size_t change = (R_size_t)(R_VGrowIncrMin +
                                     R_VGrowIncrFrac * R_VSize);
        if (R_MaxVSize - R_VSize >= change)
             R_VSize += change;
    }
    else if (vect_occup < R_VShrinkFrac) {
        R_VSize -= R_VShrinkIncrMin + R_VShrinkIncrFrac * R_VSize;
        if (R_VSize < VNeeded)
           R_VSize = VNeeded;
        if (R_VSize < orig_R_VSize)
           R_VSize = orig_R_VSize;
    }
    DEBUG_ADJUST_HEAP_PRINT(node_occup, vect_occup);
}
Rp->nsize is overridden at startup by the environment variable R_NSIZE
if Min_Nsize <= $R_NSIZE <= Max_Nsize.  Rp->vsize is overridden at
startup by the environment variable R_VSIZE if Min_Vsize <= $R_VSIZE
<= Max_Vsize.  These are then used to set the global variables R_NSize
and R_VSize with R_SetMaxVSize(Rp->max_vsize).
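The override rule above amounts to "use the environment value only if it parses and falls within the allowed range".  A minimal sketch of that check follows.  It is not the code in startup.c: the helper names size_from_text and size_from_env are hypothetical, and the parser accepts only plain decimal digits, with no unit suffixes.

```c
#include <assert.h>
#include <errno.h>
#include <stdlib.h>

/* Parse `text` as a plain decimal size.  Return it if it lies in
   [min, max]; otherwise return `fallback`.  Hypothetical helper that
   deliberately rejects signs, whitespace, and unit suffixes. */
static unsigned long size_from_text(const char *text, unsigned long min,
                                    unsigned long max, unsigned long fallback)
{
    assert(min <= max);
    if (text == NULL || text[0] < '0' || text[0] > '9')
        return fallback;            /* unset, empty, or not a number */

    char *end;
    errno = 0;
    unsigned long value = strtoul(text, &end, 10);
    if (errno != 0 || *end != '\0')
        return fallback;            /* overflow or trailing characters */
    if (value < min || value > max)
        return fallback;            /* out of range: keep the default */
    return value;
}

/* Apply the same check to an environment variable. */
static unsigned long size_from_env(const char *name, unsigned long min,
                                   unsigned long max, unsigned long fallback)
{
    return size_from_text(getenv(name), min, max, fallback);
}
```

With the constants quoted above, size_from_env("R_NSIZE", 220000, 50000000, 350000) would mimic the node-heap case: a value of 100 is ignored because it is below Min_Nsize, and the compiled-in default is kept.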