Andreas Kersting
2021-Oct-07 07:26 UTC
[Rd] GC: parallelizing the CHARSXP cache maintenance
Hi all,

As part of RunGenCollect() (in src/main/memory.c), some maintenance on the CHARSXP cache is done: unmarked nodes/CHARSXPs are removed from the hash chains. This always requires touching every CHARSXP in the cache, irrespective of the number of generations which were just garbage collected. In a session with a big CHARSXP cache, this significantly slows down gc even when only the youngest generation is collected. However, this part of RunGenCollect() seems to be one of the few which can easily be parallelized without the need for thread synchronization, and it also seems to be the part that profits most from parallelization.

The attached patch (parallel_CHARSXP_cache.diff) implements parallelization over the elements of R_StringHash and gives the following performance improvements on my system when using 4 threads, compared to R devel (revision 81008):

Elapsed time for 200 non-full gc in a session after

  x <- as.character(runif(1e6))[]
  gc(full = TRUE)

8 sec -> 2.5 sec

Elapsed time for five non-full gc in a session after

  x <- as.character(runif(5e7))[]
  gc(full = TRUE)

21 sec -> 6 sec

In the patch, I dropped the two lines

  FORWARD_NODE(s);
  FORWARD_NODE(CXHEAD(s));

because they are currently both no-ops (and would require synchronization if they were not). They are no-ops because we have

  # define CXHEAD(x) (x)

in Defn.h, and hence FORWARD_NODE(s)/FORWARD_NODE(CXHEAD(s)) is only called when s is already marked, in which case FORWARD_NODE() does nothing.

I used OpenMP despite the known issues some of its implementations have with hanging after a fork, mostly because it was the easiest thing to do for a PoC. I worked around this, similarly to e.g. data.table, by using only one thread in forked children. It might also be worth making the parallelization conditional on the size of the CHARSXP cache and using only the main thread while the cache is (still) small. (For readers without the attachments, a rough self-contained sketch of the parallel pruning and of the fork workaround is included below.)

In the second attached patch (parallel_CHARSXP_cache_no_forward.diff) I additionally no longer call FORWARD_NODE(R_StringHash), because that makes the following call to PROCESS_NODES() iterate through all elements of R_StringHash again, which is unnecessary since all elements are either R_NilValue or an already marked CHARSXP. Instead, I directly mark & snap R_StringHash. In contrast to the parallelization, this only affects full gcs, since R_StringHash will quickly belong to the oldest generation.

The attached gc_test.R is the script I used to obtain the timings above and more.

To me this looks like a significant performance improvement, especially given the small changeset. What do you think?

Best regards,
Andreas
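
For illustration, here is a minimal self-contained toy of the parallelization idea. It is not the patch itself: the node layout, the bucket count and the marking scheme are invented for the example, whereas the real code works on R_StringHash and the CHARSXP mark bits. What it demonstrates is the property the patch relies on: every hash bucket owns its own chain, so unmarked nodes can be unlinked from the chains on different OpenMP threads without any locking.

    /* toy sketch only -- compile with -fopenmp; names and layout are
       invented and do not come from the patch */
    #include <stdio.h>
    #include <stdlib.h>

    typedef struct node {
        int marked;           /* stands in for the GC mark bit */
        struct node *next;    /* stands in for the hash-chain link */
    } node;

    #define NBUCKETS 4096
    static node *buckets[NBUCKETS];

    static void prune_chains(void)
    {
        /* each iteration touches only its own bucket's chain, so no
           synchronization between threads is required */
    #pragma omp parallel for schedule(static)
        for (int i = 0; i < NBUCKETS; i++) {
            node **link = &buckets[i];
            while (*link != NULL) {
                node *s = *link;
                if (!s->marked) {
                    *link = s->next;   /* unlink the unmarked node */
                    free(s);
                } else
                    link = &s->next;
            }
        }
    }

    int main(void)
    {
        /* fill each bucket with a few nodes, marking every other one */
        for (int i = 0; i < NBUCKETS; i++)
            for (int j = 0; j < 8; j++) {
                node *s = malloc(sizeof *s);
                s->marked = j % 2;
                s->next = buckets[i];
                buckets[i] = s;
            }
        prune_chains();

        long survivors = 0;
        for (int i = 0; i < NBUCKETS; i++)
            for (node *s = buckets[i]; s != NULL; s = s->next)
                survivors++;
        printf("surviving nodes: %ld\n", survivors);  /* expect NBUCKETS * 4 */
        return 0;
    }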
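
And a minimal sketch of the fork workaround mentioned above. The handler and the thread-count variable are hypothetical and the patch may well implement this differently; the idea, as in data.table, is that a pthread_atfork() child handler drops the number of threads used to one in forked children, so the OpenMP region is never entered with multiple threads after a fork.

    /* sketch only -- gc_nthreads and after_fork_child are made-up names */
    #include <pthread.h>
    #include <unistd.h>
    #include <sys/wait.h>
    #include <stdio.h>

    static int gc_nthreads = 4;        /* hypothetical thread-count setting */

    static void after_fork_child(void) {
        gc_nthreads = 1;               /* forked children stay single-threaded */
    }

    int main(void) {
        pthread_atfork(NULL, NULL, after_fork_child);
        if (fork() == 0) {             /* child: handler has already run */
            printf("child threads: %d\n", gc_nthreads);   /* prints 1 */
            _exit(0);
        }
        wait(NULL);
        printf("parent threads: %d\n", gc_nthreads);      /* prints 4 */
        return 0;
    }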