Jocelyn Ireson-Paine
2019-Oct-17 09:28 UTC
[R] Surprisingly large amount of memory used by tibble with lots of nested tibbles within
I'm using the Tidyverse group_nest() function to nest data about families and people within households, and have found that this seems to use astonishing quantities of memory, more than I'd expect from the number of nested tibbles created. I'll outline what was happening with my actual data, then show a reproducible example.

My data comes from the British Labour Force Survey. It's a flat file, storable as CSV, representing households. Each record represents a person, with variables such as age, sex, income, health and employment. Records have a household ID, which groups them into households, and a family ID, which groups them into families within a household. A fair number of households have more than one family, and many families have more than one person. So the data has a hierarchical structure of people within families within households.

I need to process it at all three levels. Some benefit calculations, for example, need doing per person; but I also need to aggregate over families and households. It seemed obvious that using group_nest() would make this easy. Indeed it does, but at the expense of memory. Whereas one file of 4000 households and about 40 variables occupies 1.44 MB unnested, it blows up to 19.8 MB once double-nested so that people are within families within households.

Also surprising is that the nesting makes saveRDS() much slower. Saving the unnested data is almost instantaneous; saving the nested data takes over 10 minutes. Reading it back is also slow.

I'll now show my reproducible example. For this, I created a 10,000-row tibble with 15 data columns, all generated by runif(). I then added a grouping column to indicate which rows could be regarded as in the same group. This can be varied, so I can have every row in its own group, or all rows in the same group, or somewhere in between. I then called group_nest() on this and looked at the memory used by the result.
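For reference, the double nesting described for the household data can be sketched roughly as below. This is only an illustration on toy data, not the survey code itself: the column names household_id, family_id, age and income are made up, and it assumes dplyr 0.8.0 or later, where group_nest() accepts grouping columns and a .key argument.

```r
library(dplyr)   # provides group_nest() and %>%
library(tibble)

# Toy stand-in for the survey file: one row per person.
# All column names here are hypothetical.
people <- tibble(
  household_id = c(1, 1, 1, 2, 2, 3),
  family_id    = c(1, 1, 2, 1, 1, 1),
  age          = c(41, 39, 70, 28, 3, 55),
  income       = c(300, 250, 120, 410, 0, 380)
)

# First level: one row per family, with its people in a list-column.
families <- people %>%
  group_nest(household_id, family_id, .key = "people")

# Second level: one row per household, with its families in a list-column.
households <- families %>%
  group_nest(household_id, .key = "families")

print(households)

# object.size() is only an estimate, but it shows the direction of the change.
print(object.size(people))
print(object.size(households))
```

Each household row then holds a tibble of families, and each family row holds a tibble of people, so the number of small tibbles grows with the number of families as well as households.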
Actually, I did this inside a function, and called it with different numbers of groups, to see how memory usage varied with number of nested tibbles. Each row's group ID was created by remaindering (via %%) on its row number.

First, my source:

library( tidyverse )
library( pryr )
library( lobstr )
library( glue )
library( microbenchmark )
library( assertthat )

investigate_nesting_effect <- function( len, ngroups )
{
  t <- tibble( id=1:len
             , a=runif(len), b=runif(len), c=runif(len), d=runif(len), e=runif(len)
             , f=runif(len), g=runif(len), h=runif(len), i=runif(len), j=runif(len)
             , k=runif(len), l=runif(len), m=runif(len), n=runif(len), o=runif(len)
             )
  t $ group_id <- 1:len %% ngroups

  tg <- t %>% group_by( group_id )

  tgn <- tg %>% group_nest( keep=FALSE )

  tgnun <- tgn %>% unnest()
  assert_that( are_equal( t, tgnun, tol=0.001 ) )

  print( glue( "Length={len}, ngroups={ngroups}, nrow={nrow(tgn)}, mem={object_size( tgn )}" ) )

  # res <- microbenchmark( saveRDS( tgn, str_c( "data/tgn_", ngroups ) )
  #                      , times=5
  #                      )
  #
  # print( res )

  tgn
}

for ( ngroups in c( 1, 3, 10, 30, 100, 300, 1000, 3000, 10000 ) ) {
  investigate_nesting_effect( 10000, ngroups )
}

In this, ngroups is the number of groups to create. The first time round the loop, all rows end up nested within one tibble; the final time, each row is nested within its own tibble.

In the function, t is the original unnested tibble; tg is it grouped; tgn is it nested. tgnun is it unnested again. tgnun sanity-checks my code by asserting that nesting and then unnesting gives the original.
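To make the grouping step concrete, here is a small base-R sketch of what the remaindering does. The helper name group_ids is introduced only for this illustration; it repeats the expression 1:len %% ngroups from the function above.

```r
len <- 10000

# Same construction as in the function: row number modulo ngroups.
# group_ids is a hypothetical helper, used only for this illustration.
group_ids <- function(len, ngroups) 1:len %% ngroups

# Number of distinct groups and the spread of group sizes for a few settings.
for (ngroups in c(1, 3, 1000, 10000)) {
  sizes <- table(group_ids(len, ngroups))
  cat("ngroups =", ngroups,
      " distinct =", length(sizes),
      " smallest =", min(sizes),
      " largest =", max(sizes), "\n")
}
```

The group sizes differ by at most one row, so each nested tibble holds about len/ngroups rows, and the group IDs run from 0 to ngroups-1.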
Now my results:

Length=10000, ngroups=1, nrow=1, mem=1,244,528
Length=10000, ngroups=3, nrow=3, mem=1,246,920
Length=10000, ngroups=10, nrow=10, mem=1,255,280
Length=10000, ngroups=30, nrow=30, mem=1,278,944
Length=10000, ngroups=100, nrow=100, mem=1,361,744
Length=10000, ngroups=300, nrow=300, mem=1,599,344
Length=10000, ngroups=1000, nrow=1000, mem=3,155,344
Length=10000, ngroups=3000, nrow=3000, mem=5,043,344
Length=10000, ngroups=10000, nrow=10000, mem=13,123,344

I've manually inserted commas into the memory figures (which are in bytes) to make them easier to read. Note that nesting every one of those 10,000 rows into its own tibble adds about 12 MB to the original. So that's very roughly 1K added per nested tibble, which seems a lot.

The code also contains some commented-out timings on saveRDS(). These didn't show the same time blow-up that I experienced with my household data, so I still need to replicate that reproducibly. The slow saving is equally annoying, as it means users have to wait so long for data to be loaded.

Any thoughts would be welcome, including faults in my code above.
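One rough way to sanity-check the figure of about 1K per nested tibble is to build a single one-row tibble of the same shape and compare its size with the raw data it carries. The sketch below is an approximation under stated assumptions: it uses base object.size(), which may report slightly different numbers from pryr::object_size() or lobstr::obj_size(), and it stores id as a double rather than an integer.

```r
library(tibble)

# A one-row tibble with 16 numeric columns, shaped like one nested group
# when every row is in its own group (id plus a..o).
cols <- c("id", letters[1:15])
one_row <- as_tibble(setNames(as.list(runif(16)), cols))

payload_bytes <- 8 * ncol(one_row)                 # raw data: 16 doubles
total_bytes   <- as.numeric(object.size(one_row))  # estimate incl. headers, names, class

cat("payload bytes: ", payload_bytes, "\n")
cat("object bytes:  ", total_bytes, "\n")
cat("overhead bytes:", total_bytes - payload_bytes, "\n")

# Per-group cost inferred from the first and last rows of the results above.
per_group <- (13123344 - 1244528) / 10000
cat("reported bytes per nested tibble:", per_group, "\n")
```

If the one-row object comes out in the same range as the per-group figure, that would suggest the cost is mostly the fixed overhead of each small tibble (one vector header per column, plus names, row names and class attributes) rather than anything specific to group_nest().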
For what it's worth, here's my sessionInfo():

R version 3.6.1 (2019-07-05)
Platform: x86_64-w64-mingw32/x64 (64-bit)
Running under: Windows 10 x64 (build 16299)

Matrix products: default

locale:
[1] LC_COLLATE=English_United Kingdom.1252
[2] LC_CTYPE=English_United Kingdom.1252
[3] LC_MONETARY=English_United Kingdom.1252
[4] LC_NUMERIC=C
[5] LC_TIME=English_United Kingdom.1252

attached base packages:
[1] parallel  stats     graphics  grDevices utils     datasets  methods
[8] base

other attached packages:
 [1] scales_1.0.0        htmlTable_1.13.1    ggrepel_0.8.1
 [4] glue_1.3.1          magrittr_1.5        igraph_1.2.4.1
 [7] yaml_2.2.0          haven_2.1.1         pryr_0.1.4
[10] readxl_1.3.1        fs_1.3.1            memo_1.0.1
[13] Biobase_2.44.0      BiocGenerics_0.30.0 lubridate_1.7.4
[16] DT_0.7              shinyjs_1.0         shinyWidgets_0.4.8
[19] shiny_1.3.2         assertthat_0.2.1    forcats_0.4.0
[22] stringr_1.4.0       dplyr_0.8.3         purrr_0.3.2
[25] readr_1.3.1         tidyr_0.8.3         tibble_2.1.3
[28] ggplot2_3.2.0       tidyverse_1.2.1     conflicted_1.0.4
[31] BiocManager_1.30.4

loaded via a namespace (and not attached):
 [1] nlme_3.1-140      usethis_1.5.1     devtools_2.1.0    httr_1.4.0
 [5] rprojroot_1.3-2   tools_3.6.1       backports_1.1.4   R6_2.4.0
 [9] lazyeval_0.2.2    colorspace_1.4-1  withr_2.1.2       tidyselect_0.2.5
[13] prettyunits_1.0.2 processx_3.4.0    curl_3.3          compiler_3.6.1
[17] cli_1.1.0         rvest_0.3.4       xml2_1.2.0        desc_1.2.0
[21] checkmate_1.9.4   callr_3.3.0       digest_0.6.20     pkgconfig_2.0.2
[25] htmltools_0.3.6   sessioninfo_1.1.1 htmlwidgets_1.3   rlang_0.4.0
[29] rstudioapi_0.10   generics_0.0.2    jsonlite_1.6      crosstalk_1.0.0
[33] Rcpp_1.0.1        munsell_0.5.0     stringi_1.4.3     pkgbuild_1.0.3
[37] grid_3.6.1        promises_1.0.1    crayon_1.3.4      lattice_0.20-38
[41] hms_0.5.0         zeallot_0.1.0     knitr_1.23        ps_1.3.0
[45] pillar_1.4.2      codetools_0.2-16  pkgload_1.0.2     remotes_2.1.0
[49] modelr_0.1.4      vctrs_0.2.0       httpuv_1.5.1      testthat_2.1.1
[53] cellranger_1.1.0  gtable_0.3.0      xfun_0.8          mime_0.7
[57] xtable_1.8-4      broom_0.5.2       later_0.8.0       memoise_1.1.0

-Joc-