luke-tierney at uiowa.edu
2020-Dec-13  03:26 UTC
[Rd] [External] R crashes when using huge data sets with character string variables
If R is receiving a kill signal there is nothing it can do about it.

I am guessing you are running into a memory over-commit issue in your OS.
https://en.wikipedia.org/wiki/Memory_overcommitment
https://engineering.pivotal.io/post/virtual_memory_settings_in_linux_-_the_problem_with_overcommit/

If you have to run this close to your physical memory limits you might
try using your shell's facility (ulimit for bash, limit for some
others) to limit process memory/virtual memory use to your available
physical memory. You can also try setting the R_MAX_VSIZE environment
variable mentioned in ?Memory; that only affects the R heap, not
malloc() done elsewhere.

Best,

luke

On Sat, 12 Dec 2020, Arne Henningsen wrote:

> When working with a huge data set with character string variables, I
> experienced that various commands let R crash. When I run R in a
> Linux/bash console, R terminates with the message "Killed". When I use
> RStudio, I get the message "R Session Aborted. R encountered a fatal
> error. The session was terminated. Start New Session". If an object in
> the R workspace needs too much memory, I would expect that R would not
> crash but issue an error message "Error: cannot allocate vector of
> size ...". A minimal reproducible example (at least on my computer)
> is:
>
> nObs <- 1e9
>
> date <- paste( round( runif( nObs, 1981, 2015 ) ), round( runif( nObs,
> 1, 12 ) ), round( runif( nObs, 1, 31 ) ), sep = "-" )
>
> Is this a bug or a feature of R?
>
> Some information about my R version, OS, etc:
>
> R> sessionInfo()
> R version 4.0.3 (2020-10-10)
> Platform: x86_64-pc-linux-gnu (64-bit)
> Running under: Ubuntu 20.04.1 LTS
>
> Matrix products: default
> BLAS:   /usr/lib/x86_64-linux-gnu/blas/libblas.so.3.9.0
> LAPACK: /usr/lib/x86_64-linux-gnu/lapack/liblapack.so.3.9.0
>
> locale:
>  [1] LC_CTYPE=en_DK.UTF-8       LC_NUMERIC=C
>  [3] LC_TIME=en_DK.UTF-8        LC_COLLATE=en_DK.UTF-8
>  [5] LC_MONETARY=en_DK.UTF-8    LC_MESSAGES=en_DK.UTF-8
>  [7] LC_PAPER=en_DK.UTF-8       LC_NAME=C
>  [9] LC_ADDRESS=C               LC_TELEPHONE=C
> [11] LC_MEASUREMENT=en_DK.UTF-8 LC_IDENTIFICATION=C
>
> attached base packages:
> [1] stats     graphics  grDevices utils     datasets  methods   base
>
> loaded via a namespace (and not attached):
> [1] compiler_4.0.3
>
> /Arne
>
>

-- 
Luke Tierney
Ralph E. Wareham Professor of Mathematical Sciences
University of Iowa                  Phone:  319-335-3386
Department of Statistics and        Fax:    319-335-3017
   Actuarial Science
241 Schaeffer Hall                  email:  luke-tierney at uiowa.edu
Iowa City, IA 52242                 WWW:    http://www.stat.uiowa.edu
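As a rough illustration of the suggestion in this message, the following shell sketch caps a command's virtual memory with `ulimit -v` and sets `R_MAX_VSIZE` to the same amount. The helper names (`mem_total_kib`, `run_capped`), the script name in the usage comment, and the `M` suffix on the `R_MAX_VSIZE` value are assumptions made for illustration and do not come from the thread; `?Memory` documents the accepted formats.

```shell
#!/bin/sh
# Sketch only: helper names are hypothetical, not part of R or the thread.

# Print physical memory in KiB, read from a meminfo-style file
# (defaults to /proc/meminfo on Linux).
mem_total_kib() {
    awk '/^MemTotal:/ { print $2 }' "${1:-/proc/meminfo}"
}

# Run a command in a subshell whose virtual memory is capped at the given
# number of KiB, so the cap does not affect the calling shell.
# R_MAX_VSIZE is set to the same amount; the "M" suffix is an assumption.
run_capped() {
    (
        cap_kib=$1
        shift
        ulimit -v "$cap_kib" || exit 1
        R_MAX_VSIZE="$((cap_kib / 1024))M" "$@"
    )
}

# Usage (assumes Rscript is installed and big_paste.R is your own script):
#   run_capped "$(mem_total_kib)" Rscript big_paste.R
```

With a cap in place, an oversized allocation should fail inside the process, where R can report an error, instead of the process being killed from outside. As the message notes, `R_MAX_VSIZE` alone covers only the R heap.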
Dirk Eddelbuettel
2020-Dec-13  04:17 UTC
[Rd] [External] R crashes when using huge data sets with character string variables
On 12 December 2020 at 21:26, luke-tierney at uiowa.edu wrote:
| If R is receiving a kill signal there is nothing it can do about it.
|
| I am guessing you are running into a memory over-commit issue in your OS.
| https://en.wikipedia.org/wiki/Memory_overcommitment
| https://engineering.pivotal.io/post/virtual_memory_settings_in_linux_-_the_problem_with_overcommit/
|
| If you have to run this close to your physical memory limits you might
| try using your shell's facility (ulimit for bash, limit for some
| others) to limit process memory/virtual memory use to your available
| physical memory. You can also try setting the R_MAX_VSIZE environment
| variable mentioned in ?Memory; that only affects the R heap, not
| malloc() done elsewhere.

Similarly, as it is Linux, you could (easily) add virtual memory via a
swapfile (see 'man 8 swapfile' and 'man 8 swapon'). But even then, I expect
this to be slow -- 1e9 is a lot.

I have 32 GB of RAM and ample swap (which is rarely used, but a safety net).
When I use your code with nObs <- 1e8 it ends up with about 6 GB, which poses
no problem, but already takes 3 1/2 minutes:

> nObs <- 1e8
> system.time(date <- paste( round( runif( nObs, 1981, 2015 ) ), round( runif( nObs, 1, 12 ) ), round( runif( nObs, 1, 31 ) ), sep = "-" ))
   user  system elapsed
203.723   1.779 205.528
>

You may want to play with the nObs value to see exactly where it breaks on
your box.

Dirk

-- 
https://dirk.eddelbuettel.com | @eddelbuettel | edd at debian.org
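The measurement in this message (about 6 GB for 1e8 elements) works out to roughly 60 bytes per string, which allows a back-of-the-envelope extrapolation. The sketch below assumes memory use grows linearly with nObs, which the thread does not confirm, and the function names are hypothetical.

```shell
#!/bin/sh
# Rough sketch: assumes memory grows linearly with nObs, based on the
# reported figure of about 6 GB for 1e8 elements (about 60 bytes each).
BYTES_PER_ELEMENT=60

# Estimated footprint in whole GB (10^9 bytes) for a given element count.
estimate_gb() {
    echo $(( $1 * BYTES_PER_ELEMENT / 1000000000 ))
}

# Succeeds if the (truncated) estimate fits in the given amount of RAM in GB.
fits_in_ram() {
    [ "$(estimate_gb "$1")" -le "$2" ]
}

estimate_gb 100000000     # the 1e8 case timed above
estimate_gb 1000000000    # the 1e9 case from the original report
```

Under that assumption the 1e9 case would need on the order of 60 GB, well beyond 32 GB of RAM, so stepping nObs up gradually is a reasonable way to find the breaking point on a given machine.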
Iñaki Ucar
2020-Dec-13  10:17 UTC
[Rd] [External] R crashes when using huge data sets with character string variables
On Sun, 13 Dec 2020 at 04:27, <luke-tierney at uiowa.edu> wrote:
>
> If R is receiving a kill signal there is nothing it can do about it.
>
> I am guessing you are running into a memory over-commit issue in your OS.
> https://en.wikipedia.org/wiki/Memory_overcommitment
> https://engineering.pivotal.io/post/virtual_memory_settings_in_linux_-_the_problem_with_overcommit/

Correct. And in particular, this is most probably the earlyoom [1]
service in action, which, I believe, is installed and enabled by
default in Ubuntu 20.04. It is a simple daemon that monitors memory,
and when some conditions are reached (e.g., the system is about to
start swapping), it looks for offending processes and kills them.

[1] https://github.com/rfjakob/earlyoom

Iñaki

> If you have to run this close to your physical memory limits you might
> try using your shell's facility (ulimit for bash, limit for some
> others) to limit process memory/virtual memory use to your available
> physical memory. You can also try setting the R_MAX_VSIZE environment
> variable mentioned in ?Memory; that only affects the R heap, not
> malloc() done elsewhere.
>
> Best,
>
> luke
>
> On Sat, 12 Dec 2020, Arne Henningsen wrote:
>
> > When working with a huge data set with character string variables, I
> > experienced that various commands let R crash. When I run R in a
> > Linux/bash console, R terminates with the message "Killed". When I use
> > RStudio, I get the message "R Session Aborted. R encountered a fatal
> > error. The session was terminated. Start New Session". If an object in
> > the R workspace needs too much memory, I would expect that R would not
> > crash but issue an error message "Error: cannot allocate vector of
> > size ...". A minimal reproducible example (at least on my computer)
> > is:
> >
> > nObs <- 1e9
> >
> > date <- paste( round( runif( nObs, 1981, 2015 ) ), round( runif( nObs,
> > 1, 12 ) ), round( runif( nObs, 1, 31 ) ), sep = "-" )
> >
> > Is this a bug or a feature of R?
> >
> > Some information about my R version, OS, etc:
> >
> > R> sessionInfo()
> > R version 4.0.3 (2020-10-10)
> > Platform: x86_64-pc-linux-gnu (64-bit)
> > Running under: Ubuntu 20.04.1 LTS
> >
> > Matrix products: default
> > BLAS:   /usr/lib/x86_64-linux-gnu/blas/libblas.so.3.9.0
> > LAPACK: /usr/lib/x86_64-linux-gnu/lapack/liblapack.so.3.9.0
> >
> > locale:
> >  [1] LC_CTYPE=en_DK.UTF-8       LC_NUMERIC=C
> >  [3] LC_TIME=en_DK.UTF-8        LC_COLLATE=en_DK.UTF-8
> >  [5] LC_MONETARY=en_DK.UTF-8    LC_MESSAGES=en_DK.UTF-8
> >  [7] LC_PAPER=en_DK.UTF-8       LC_NAME=C
> >  [9] LC_ADDRESS=C               LC_TELEPHONE=C
> > [11] LC_MEASUREMENT=en_DK.UTF-8 LC_IDENTIFICATION=C
> >
> > attached base packages:
> > [1] stats     graphics  grDevices utils     datasets  methods   base
> >
> > loaded via a namespace (and not attached):
> > [1] compiler_4.0.3
> >
> > /Arne
> >
> >
>
> --
> Luke Tierney
> Ralph E. Wareham Professor of Mathematical Sciences
> University of Iowa                  Phone:  319-335-3386
> Department of Statistics and        Fax:    319-335-3017
>    Actuarial Science
> 241 Schaeffer Hall                  email:  luke-tierney at uiowa.edu
> Iowa City, IA 52242                 WWW:    http://www.stat.uiowa.edu
>
> ______________________________________________
> R-devel at r-project.org mailing list
> https://stat.ethz.ch/mailman/listinfo/r-devel

-- 
Iñaki Úcar
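To check whether a session was ended by a kill signal of the kind discussed in this thread, one option is to inspect the exit status: a POSIX shell reports a child terminated by SIGKILL with status 128 + 9 = 137. The sketch below shows that check together with a query for an earlyoom service. The function names are hypothetical, and the unit name `earlyoom` is an assumption based on the project name.

```shell
#!/bin/sh
# Sketch only: function names are hypothetical.

# A process terminated by SIGKILL (signal 9) is reported by the shell with
# exit status 128 + 9 = 137; the shell may also print "Killed".
was_sigkilled() {
    [ "$1" -eq 137 ]
}

# Report whether a unit named "earlyoom" (assumed name) is active,
# if systemctl is available on this machine.
earlyoom_status() {
    if ! command -v systemctl >/dev/null 2>&1; then
        echo "unknown (no systemctl)"
    elif systemctl is-active --quiet earlyoom 2>/dev/null; then
        echo "active"
    else
        echo "not active"
    fi
}

# Demonstration with a child shell that sends SIGKILL to itself,
# standing in for an R session killed by the OS or a daemon.
status=0
sh -c 'kill -9 $$' 2>/dev/null || status=$?
if was_sigkilled "$status"; then
    echo "child was killed by SIGKILL; earlyoom: $(earlyoom_status)"
fi
```

A status of 137 only shows that the process received SIGKILL; it does not say whether the kernel's OOM killer, earlyoom, or something else sent it.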