Henrik Bengtsson
2019-Mar-18 01:23 UTC
[Rd] SUGGESTION: Proposal to mitigate problem with stray processes left behind by parallel::makeCluster()
(Bcc: CRAN)

This is a proposal to help CRAN and the like, as well as individual developers, avoid stray R processes being left behind when an example or a package test fails to set up a parallel::makeCluster().

ISSUE

If a package test sets up a PSOCK cluster and the master process then dies for one reason or another, the PSOCK worker processes will remain running for 30 days ('timeout') until they time out and terminate that way. When this happens on CRAN servers, where many packages are checked all the time, this will result in a lot of stray R processes.

Here is an example illustrating how R leaves behind stray R processes if it fails to establish a connection to one or more background R processes launched by 'parallel::makeCluster()'. First, let's make sure there are no other R processes running:

  $ ps aux | grep -E "exec[/]R"

Then, let's create a PSOCK cluster for which the connection will fail (because port 80 is reserved):

  $ Rscript -e 'parallel::makeCluster(1L, port=80)'
  Error in socketConnection("localhost", port = port, server = TRUE,
  blocking = TRUE, :
    cannot open the connection
  Calls: <Anonymous> ... makePSOCKcluster -> newPSOCKnode -> socketConnection
  In addition: Warning message:
  In socketConnection("localhost", port = port, server = TRUE,
  blocking = TRUE, :
    port 80 cannot be opened

The launched R worker is still running:

  $ ps aux | grep -E "exec[/]R"
  hb 20778 37.0 0.4 283092 70624 pts/0 S 17:50 0:00
  /usr/lib/R/bin/exec/R --slave --no-restore -e parallel:::.slaveRSOCK()
  --args MASTER=localhost PORT=80 OUT=/dev/null SETUPTIMEOUT=120
  TIMEOUT=2592000 XDR=TRUE

This process will keep running for 'TIMEOUT=2592000' seconds (= 30 days). The reason is that it is currently in the state where it attempts to set up a connection to the main R process:

  > parallel:::.slaveRSOCK
  function ()
  {
      makeSOCKmaster <- function(master, port, setup_timeout, timeout,
          useXDR) {
          ...
          repeat {
              con <- tryCatch({
                  socketConnection(master, port = port, blocking = TRUE,
                      open = "a+b", timeout = timeout)
              }, error = identity)
              ...
      }

In other words, it is stuck in 'socketConnection()' and it won't time out until 'timeout' seconds have passed.

SUGGESTION

To mitigate the problem of such stray processes being left behind by 'R CMD check', we could shorten the 'timeout', which is currently hardcoded to 30 days (src/library/parallel/R/snow.R). By making it possible to control the default via environment variables, e.g.

  setup_timeout = as.numeric(Sys.getenv("R_PARALLEL_SETUP_TIMEOUT", 60 * 2)), # 2 minutes
  timeout = as.numeric(Sys.getenv("R_PARALLEL_SETUP_TIMEOUT", 60 * 60 * 24 * 30)), # 30 days

it would be straightforward to adjust `R CMD check` to use, say,

  R_PARALLEL_SETUP_TIMEOUT=60

by default. This would cause any stray processes to time out after 60 seconds (instead of 30 days as now).

/Henrik
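As a rough illustration of the environment-variable idea in the proposal above, the lookup could be wrapped in a small helper. This is only a sketch: 'timeout_from_env()' is a hypothetical name and not part of the 'parallel' package, and the fallback to the default for unset or non-numeric values is an assumption about what a robust version might do.

```r
# Hypothetical helper (not part of 'parallel'): read a timeout in seconds
# from an environment variable, falling back to 'default' when the variable
# is unset, not numeric, or negative.
timeout_from_env <- function(name, default) {
  value <- suppressWarnings(as.numeric(Sys.getenv(name, unset = NA)))
  if (is.na(value) || value < 0) default else value
}

# Variable unset: the 2-minute default applies.
Sys.unsetenv("R_PARALLEL_SETUP_TIMEOUT")
timeout_from_env("R_PARALLEL_SETUP_TIMEOUT", 60 * 2)

# Variable set, e.g. by a check harness: the shorter value applies.
Sys.setenv(R_PARALLEL_SETUP_TIMEOUT = "60")
timeout_from_env("R_PARALLEL_SETUP_TIMEOUT", 60 * 2)
```

Falling back silently on a malformed value keeps a typo in the environment from breaking cluster setup; a real implementation might prefer to warn instead.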
Tomas Kalibera
2019-Mar-27 19:52 UTC
[Rd] SUGGESTION: Proposal to mitigate problem with stray processes left behind by parallel::makeCluster()
The problem causing the stray worker processes when the master fails to open a server socket to listen to connections from workers is not related to timeout in socketConnection(), because socketConnection() will fail right away. It is caused by a bug in checking the setup timeout (PR 17391).

Fixed in 76275.

Best
Tomas
Henrik Bengtsson
2019-Mar-28 04:20 UTC
[Rd] SUGGESTION: Proposal to mitigate problem with stray processes left behind by parallel::makeCluster()
Thank you Tomas. For the record, I'm confirming that the stray background R worker process now times out properly after 'setup_timeout' (= 120) seconds:

  {0s}$ Rscript -e 'parallel::makeCluster(1L, port=80)'
  Error in socketConnection("localhost", port = port, server = TRUE,
  blocking = TRUE, :
    cannot open the connection
  Calls: <Anonymous> ... makePSOCKcluster -> newPSOCKnode -> socketConnection
  In addition: Warning message:
  In socketConnection("localhost", port = port, server = TRUE,
  blocking = TRUE, :
    port 80 cannot be opened
  Execution halted

  {1s}$ ps aux | grep -E "exec[/]R"
  hb 17645 2.0 0.3 259104 55144 pts/5 S 20:58 0:00
  /home/hb/software/R-devel/trunk/lib/R/bin/exec/R --slave --no-restore
  -e parallel:::.slaveRSOCK() --args MASTER=localhost PORT=80
  OUT=/dev/null SETUPTIMEOUT=120 TIMEOUT=2592000 XDR=TRUE

  {2s}$ sleep 120

  {122s}$ ps aux | grep -E "exec[/]R"
  {122s}$

Good spotting of the bug:

  - if (Sys.time() - t0 > setup_timeout) break
  + if (difftime(Sys.time(), t0, units="secs") > setup_timeout) break

For those who find this thread, I think what's going on here is that 'setup_timeout = 120' is a numeric that is compared to a 'difftime' that keeps changing unit as time goes by. When compared as 'Sys.time() - t0 > setup_timeout', the LHS would be in units of seconds as long as less than 60 seconds had passed:

  > Sys.time() - t0
  Time difference of 59 secs
  > as.numeric(Sys.time() - t0)
  [1] 59

However, as soon as more than 60 seconds have passed, the unit turns into minutes and we're comparing minutes to seconds:

  > Sys.time() - t0
  Time difference of 1.016667 mins
  > as.numeric(Sys.time() - t0)
  [1] 1.016667

which is now compared to 'setup_timeout'. If the unit remained minutes, it would time out after 120 [minutes]. However, after 120 minutes, the unit of Sys.time() - t0 is hours, and we're comparing hours to seconds, and so on. It would only time out if we used 'setup_timeout' < 60 seconds.
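The unit switching described above can be reproduced without waiting by using fixed time stamps. This is a minimal sketch: the helper names 'timed_out_buggy()' and 'timed_out_fixed()' and the example date are made up for illustration and do not appear in the 'parallel' sources; only the two comparison expressions mirror the diff quoted above.

```r
setup_timeout <- 120  # seconds, matching SETUPTIMEOUT=120 above

# Hypothetical wrappers around the old and the new comparison.
timed_out_buggy <- function(now, t0, setup_timeout) {
  (now - t0) > setup_timeout  # difftime unit is picked automatically
}
timed_out_fixed <- function(now, t0, setup_timeout) {
  difftime(now, t0, units = "secs") > setup_timeout  # unit pinned to seconds
}

# Fixed, arbitrary start time; 'now' is 150 seconds later.
t0  <- as.POSIXct("2019-03-28 00:00:00", tz = "UTC")
now <- t0 + 150

units(now - t0)        # "mins", because more than 60 seconds have passed
as.numeric(now - t0)   # 2.5

timed_out_buggy(now, t0, setup_timeout)  # FALSE: compares 2.5 (mins) to 120
timed_out_fixed(now, t0, setup_timeout)  # TRUE: compares 150 (secs) to 120
```

Pinning 'units = "secs"' is what makes the left-hand side comparable to a timeout expressed in seconds, regardless of how much time has elapsed.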
/Henrik