Henrik Bengtsson
2018-Mar-09 03:05 UTC
[Rd] parallel:::newPSOCKnode(): background worker fails immediately if socket on master is not set up in time (BUG?)
BACKGROUND: While troubleshooting random, occasionally occurring, errors from parallel::makePSOCKcluster("localhost", port = 11000); Error in socketConnection("localhost", port = port, server = TRUE, blocking = TRUE, : cannot open the connection I had another look at parallel:::newPSOCKnode(), which is used internally to set up each background worker. It is designed to, first launch the background worker as: system('R --slave --no-restore -e "parallel:::.slaveRSOCK()" --args MASTER=localhost PORT=11000 OUT=/dev/null TIMEOUT=2592000 XDR=TRUE', wait = FALSE) which immediately tries to connect to a socket on localhost:11000 with timeout. Immediately after the master launched the above (without waiting), it will set up the connection waiting for the connect from the background worker: con <- socketConnection("localhost", port = 11000, server = TRUE, blocking = TRUE, open = "a+b", timeout = timeout) ISSUE: If we emulate the above process, and remove the OUT=/dev/null such that we can see the output produces by the worker, as: setup <- function(delay = 0) { system('R --slave --no-restore -e "parallel:::.slaveRSOCK()" --args MASTER=localhost PORT=11000 TIMEOUT=2592000 XDR=TRUE', wait = FALSE) Sys.sleep(delay) socketConnection("localhost", port = 11000, server = TRUE, blocking = TRUE, open = "a+b", timeout = 20) } doing:> con <- setup(0)starting worker pid=24983 on localhost:11000 at 18:44:30.087 will most likely work, but adding a delay:> con <- setup(5)starting worker pid=25099 on localhost:11000 at 18:45:23.617 Warning in socketConnection(master, port = port, blocking = TRUE, open = "a+b", : localhost:11000 cannot be opened Error in socketConnection(master, port = port, blocking = TRUE, open = "a+b", : cannot open the connection Calls: <Anonymous> ... doTryCatch -> recvData -> makeSOCKmaster -> socketConnection will produce an *instant* error on the worker, and before master opens the server socket. Eventually, master will produce the originally observed error: Error in socketConnection("localhost", port = 11000, server = TRUE, blocking = TRUE, : cannot open the connection In other words, if the master fails to setup socketConnection() *before* the background workers attempts to connect, it all fails. Such a delay may happen for instance when there is a large CPU load on the test machine. Is this the above bug? /Henrik PS. The background is that I, very occasionally, observe R CMD check error on the above (on CRAN and elsewhere) when testing my future package. The error always go away when retested. This far I've though this is due to port clashes (since the port is random selected in [11000:11999]) and accepted that it happens. However, after discovering the above, it could be due to the worker launching "too soon".> sessionInfo()R version 3.4.3 (2017-11-30) Platform: x86_64-pc-linux-gnu (64-bit) Running under: Ubuntu 16.04.4 LTS Matrix products: default BLAS: /usr/lib/atlas-base/atlas/libblas.so.3.0 LAPACK: /usr/lib/atlas-base/atlas/liblapack.so.3.0 locale: [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C [3] LC_TIME=en_US.UTF-8 LC_COLLATE=en_US.UTF-8 [5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8 [7] LC_PAPER=en_US.UTF-8 LC_NAME=C [9] LC_ADDRESS=C LC_TELEPHONE=C [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C attached base packages: [1] stats graphics grDevices utils datasets methods base loaded via a namespace (and not attached): [1] compiler_3.4.3
Henrik Bengtsson
2018-Mar-09 03:36 UTC
[Rd] parallel:::newPSOCKnode(): background worker fails immediately if socket on master is not set up in time (BUG?)
I just noticed that parallel:::.slaveRSOCK() passes 'timeout' to socketConnection() as a character, i.e. there's a missing timeout <- as.integer(timeout), cf. port <- as.integer(port) and useXDR <- as.logical(value):> parallel:::.slaveRSOCKfunction () { makeSOCKmaster <- function(master, port, timeout, useXDR) { port <- as.integer(port) con <- socketConnection(master, port = port, blocking = TRUE, open = "a+b", timeout = timeout) structure(list(con = con), class = if (useXDR) "SOCKnode" else "SOCK0node") } master <- "localhost" port <- NA_integer_ outfile <- Sys.getenv("R_SNOW_OUTFILE") methods <- TRUE useXDR <- TRUE for (a in commandArgs(TRUE)) { pos <- regexpr("=", a) name <- substr(a, 1L, pos - 1L) value <- substr(a, pos + 1L, nchar(a)) switch(name, MASTER = { master <- value }, PORT = { port <- value }, OUT = { outfile <- value }, TIMEOUT = { timeout <- value }, XDR = { useXDR <- as.logical(value) }) } if (is.na(port)) stop("PORT must be specified") sinkWorkerOutput(outfile) msg <- sprintf("starting worker pid=%d on %s at %s\n", Sys.getpid(), paste(master, port, sep = ":"), format(Sys.time(), "%H:%M:%OS3")) cat(msg) slaveLoop(makeSOCKmaster(master, port, timeout, useXDR)) } <bytecode: 0x4bd4b58> <environment: namespace:parallel> Yet, fix that does *not* seem to change anything. /Henrik On Thu, Mar 8, 2018 at 7:05 PM, Henrik Bengtsson <henrik.bengtsson at gmail.com> wrote:> BACKGROUND: > > While troubleshooting random, occasionally occurring, errors from > parallel::makePSOCKcluster("localhost", port = 11000); > > Error in socketConnection("localhost", port = port, server = TRUE, > blocking = TRUE, : > cannot open the connection > > I had another look at parallel:::newPSOCKnode(), which is used > internally to set up each background worker. It is designed to, first > launch the background worker as: > > system('R --slave --no-restore -e "parallel:::.slaveRSOCK()" --args > MASTER=localhost PORT=11000 OUT=/dev/null TIMEOUT=2592000 XDR=TRUE', > wait = FALSE) > > which immediately tries to connect to a socket on localhost:11000 with > timeout. Immediately after the master launched the above (without > waiting), it will set up the connection waiting for the connect from > the background worker: > > con <- socketConnection("localhost", port = 11000, server = TRUE, > blocking = TRUE, open = "a+b", timeout = timeout) > > > ISSUE: > > If we emulate the above process, and remove the OUT=/dev/null such > that we can see the output produces by the worker, as: > > setup <- function(delay = 0) { > system('R --slave --no-restore -e "parallel:::.slaveRSOCK()" --args > MASTER=localhost PORT=11000 TIMEOUT=2592000 XDR=TRUE', wait = FALSE) > Sys.sleep(delay) > socketConnection("localhost", port = 11000, server = TRUE, blocking > = TRUE, open = "a+b", timeout = 20) > } > > doing: > >> con <- setup(0) > starting worker pid=24983 on localhost:11000 at 18:44:30.087 > > will most likely work, but adding a delay: > >> con <- setup(5) > starting worker pid=25099 on localhost:11000 at 18:45:23.617 > Warning in socketConnection(master, port = port, blocking = TRUE, open > = "a+b", : > localhost:11000 cannot be opened > Error in socketConnection(master, port = port, blocking = TRUE, open = "a+b", : > cannot open the connection > Calls: <Anonymous> ... doTryCatch -> recvData -> makeSOCKmaster -> > socketConnection > > will produce an *instant* error on the worker, and before master opens > the server socket. Eventually, master will produce the originally > observed error: > > Error in socketConnection("localhost", port = 11000, server = TRUE, > blocking = TRUE, : > cannot open the connection > > In other words, if the master fails to setup socketConnection() > *before* the background workers attempts to connect, it all fails. > Such a delay may happen for instance when there is a large CPU load on > the test machine. > > Is this the above bug? > > /Henrik > > PS. The background is that I, very occasionally, observe R CMD check > error on the above (on CRAN and elsewhere) when testing my future > package. The error always go away when retested. This far I've though > this is due to port clashes (since the port is random selected in > [11000:11999]) and accepted that it happens. However, after > discovering the above, it could be due to the worker launching "too > soon". > >> sessionInfo() > R version 3.4.3 (2017-11-30) > Platform: x86_64-pc-linux-gnu (64-bit) > Running under: Ubuntu 16.04.4 LTS > > Matrix products: default > BLAS: /usr/lib/atlas-base/atlas/libblas.so.3.0 > LAPACK: /usr/lib/atlas-base/atlas/liblapack.so.3.0 > > locale: > [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C > [3] LC_TIME=en_US.UTF-8 LC_COLLATE=en_US.UTF-8 > [5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8 > [7] LC_PAPER=en_US.UTF-8 LC_NAME=C > [9] LC_ADDRESS=C LC_TELEPHONE=C > [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C > > attached base packages: > [1] stats graphics grDevices utils datasets methods base > > loaded via a namespace (and not attached): > [1] compiler_3.4.3
Henrik Bengtsson
2018-Mar-09 05:58 UTC
[Rd] parallel:::newPSOCKnode(): background worker fails immediately if socket on master is not set up in time (BUG?)
A solution is to have parallel:::.slaveRSOCK() attempt to connect multiple times before failing, e.g. makeSOCKmaster <- function(master, port, timeout, useXDR, maxTries = 10L, interval = 1.0) { port <- as.integer(port) for (i in seq_len(maxTries)) { con <- tryCatch({ socketConnection(master, port = port, blocking = TRUE, open = "a+b", timeout = timeout) }, error = identity) if (inherits(con, "connection")) break Sys.sleep(interval) } if (inherits(con, "error")) stop(con) structure(list(con = con), class = if (useXDR) "SOCKnode" else "SOCK0node") } One could set 'maxTries' and 'interval' via commandArgs() like what is done for the other arguments. I'm happy to submit an SVN patch if R core thinks this is an acceptable solution. /Henrik On Thu, Mar 8, 2018 at 7:36 PM, Henrik Bengtsson <henrik.bengtsson at gmail.com> wrote:> I just noticed that parallel:::.slaveRSOCK() passes 'timeout' to > socketConnection() as a character, i.e. there's a missing timeout <- > as.integer(timeout), cf. port <- as.integer(port) and useXDR <- > as.logical(value): > >> parallel:::.slaveRSOCK > function () > { > makeSOCKmaster <- function(master, port, timeout, useXDR) { > port <- as.integer(port) > con <- socketConnection(master, port = port, blocking = TRUE, > open = "a+b", timeout = timeout) > structure(list(con = con), class = if (useXDR) > "SOCKnode" > else "SOCK0node") > } > master <- "localhost" > port <- NA_integer_ > outfile <- Sys.getenv("R_SNOW_OUTFILE") > methods <- TRUE > useXDR <- TRUE > for (a in commandArgs(TRUE)) { > pos <- regexpr("=", a) > name <- substr(a, 1L, pos - 1L) > value <- substr(a, pos + 1L, nchar(a)) > switch(name, MASTER = { > master <- value > }, PORT = { > port <- value > }, OUT = { > outfile <- value > }, TIMEOUT = { > timeout <- value > }, XDR = { > useXDR <- as.logical(value) > }) > } > if (is.na(port)) > stop("PORT must be specified") > sinkWorkerOutput(outfile) > msg <- sprintf("starting worker pid=%d on %s at %s\n", Sys.getpid(), > paste(master, port, sep = ":"), format(Sys.time(), "%H:%M:%OS3")) > cat(msg) > slaveLoop(makeSOCKmaster(master, port, timeout, useXDR)) > } > <bytecode: 0x4bd4b58> > <environment: namespace:parallel> > > Yet, fix that does *not* seem to change anything. > > /Henrik > > On Thu, Mar 8, 2018 at 7:05 PM, Henrik Bengtsson > <henrik.bengtsson at gmail.com> wrote: >> BACKGROUND: >> >> While troubleshooting random, occasionally occurring, errors from >> parallel::makePSOCKcluster("localhost", port = 11000); >> >> Error in socketConnection("localhost", port = port, server = TRUE, >> blocking = TRUE, : >> cannot open the connection >> >> I had another look at parallel:::newPSOCKnode(), which is used >> internally to set up each background worker. It is designed to, first >> launch the background worker as: >> >> system('R --slave --no-restore -e "parallel:::.slaveRSOCK()" --args >> MASTER=localhost PORT=11000 OUT=/dev/null TIMEOUT=2592000 XDR=TRUE', >> wait = FALSE) >> >> which immediately tries to connect to a socket on localhost:11000 with >> timeout. Immediately after the master launched the above (without >> waiting), it will set up the connection waiting for the connect from >> the background worker: >> >> con <- socketConnection("localhost", port = 11000, server = TRUE, >> blocking = TRUE, open = "a+b", timeout = timeout) >> >> >> ISSUE: >> >> If we emulate the above process, and remove the OUT=/dev/null such >> that we can see the output produces by the worker, as: >> >> setup <- function(delay = 0) { >> system('R --slave --no-restore -e "parallel:::.slaveRSOCK()" --args >> MASTER=localhost PORT=11000 TIMEOUT=2592000 XDR=TRUE', wait = FALSE) >> Sys.sleep(delay) >> socketConnection("localhost", port = 11000, server = TRUE, blocking >> = TRUE, open = "a+b", timeout = 20) >> } >> >> doing: >> >>> con <- setup(0) >> starting worker pid=24983 on localhost:11000 at 18:44:30.087 >> >> will most likely work, but adding a delay: >> >>> con <- setup(5) >> starting worker pid=25099 on localhost:11000 at 18:45:23.617 >> Warning in socketConnection(master, port = port, blocking = TRUE, open >> = "a+b", : >> localhost:11000 cannot be opened >> Error in socketConnection(master, port = port, blocking = TRUE, open = "a+b", : >> cannot open the connection >> Calls: <Anonymous> ... doTryCatch -> recvData -> makeSOCKmaster -> >> socketConnection >> >> will produce an *instant* error on the worker, and before master opens >> the server socket. Eventually, master will produce the originally >> observed error: >> >> Error in socketConnection("localhost", port = 11000, server = TRUE, >> blocking = TRUE, : >> cannot open the connection >> >> In other words, if the master fails to setup socketConnection() >> *before* the background workers attempts to connect, it all fails. >> Such a delay may happen for instance when there is a large CPU load on >> the test machine. >> >> Is this the above bug? >> >> /Henrik >> >> PS. The background is that I, very occasionally, observe R CMD check >> error on the above (on CRAN and elsewhere) when testing my future >> package. The error always go away when retested. This far I've though >> this is due to port clashes (since the port is random selected in >> [11000:11999]) and accepted that it happens. However, after >> discovering the above, it could be due to the worker launching "too >> soon". >> >>> sessionInfo() >> R version 3.4.3 (2017-11-30) >> Platform: x86_64-pc-linux-gnu (64-bit) >> Running under: Ubuntu 16.04.4 LTS >> >> Matrix products: default >> BLAS: /usr/lib/atlas-base/atlas/libblas.so.3.0 >> LAPACK: /usr/lib/atlas-base/atlas/liblapack.so.3.0 >> >> locale: >> [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C >> [3] LC_TIME=en_US.UTF-8 LC_COLLATE=en_US.UTF-8 >> [5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8 >> [7] LC_PAPER=en_US.UTF-8 LC_NAME=C >> [9] LC_ADDRESS=C LC_TELEPHONE=C >> [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C >> >> attached base packages: >> [1] stats graphics grDevices utils datasets methods base >> >> loaded via a namespace (and not attached): >> [1] compiler_3.4.3
Maybe Matching Threads
- parallel:::newPSOCKnode(): background worker fails immediately if socket on master is not set up in time (BUG?)
- parallel:::newPSOCKnode(): background worker fails immediately if socket on master is not set up in time (BUG?)
- parallel:::newPSOCKnode(): background worker fails immediately if socket on master is not set up in time (BUG?)
- parallel:::newPSOCKnode(): background worker fails immediately if socket on master is not set up in time (BUG?)
- SUGGESTION: Proposal to mitigate problem with stray processes left behind by parallel::makeCluster()