Henrik Bengtsson
2021-Aug-12 08:22 UTC
[Rd] Force quitting a FORK cluster node on macOS and Solaris wreaks havoc
The following smells like a bug in R to me, because it puts the main R session into an unstable state. Consider the following R script:

  a <- 42
  message("a=", a)
  cl <- parallel::makeCluster(1L, type="FORK")
  try(parallel::clusterEvalQ(cl, quit(save="no")))
  message("parallel:::isChild()=", parallel:::isChild())
  message("a=", a)
  rm(a)

The purpose of this was to emulate what happens when a parallel worker crashes.

Now, if you source() the above on macOS, you might(*) end up with:

  > a <- 42
  > message("a=", a)
  a=42
  > cl <- parallel::makeCluster(1L, type="FORK")
  > try(parallel::clusterEvalQ(cl, quit(save="no")))
  Error: Error in unserialize(node$con) : error reading from connection
  > message("parallel:::isChild()=", parallel:::isChild())
  parallel:::isChild()=FALSE
  > message("a=", a)
  a=42
  > rm(a)
  > try(parallel::clusterEvalQ(cl, quit(save="no")))
  Error: Error in unserialize(node$con) : error reading from connection
  > message("parallel:::isChild()=", parallel:::isChild())
  parallel:::isChild()=FALSE
  > message("a=", a)
  Error: Error in message("a=", a) : object 'a' not found
  Execution halted

Note how 'rm(a)' is supposed to be the last line of code to be evaluated. However, the force quitting of the FORK cluster node appears to result in the main code being evaluated twice (in parallel?).

(*) This does not happen on all macOS variants. For example, it works fine on CRAN's 'r-release-macos-x86_64' but it does give the above behavior on 'r-release-macos-arm64'. I can reproduce it on GitHub Actions (https://github.com/HenrikBengtsson/teeny/runs/3309235106?check_suite_focus=true#step:10:219) but not on R-hub's 'macos-highsierra-release' and 'macos-highsierra-release-cran'. I can also reproduce it on R-hub's 'solaris-x86-patched' and 'solaris-x86-patched-ods' machines. However, I still haven't found a Linux machine where this happens.
If one replaces quit(save="no") with tools::pskill(Sys.getpid()) or parallel:::mcexit(0L), this behavior does not take place (at least not on GitHub Actions and R-hub).

I don't have access to a macOS or a Solaris machine, so I cannot investigate further myself. For example, could it be an issue with quit(), or is it possible to trigger it by other means? And more importantly, should this be fixed? Also, I'd be curious what happens if you run the above in an interactive R session.

/Henrik
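The tools::pskill() variant mentioned above could look roughly like the following. This is a minimal sketch, not a verified fix: it assumes a Unix-alike where FORK clusters are available, and the variable names `res` and `worker_failed` are introduced here for illustration only.

```r
# Sketch only: assumes a Unix-alike, since FORK clusters are not available on Windows.
a <- 42
cl <- parallel::makeCluster(1L, type = "FORK")

# Emulate a crash by having the worker kill itself instead of calling quit().
# tools::pskill() sends SIGTERM by default.
res <- try(
  parallel::clusterEvalQ(cl, tools::pskill(Sys.getpid())),
  silent = TRUE
)

# The master is expected to see an error, because the worker's connection is gone.
worker_failed <- inherits(res, "try-error")
message("worker failed: ", worker_failed)

# The master should still be the master, with its workspace intact.
message("parallel:::isChild()=", parallel:::isChild())
message("a=", a)

# The worker is already dead, so shutting down the cluster may itself fail.
try(parallel::stopCluster(cl), silent = TRUE)
```

Wrapping stopCluster() in try() is a defensive choice here, since the master may no longer be able to reach the worker.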
Simon Urbanek
2021-Aug-12 22:58 UTC
[Rd] Force quitting a FORK cluster node on macOS and Solaris wreaks havoc
Henrik,

I'm not quite sure I understand the report, to be honest. Just a quick comment here: using quit() in a forked child is not allowed, because the R clean-up is only intended for the master. Run in a child, it will blow away the master's state, connections, and working directory, run the master's exit handlers, etc. That's why the children have to use either abort or mcexit() to terminate, which is what mcparallel() does. If you use q(), a lot of things go wrong no matter the platform; e.g. try using ? in the master session after sourcing your code.

Cheers,
Simon
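For comparison, running code in a forked child through mcparallel(), so that the child is never terminated via quit(), might look like the sketch below. It assumes a Unix-alike; the names `job` and `child_pid` are illustrative and not taken from the thread.

```r
# Sketch only: assumes a Unix-alike, where parallel::mcparallel() can fork.
a <- 42

# The child works on its own copy of the workspace. mcparallel() takes care of
# ending the child once the expression has been evaluated.
job <- parallel::mcparallel({
  a <- 0          # changes the child's copy only
  Sys.getpid()    # value sent back to the master
})

# mccollect() returns a list with one element per collected job.
child_pid <- parallel::mccollect(job)[[1]]

message("child pid differs from master pid: ", child_pid != Sys.getpid())
message("a=", a)
```

The assignment inside the child is there only to show that the master's `a` is not affected by what the child does.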