On Tue, Dec 14, 2021 at 1:17 AM Andy Jacobson <andy at yovo.org> wrote:
>
> Those are good points, Duncan. I am experimenting with a nice checkpointing
> tool called DMTCP. It operates on the system level but is quite OS-dependent. It
> can be found at http://dmtcp.sourceforge.net/index.html.
>
> Still, it would be nice to be able to checkpoint calls within R to
> potentially long-running processes like optim().

That's a tantalizing idea. Imagine if we could come up with some de-facto
standard API for this, and that such a framework could be called
automatically by R. Something similar to how user interrupts are checked
(e.g. R_CheckUserInterrupt()) on a regular basis by the R engine and
throughout the R code. That could also help with troubleshooting and
debugging, e.g. sending the checkpoint to someone else or going backwards
in time.
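
To make the R-level variant a bit more concrete, here is a minimal sketch of
the restricted case Duncan describes in the quoted text further down, where
the whole state lives in R variables. It runs optim() in short chunks and
saves the current parameters with saveRDS() after each chunk. Note that
checkpointed_optim(), its arguments, and the checkpoint file layout are all
made up for this example; they are not an existing API. Also, restarting
BFGS from saved parameters resets its internal Hessian approximation, so the
path taken is not identical to one uninterrupted optim() call.

```r
# Hypothetical helper (not part of R or any package): run optim() in short
# chunks and save the current state to 'file' after each chunk, so that an
# interrupted run can pick up from the last saved parameters.
checkpointed_optim <- function(par, fn, file, chunk_maxit = 50L,
                               max_chunks = 100L) {
  # Resume from an earlier checkpoint if one exists; otherwise start fresh
  state <- if (file.exists(file)) {
    readRDS(file)
  } else {
    list(par = par, value = NA_real_, chunk = 0L, done = FALSE)
  }

  while (!state$done && state$chunk < max_chunks) {
    # Each chunk is an ordinary optim() call limited to 'chunk_maxit' iterations
    fit <- optim(state$par, fn, method = "BFGS",
                 control = list(maxit = chunk_maxit))
    state <- list(par = fit$par, value = fit$value,
                  chunk = state$chunk + 1L,
                  done = (fit$convergence == 0L))
    # The checkpoint: everything needed to resume is in 'state'
    saveRDS(state, file)
  }
  state
}

# Example: the Rosenbrock function used in example("optim")
fr <- function(x) 100 * (x[2] - x[1]^2)^2 + (1 - x[1])^2
ckpt <- tempfile(fileext = ".rds")
res <- checkpointed_optim(c(-1.2, 1), fr, file = ckpt, chunk_maxit = 10L)
```

Calling checkpointed_optim() again with the same 'file' resumes from the
saved state instead of the supplied 'par'. This only works because the
objective function has no hidden state outside of R, which is exactly the
limitation Duncan points out.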
Pasting in the below since I failed to hit Reply *All* the other day,
and it was only Richard who got it:
A few weeks ago, I played around with DMTCP (Distributed MultiThreaded
CheckPointing) for Linux (https://github.com/dmtcp/dmtcp). I'm
sharing in case someone is interested in investigating this further.
Also, somewhere on the DMTCP wiki, they asked for testing with R by
more experienced users.
"DMTCP is a tool to transparently checkpoint the state of multiple
simultaneous applications, including multi-threaded and distributed
applications. It operates directly on the user binary executable,
without any Linux kernel modules or other kernel modifications."
They seem to be able to run this with HPC jobs, open files, Linux
containers, and even MPI. I've only tested it very quickly with
interactive R, and it seems to work. Obviously, more testing needs
to be done to identify when it doesn't work. For example, I'd have a
hard time believing it would work out of the box with local parallel
PSOCK workers. They mention "plug-ins", so maybe there's a way to add
support for specific use cases one by one.
Several academic HPC environments appear to use it, e.g.
* https://docs.nersc.gov/development/checkpoint-restart/dmtcp/
* http://wiki.orc.gmu.edu/mkdocs/Creating_Checkpoints_%28DMTCP%29/
* https://wiki.york.ac.uk/display/RCS/VK21%29+Checkpointing+with+DMTCP
That's all I have time for now,
Henrik
>
> -Andy
>
> On 12/13/21 11:51 AM, Duncan Murdoch wrote:
> > On 13/12/2021 12:58 p.m., Greg Minshall wrote:
> >> Jeff,
> >>
> >>> This sounds like an OS feature, not an R feature... certainly not a
> >>> portable R feature.
> >>
> >> i'm not arguing for it, but this seems to me like something that could
> >> be a language feature.
> >>
> >
> > R functions can call libraries written in other languages, and can
> > start processes, etc. R doesn't know everything going on in every function
> > call, and would have a lot of trouble saving it.
> >
> > If you added some limitations, e.g. a process that periodically has
> > its entire state stored in R variables, then it would be a lot easier.
> >
> > Duncan Murdoch
>
> --
> Andy Jacobson
> andy at yovo.org
>
> ______________________________________________
> R-help at r-project.org mailing list -- To UNSUBSCRIBE and more, see
> https://stat.ethz.ch/mailman/listinfo/r-help
> PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
> and provide commented, minimal, self-contained, reproducible code.