I have been using DMTCP successfully for a long-running optim() task. This is a
single-core process running on a large Linux cluster with Slurm as the job
manager. The cluster places an 8-hour limit on individual jobs, and since my
cost function takes 11 minutes to compute, I need many such jobs run
sequentially. To make DMTCP work, I had to rework file I/O to avoid
references to temporary files written to /tmp. Apart from that, optim() is
checkpointed just before the 8 hours are up, and then resumed successfully in a
subsequent batch job running on a different core of the cluster.
While I have an answer for my particular task, it would still be useful to
checkpoint using the scheme Henrik suggests. Thanks all for the interesting
conversation!
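
For anyone who cannot use a system-level tool, the restricted case Duncan describes further down the thread (a process whose entire state is periodically stored in R variables) can be roughly approximated in plain R. The following is only a minimal sketch under stated assumptions: checkpointed_optim(), its arguments, the toy cost function, and the checkpoint file name are all hypothetical names made up for illustration. It runs optim() in short chunks and saves the latest result with saveRDS() after each chunk. Note that restarting from the saved parameters discards the optimizer's internal state (for example the BFGS Hessian approximation), so the search path will differ from an uninterrupted run. This is not what DMTCP does, and not the engine-level API Henrik has in mind.

```r
# Hypothetical helper: run optim() in chunks and checkpoint after each one.
checkpointed_optim <- function(par, fn, method = "BFGS", chunk_maxit = 50,
                               max_chunks = 100,
                               file = "optim_checkpoint.rds") {
  res <- NULL
  chunks_done <- 0

  # Resume from an earlier checkpoint if one exists.
  if (file.exists(file)) {
    state <- readRDS(file)
    res <- state$res
    par <- res$par
    chunks_done <- state$chunks_done
  }

  # Keep going until optim() reports convergence (code 0) or we run out of chunks.
  while (chunks_done < max_chunks && (is.null(res) || res$convergence != 0)) {
    res <- optim(par, fn, method = method, control = list(maxit = chunk_maxit))
    par <- res$par
    chunks_done <- chunks_done + 1
    # Everything needed to restart lives in this one file.
    saveRDS(list(res = res, chunks_done = chunks_done), file)
  }
  res
}

# Toy cost function standing in for an expensive one (assumption for the demo).
toy_cost <- function(x) sum((x - c(3, -2))^2)

ckpt <- file.path(getwd(), "toy_checkpoint.rds")
fit <- checkpointed_optim(c(0, 0), toy_cost, chunk_maxit = 2, file = ckpt)
```

If the job is killed between chunks, calling checkpointed_optim() again with the same file picks up from the last saved parameters. With an 11-minute cost function, chunk_maxit would have to be chosen so that a chunk comfortably fits inside the job's time limit.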
-Andy
On 12/14/21 5:39 PM, Henrik Bengtsson wrote:
> On Tue, Dec 14, 2021 at 1:17 AM Andy Jacobson <andy at yovo.org> wrote:
>>
>> Those are good points, Duncan. I am experimenting with a nice
>> checkpointing tool called DMTCP. It operates on the system level but is quite
>> OS-dependent. It can be found at http://dmtcp.sourceforge.net/index.html.
>>
>> Still, it would be nice to be able to checkpoint calls within R to
>> potentially long-running processes like optim().
>
> Teasing idea. Imagine if we could come up with some de-facto standard
> API for this and that such a framework could be called automatically
> by R. Something similar to how user interrupts are checked (e.g.
> R_CheckUserInterrupt()) on a regular basis by the R engine and
> throughout the R code. That could help with troubleshooting and debugging,
> e.g. sending the checkpoint to someone else or going backwards in
> time.
>
> Pasting in the below since I failed to hit Reply *All* the other day,
> and it was only Richard who got it:
>
> A few weeks ago, I played around with DMTCP (Distributed MultiThreaded
> CheckPointing) for Linux (https://github.com/dmtcp/dmtcp). I'm
> sharing in case someone is interested in investigating this further.
> Also, somewhere on the DMTCP wiki, they asked for testing with R by
> more experienced users.
>
> "DMTCP is a tool to transparently checkpoint the state of multiple
> simultaneous applications, including multi-threaded and distributed
> applications. It operates directly on the user binary executable,
> without any Linux kernel modules or other kernel modifications."
>
> They seem to be able to run this with HPC jobs, open files, Linux
> containers, and even MPI, and so on. I've only tested it very quickly
> with interactive R and it seems to work. Obviously more testing needs
> to be done to identify when it doesn't work. For example, I'd have a
> hard time believing it would work out of the box with local parallel PSOCK
> workers. They mention "plug-ins", so maybe there's a way to add
> support for specific use cases one by one.
>
> Several academic HPC environments appear to use it, e.g.
>
> * https://docs.nersc.gov/development/checkpoint-restart/dmtcp/
> * http://wiki.orc.gmu.edu/mkdocs/Creating_Checkpoints_%28DMTCP%29/
> * https://wiki.york.ac.uk/display/RCS/VK21%29+Checkpointing+with+DMTCP
>
> That's all I have time for now,
>
> Henrik
>
>>
>> -Andy
>>
>> On 12/13/21 11:51 AM, Duncan Murdoch wrote:
>>> On 13/12/2021 12:58 p.m., Greg Minshall wrote:
>>>> Jeff,
>>>>
>>>>> This sounds like an OS feature, not an R feature... certainly not a
>>>>> portable R feature.
>>>>
>>>> i'm not arguing for it, but this seems to me like something that could
>>>> be a language feature.
>>>>
>>>
>>> R functions can call libraries written in other languages, and can
>>> start processes, etc. R doesn't know everything going on in every function
>>> call, and would have a lot of trouble saving it.
>>>
>>> If you added some limitations, e.g. a process that periodically has
>>> its entire state stored in R variables, then it would be a lot easier.
>>>
>>> Duncan Murdoch
>>
>> --
>> Andy Jacobson
>> andy at yovo.org
>>
--
Andy Jacobson
andy.jacobson at noaa.gov
NOAA Global Monitoring Lab
325 Broadway
Boulder, Colorado 80305
303/497-4916