Mizanur Khondoker
2008-Oct-14 14:06 UTC
[R] How to checkpoint-restart R jobs in batch mode?
Dear list,

Most high-performance computing clusters and grid engines restrict how long a job can run in batch mode. The cluster I am using has a limit of 48 hours, but my job would take far longer than that.

I know that it is possible to checkpoint jobs without modifying the code if some specialized software (e.g., BLCR) is installed on the grid engine.

However, I am looking for a solution for when this kind of facility is not available on the cluster, for example by modifying the code so that the job can checkpoint and restart by itself.

Does anyone have any experience with, or ideas for, doing so? Any help would be greatly appreciated.

--
Mizanur Khondoker
Division of Pathway Medicine (DPM)
The University of Edinburgh Medical School
The Chancellor's Building
49 Little France Crescent
Edinburgh EH16 4SB
United Kingdom

Tel: +44 (0) 131 242 6287
Fax: +44 (0) 131 242 6244
http://www.pathwaymedicine.ed.ac.uk/
Prof Brian Ripley
2008-Oct-14 14:29 UTC
[R] How to checkpoint-restart R jobs in batch mode?
On Tue, 14 Oct 2008, Mizanur Khondoker wrote:

> Dear list,
>
> Most high performance computing clusters/grid engines have some
> restrictions on how long a job can be run in batch mode.
> The cluster I am using has maximum of 48 hours limit, but my job would take
> far more than that.
>
> I know that it is possible to checkpoint jobs without modifying the code if
> some specialized software (e.g., BLCR) is installed on the grid engine.
>
> However, I am looking for a solution when this kind of facility is not
> available on the cluster, for example, by modifying the code so that the
> job can checkpoint and restart by itself.
>
> Does anyone have any experience or idea of doing so? Any help would be
> greatly appreciated.

Yes, we've done this for many years, generally by saving the workspace every few hours (in our case, say, every 100 simulation runs), and making sure that the workspace contains enough information to restart at the save points.

This approach does depend on the run coming back to a simply reproducible point fairly often: if it is a simulation running entirely in C++ code in a package, you have little hope.

--
Brian D. Ripley,                  ripley at stats.ox.ac.uk
Professor of Applied Statistics,  http://www.stats.ox.ac.uk/~ripley/
University of Oxford,             Tel:  +44 1865 272861 (self)
1 South Parks Road,                     +44 1865 272866 (PA)
Oxford OX1 3TG, UK                Fax:  +44 1865 272595
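
For readers who want something concrete, the approach described in the reply above could be sketched roughly as follows. This is only an illustration under stated assumptions, not the code used at Oxford: it saves a small state list with saveRDS() rather than the whole workspace, and the names run_with_checkpoints, one_run, checkpoint_file, save_every and stop_after are all hypothetical. The stop_after argument exists only to imitate the scheduler killing the job part-way through.

```r
# Hypothetical stand-in for one unit of work (e.g. one simulation run).
one_run <- function() mean(rnorm(50))

# Sketch of a self-checkpointing loop. All names here are assumptions
# made for illustration; adapt the state list to your own job.
run_with_checkpoints <- function(checkpoint_file, n_runs = 1000L,
                                 save_every = 100L, stop_after = n_runs) {
  if (file.exists(checkpoint_file)) {
    # Restart: pick up the state written at the last save point.
    state <- readRDS(checkpoint_file)
  } else {
    # Fresh start: fix the seed and record the initial RNG state.
    set.seed(42)
    state <- list(next_run = 1L,
                  results  = numeric(n_runs),
                  seed     = get(".Random.seed", envir = globalenv()))
  }
  # Restore the RNG so a resumed job continues the same random stream.
  assign(".Random.seed", state$seed, envir = globalenv())

  last <- min(n_runs, stop_after)  # stop_after imitates a killed job
  while (state$next_run <= last) {
    i <- state$next_run
    state$results[i] <- one_run()
    state$next_run <- i + 1L

    if (i %% save_every == 0L || i == n_runs) {
      # Save point: store progress together with the current RNG state.
      state$seed <- get(".Random.seed", envir = globalenv())
      # Write to a temporary file first, then rename, so that a job
      # killed during the write does not leave a truncated checkpoint.
      tmp <- paste0(checkpoint_file, ".tmp")
      saveRDS(state, tmp)
      file.rename(tmp, checkpoint_file)
    }
  }
  state
}
```

The batch script would then simply call run_with_checkpoints("job.rds") each time it is resubmitted (the file name is again just an example). Work done after the last save point is lost and repeated on restart, which is why the reply stresses that the run must return to a reproducible point fairly often.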