Checkpointing saves the state of a running application's processes to a file so that the processes can later be restarted from that file.  Because the maximum time a job can run on the CCR clusters is 72 hours, many users rely on checkpointing to pick up a job where it left off and continue running.  Users are encouraged to investigate whether the software they are using can stop and restart where a job leaves off.  When it can, it is best to use the software's built-in checkpointing tools.
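
As a rough illustration of what built-in (application-level) checkpointing means, the sketch below shows a toy workload that records its progress in a state file and resumes from it on the next run. Everything in it is hypothetical: the names STATE_FILE, TOTAL_STEPS and run_steps are invented for this example, and the per-run step limit merely stands in for the 72-hour wall-time limit. Real applications save far more state than a single counter.

```shell
#!/bin/sh
# Toy application-level checkpointing (illustrative only).
# STATE_FILE and TOTAL_STEPS are hypothetical names, not CCR settings.
STATE_FILE="${STATE_FILE:-./progress.state}"
TOTAL_STEPS="${TOTAL_STEPS:-10}"

# run_steps LIMIT
# Performs at most LIMIT steps in this run (a stand-in for the wall-time
# limit), saving progress after every step, then prints the last
# completed step number.
run_steps() {
    limit="$1"
    step=0
    # Resume from the saved step if a checkpoint exists.
    if [ -f "$STATE_FILE" ]; then
        step=$(cat "$STATE_FILE")
    fi
    done_now=0
    while [ "$step" -lt "$TOTAL_STEPS" ] && [ "$done_now" -lt "$limit" ]; do
        step=$((step + 1))          # the "work" for this step
        done_now=$((done_now + 1))
        # Write to a temporary file and rename it, so an interrupted
        # write does not leave a corrupt checkpoint behind.
        echo "$step" > "$STATE_FILE.tmp" && mv "$STATE_FILE.tmp" "$STATE_FILE"
    done
    echo "$step"
}
```

Calling run_steps 4 three times in a row against the same state file completes steps 1-4, then 5-8, then 9-10, which is the same pattern as resubmitting a job that resumes from its last checkpoint.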


NOTICE:  DMTCP checkpointing will not work with the latest version of the scheduling software, SLURM, installed on June 25, 2018. The DMTCP developers have been alerted to the problem and are currently investigating.  A fix, if one can be found, is not expected until the end of summer.



Checkpointing Jobs on the Cluster:


A job can be checkpointed using Distributed MultiThreaded CheckPointing (DMTCP). Single-threaded, multithreaded, and OpenMP applications may be checkpointable. Support for checkpointing MPI-parallel applications in SLURM is currently under development by the DMTCP team in partnership with CCR staff. Not all applications can be checkpointed. Please see the DMTCP website for full details.

 

New Features!

With the release of version 2.2.x of DMTCP, the corresponding SLURM scripts have been greatly simplified! A single script that auto-detects whether to checkpoint or restart can now be used.


Instructions for the Checkpoint and Restart of an Application:

The code to be checkpointed does not need to be linked with any particular library. To use DMTCP, users should copy and modify the appropriate example checkpoint/restart scripts.

  • For serial or multi-threaded applications:
    /util/academic/dmtcp/examples/serial/slurm_dmtcp_serial

  • For OpenMP applications:
    /util/academic/dmtcp/examples/openmp/slurm_dmtcp_openmp

  • For an initial run of the application, submit the checkpoint script using sbatch. For example:
    $ sbatch my_slurm_dmtcp_serial

  • Use the same script to extend (i.e., restart) a previously checkpointed (or restarted) job. The script will discover the presence of DMTCP checkpoint files and will automatically perform a restart. For example:
    $ sbatch my_slurm_dmtcp_serial
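
The auto-detection described above can be pictured with a small sketch. This is not the content of the CCR example scripts, which are not reproduced here. It only assumes that checkpoint images follow DMTCP's default naming pattern (ckpt_*.dmtcp) and that they are written to a known directory. The function name choose_mode is hypothetical.

```shell
#!/bin/sh
# Sketch of checkpoint-or-restart detection (illustrative only).
# Assumption: checkpoint images are named ckpt_*.dmtcp, DMTCP's default
# naming pattern. choose_mode is a hypothetical helper, not part of DMTCP
# or of the CCR example scripts.

# choose_mode [DIR]
# Prints "restart" if DIR (default: current directory) contains at least
# one checkpoint image, otherwise prints "launch".
choose_mode() {
    dir="${1:-.}"
    for f in "$dir"/ckpt_*.dmtcp; do
        # If nothing matches, the pattern is left unexpanded and the
        # existence test fails, so we fall through to "launch".
        if [ -e "$f" ]; then
            echo restart
            return 0
        fi
    done
    echo launch
}
```

A job script built on this idea would start the application under dmtcp_launch when choose_mode prints "launch" and resume it with dmtcp_restart when it prints "restart", which is why the same sbatch command works for both the first run and every later continuation.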