Checkpointing saves the state of a running application's processes to a file, from which the processes can later be restarted. Because the maximum time a job can run on the CCR clusters is 72 hours, many users rely on checkpointing to pick up a job where it left off and continue running. Users are encouraged to investigate whether the software they use can stop a job and restart it where it left off. Where available, the software's built-in checkpointing tools are the best option.

Checkpointing Jobs on the Cluster:

A job can be checkpointed using Distributed MultiThreaded CheckPointing (DMTCP). Single-threaded, multithreaded, and OpenMP applications may be checkpointable. Support for checkpointing MPI-parallel applications in SLURM is currently under development by the DMTCP team in partnership with CCR staff. Not all applications can be checkpointed. Please see the DMTCP website for full details.


New Features!

With the release of version 2.2.x of DMTCP, the corresponding SLURM scripts have been greatly simplified! A single script that auto-detects whether to checkpoint or restart can now be used.

Instructions for the Checkpoint and Restart of an Application:

The code to be checkpointed does not need to be linked with any particular library. To use DMTCP, users should copy and modify the appropriate example checkpoint/restart scripts.

  • For serial or multi-threaded applications:

  • For OpenMP applications:

  • For an initial run of the application, submit the checkpoint script using sbatch. For example:
    $ sbatch my_slurm_dmtcp_serial

  • Use the same script to extend (i.e., restart) a previously checkpointed (or restarted) job. The script detects any existing DMTCP checkpoint files and automatically performs a restart. For example:
    $ sbatch my_slurm_dmtcp_serial
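
The auto-detection described above can be pictured with a minimal sketch. This is not the CCR example script. It assumes checkpoint images use DMTCP's default ckpt_*.dmtcp naming and are written to the submission directory, and the helper name dmtcp_mode and the application name my_app are hypothetical.

```shell
#!/bin/sh
# Illustrative sketch only; not the CCR example script.
# Assumption: checkpoint images use DMTCP's default ckpt_*.dmtcp naming.

# dmtcp_mode DIR  (hypothetical helper)
# Prints "restart" if DIR holds at least one checkpoint image, else "launch".
dmtcp_mode() {
    for image in "${1:-.}"/ckpt_*.dmtcp; do
        # An unmatched glob stays literal, so confirm the file really exists.
        if [ -e "$image" ]; then
            echo restart
            return 0
        fi
    done
    echo launch
}

# SLURM sets SLURM_SUBMIT_DIR to the directory sbatch was called from.
mode=$(dmtcp_mode "${SLURM_SUBMIT_DIR:-.}")
echo "Selected mode: $mode"

# A real job script would branch on $mode here, for example:
#   launch:  dmtcp_launch --interval 3600 ./my_app   (my_app is a placeholder)
#   restart: dmtcp_restart ckpt_*.dmtcp
```

Because the decision depends only on which files are present, the same script can be submitted unchanged for the initial run and for every later restart. The one-hour checkpoint interval shown in the comment is an arbitrary illustration, not a CCR recommendation.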