Checkpointing saves the state of a running application's processes to a file. The processes can later be restarted from that file. Because the maximum run time for a job on the CCR clusters is 72 hours, many users rely on checkpointing to pick up a job where it left off and continue running.


Checkpointing Jobs on the Cluster:


A job can be checkpointed using Distributed MultiThreaded CheckPointing (DMTCP). Single-threaded, multithreaded, and OpenMP applications may be checkpointable. Support for checkpointing MPI-parallel applications in SLURM is currently under development by the DMTCP team in partnership with CCR staff. Not all applications can be checkpointed. Please see the DMTCP website for full details.

 

New Features!

With the release of version 2.2.x of DMTCP, the corresponding SLURM scripts have been greatly simplified! A single script that auto-detects whether to checkpoint or restart can now be used.


Instructions for the Checkpoint and Restart of an Application:

The code to be checkpointed does not need to be linked with any particular library. To use DMTCP, copy the appropriate example checkpoint/restart script and modify it for your job.

  • For serial or multi-threaded applications:
    /util/academic/dmtcp/examples/serial/slurm_dmtcp_serial

  • For OpenMP applications:
    /util/academic/dmtcp/examples/openmp/slurm_dmtcp_openmp

  • For an initial run of the application, submit the checkpoint script using sbatch. For example:
    $ sbatch my_slurm_dmtcp_serial

  • To extend (i.e., restart) a previously checkpointed or restarted job, submit the same script again. The script detects existing DMTCP checkpoint files and automatically performs a restart. For example:
    $ sbatch my_slurm_dmtcp_serial
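
The auto-detection described above can be sketched in a few lines of shell. This is only an illustration, not the contents of the CCR example scripts: the function name dmtcp_mode is hypothetical, and the sketch assumes that DMTCP writes its checkpoint images as files matching ckpt_*.dmtcp in the directory being checked.

```shell
# Hypothetical helper (not part of DMTCP or the CCR example scripts).
# Prints "restart" if the given directory holds at least one checkpoint
# image, otherwise prints "launch".
# Assumption: checkpoint images are named ckpt_*.dmtcp.
dmtcp_mode() {
    dir="${1:-.}"
    for f in "$dir"/ckpt_*.dmtcp; do
        # If the glob matched nothing, $f is the literal pattern and
        # the -e test fails, so we fall through to "launch".
        if [ -e "$f" ]; then
            echo restart
            return 0
        fi
    done
    echo launch
}
```

A job script could call dmtcp_mode "$SLURM_SUBMIT_DIR" and branch on the result, starting the application fresh (for example with dmtcp_launch) or resuming it (for example with dmtcp_restart). The example scripts under /util/academic/dmtcp/examples remain the reference for the actual commands and options.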