Batch processing involves users submitting jobs to a scheduler and resource manager, which decides how to run those jobs most efficiently while keeping utilization of all resources as high as possible.  At CCR, we use SLURM (Simple Linux Utility for Resource Management), which provides a framework for job queues, allocation of compute nodes, and the starting and execution of jobs.  Users of the CCR clusters submit job scripts to the SLURM scheduler, which evaluates the request for resources (CPU, memory, time, etc.).  Jobs are placed in the queue while they wait to be scheduled and assigned to nodes.
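
For example, a minimal job script might look like the following sketch.  The resource values, job name, and program name (my_program) are illustrative; adjust them to your job's requirements:

  #!/bin/bash
  #SBATCH --partition=general-compute    # default partition; see Batch System below
  #SBATCH --time=00:30:00                # wall time limit (hh:mm:ss)
  #SBATCH --nodes=1                      # number of nodes
  #SBATCH --ntasks-per-node=1            # tasks (cores) per node
  #SBATCH --mem=4G                       # memory per node
  #SBATCH --job-name=example

  # Commands below run on the allocated compute node
  ./my_program

The script is then submitted with sbatch, described in the command list below.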


Benefits of Batch Computing:
  • It allows sharing of computer resources among many users and programs.
  • It shifts the time of job processing to when the computing resources are less busy.
  • It avoids idling the computing resources with minute-by-minute manual intervention and supervision.
  • By keeping a high overall rate of utilization, it better amortizes the cost of a computer, especially an expensive one.

Source - Wikipedia 



Testing on the Front-end (login machines):

  • The front-end machines (rush and presto) can be used for tests that run for a few minutes and do not use an extensive amount of memory.
  • The maximum amount of time for running tests on the front-end machines is 30 minutes.



Batch System:

The compute nodes in all CCR clusters are available in SLURM partitions.  Users submit jobs to request node resources in a partition.  The SLURM partitions for general use are in the UB-HPC academic cluster and are labeled: general-compute, debug, gpu, largemem, and viz (available through the remote visualization portal).  The default partition is general-compute.  The partitions available on the industry cluster are labeled: industry (available to industry partners only) and scavenger (available to academic users; jobs must be able to checkpoint).  Clusters of nodes purchased by individual lab or departmental groups are broken up into their own partitions, which are available only to users in the groups that own the nodes.  More information about faculty clusters can be found here.  Some faculty groups allow use of their idle nodes, and these are part of the scavenger partition as well.  For more information about using the scavenger partition, please read this article.
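
To direct a job to a partition other than the default, pass the --partition flag to sbatch or set it in the script itself with a #SBATCH --partition directive.  The script name myscript.sh below is illustrative:

  sbatch --partition=largemem myscript.sh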


SLURM provides scalability and performance: it can manage and allocate the compute nodes of very large clusters and can accept up to 1,000 job submissions per second.



SLURM Commands


The following is a list of useful commands available for SLURM.  Some of these were built by CCR to allow easier reporting for users.

For usage information on these commands, use the --help flag (example: sinfo --help).

Use the Linux command 'man' for more information about most of these commands (example: man sinfo).

In the commands below, placeholders such as jobid and username indicate user-supplied information.  Brackets indicate optional flags.

List SLURM commands
  slurmhelp

View information about SLURM nodes & partitions
  sinfo [-p partition_name or -M cluster_name]

List example SLURM scripts
  ls -p /util/slurm-scripts | less

Submit a job script for later execution
  sbatch script-file

Cancel a pending or running job
  scancel jobid

Check the state of a user's jobs
  squeue --user=username

Allocate compute nodes for interactive use
  salloc

Run a command on allocated compute nodes
  srun

Display node information
  snodes [node cluster/partition state]

Launch an interactive job (see the fisbatch KB article)
  fisbatch [various sbatch options]

List priorities of queued jobs (see the job priority KB article)
  sranks

Get the efficiency of a running job
  sueff user-name

Get SLURM accounting information for a user's jobs from start date to now (see the job history & accounting KB article)
  suacct start-date user-name

Get SLURM accounting and node information for a job (see the job history & accounting KB article)
  slist jobid

Get resource usage and accounting information for a user's jobs from start date to now
  slogs start-date user-list

Get estimated starting times for queued jobs (see the job status KB article)
  stimes [various squeue options]

Monitor performance of a SLURM job
  /util/ccrjobvis/slurmjobvis jobid
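
As a quick illustration of a typical workflow with these commands, the following lists your queued jobs, shows their estimated start times, and cancels one of them (the job ID 1234567 is illustrative):

  squeue --user=$USER
  stimes
  scancel 1234567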


Submitting SLURM jobs