Batch computing involves users submitting jobs to a scheduler and resource manager that decides the most efficient way to run them while keeping overall resource usage as high as possible.  At CCR, we use SLURM (Simple Linux Utility for Resource Management), which provides a framework for job queues, allocation of compute nodes, and the start and execution of jobs.  Users of the CCR clusters submit job scripts (executable programs) to the SLURM scheduler, which evaluates the request for resources (CPU, memory, time, etc.).  Jobs are placed in the queue while they wait to be scheduled and assigned to nodes.
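
As a rough illustration of what a job script looks like, the sketch below writes a minimal script and submits it only if the SLURM client tools are installed.  The file name hello-job.sh, the job name, and the resource values are placeholders chosen for this example, not CCR recommendations.  The #SBATCH lines are how the script states its request for CPU, memory, and time.

```shell
#!/bin/bash
# Write a minimal job script to a file (hello-job.sh is a hypothetical name).
cat > hello-job.sh <<'EOF'
#!/bin/bash
# Job name shown in the queue (placeholder).
#SBATCH --job-name=hello
# One node, one task (CPU core) on that node.
#SBATCH --nodes=1
#SBATCH --ntasks-per-node=1
# Memory per node and wall-time limit (HH:MM:SS); example values only.
#SBATCH --mem=1G
#SBATCH --time=00:10:00
# %j expands to the job ID.
#SBATCH --output=hello-%j.out

echo "Running on $(hostname)"
EOF

# Submit only where the SLURM client tools are installed.
if command -v sbatch >/dev/null 2>&1; then
    sbatch hello-job.sh
else
    echo "sbatch not found; wrote hello-job.sh without submitting"
fi
```

The directives must come before the first executable command in the script.  Anything you set on the sbatch command line overrides the matching directive in the file.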

Here's a simple way to think about batch computing:  Consider entering the busiest restaurant in town at 6pm on a Saturday evening with a group of 10 friends.  You have no reservation.  There is a long line of diners in front of you waiting for tables.  When you approach the hostess, she asks you a number of questions.  Do you all want to sit together?  If you do, the wait will be very long.  If you are willing to split up your party, do you need to sit near each other in the dining room or can you sit across the room from each other?  Do you want a specific type of table (for example, a booth or bar top) or are you happy with anything you can get?  Are you willing to split up and sit with other people that you may not know?  While these questions may seem silly and implausible for a dinner party, they illustrate options for getting your party seated and fed as quickly as possible.  If you're trying to catch a show or sporting event, maybe your goal is to get in and out as quickly as possible.  If you'd like to linger over dinner and engage in conversation with all 10 of your friends, you will find your options are more limited.

The batch scheduler uses these same premises.  If you need specific hardware (all the same CPUs, a specific amount of memory, and/or the fast InfiniBand network), your jobs will wait longer in the queue until nodes that meet these requirements are available.  If you're willing to share nodes with other users, the scheduler will fit your jobs in wherever resources are available.  (NOTE: this can be troublesome if other users' jobs end up using all the resources on a node and starve your jobs.)  You will need to decide what your jobs require and which options you can compromise on.
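
To make the trade-off concrete, here is a small sketch that prints two ways of requesting the same 10 tasks.  The script name job.sh and the feature tag IB are hypothetical (feature tags are defined by each site), and the memory and time values are examples only.  Nothing is submitted; the commands are just printed for comparison.

```shell
#!/bin/bash
# Two ways to ask for the same amount of work.
# job.sh is a hypothetical job script; IB is a hypothetical feature tag.

# Flexible: take 10 cores wherever they are free and share nodes with others.
flexible="--ntasks=10 --mem-per-cpu=2G --time=01:00:00"

# Strict: all 10 tasks on one node, no sharing, specific hardware.
# This is the "we all sit together at a booth" request and may wait longer.
strict="--nodes=1 --ntasks=10 --exclusive --constraint=IB --time=01:00:00"

echo "flexible: sbatch $flexible job.sh"
echo "strict:   sbatch $strict job.sh"
```

The flexible request lets the scheduler place tasks on any nodes with free cores.  The strict request adds --exclusive and --constraint, which narrow the set of nodes that can satisfy it.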

Benefits of Batch Computing:
  • It allows sharing of computer resources among many users and programs.
  • It shifts the time of job processing to when the computing resources are less busy.
  • It avoids idling the computing resources with minute-by-minute manual intervention and supervision.
  • By keeping a high overall rate of utilization, it better amortizes the cost of a computer, especially an expensive one.

Source - Wikipedia 

Testing on the Front-end (login machine):

  • The front-end machines (rush and presto) can be used for tests that run for a few minutes and do not use an extensive amount of memory.
  • The maximum amount of time for running tests on the front-end servers is 30 minutes.
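
One way to keep a front-end test inside the 30 minute limit is to wrap it in the GNU coreutils timeout command.  This is only a sketch: run_limited is a hypothetical helper name, sleep 1 stands in for a real test program, and CCR may enforce the limit by other means.

```shell
#!/bin/bash
# Maximum test time on the front end, per the limit stated above.
LIMIT=30m

# run_limited is a hypothetical helper: run a command, stop it at $LIMIT.
run_limited() {
    timeout "$LIMIT" "$@"
    status=$?
    # GNU timeout exits with 124 when it had to stop the command.
    if [ "$status" -eq 124 ]; then
        echo "stopped after $LIMIT"
    else
        echo "finished with exit status $status"
    fi
    return "$status"
}

# Example: a stand-in for a real test program.
run_limited sleep 1
```

Anything that needs more time or memory than this belongs in a batch job rather than on the login machine.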

Batch System:

The compute nodes in all CCR clusters are available in SLURM partitions.  Users submit jobs to request node resources in a partition.  SLURM partitions for general use are in the UB-HPC academic cluster and are labeled: general-compute, debug, gpu, largemem, and viz (available through the OnDemand portal).  The default partition is general-compute.  The partitions available on the industry cluster are labeled: industry (available to industry partners only) and scavenger (available to academic users - jobs must be able to checkpoint).  There are also clusters of nodes purchased by individual labs or departmental groups, which are divided into partitions.  These partitions are available only to users in the groups that own the nodes.  More information about faculty clusters can be found here.  Some faculty groups allow use of their idle nodes, and these are part of the scavenger partition as well.  For more information about using the scavenger partition, please read this article.
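
As a sketch of how a partition is chosen, the example below looks at the debug partition and submits a script to it instead of the default.  The partition name comes from the list above; job.sh is a hypothetical script name.  The commands run only if the SLURM tools and the script are both present.

```shell
#!/bin/bash
# Partition name taken from the list above; job.sh is a hypothetical script.
PARTITION=debug
JOB_SCRIPT=job.sh

if command -v sbatch >/dev/null 2>&1 && [ -f "$JOB_SCRIPT" ]; then
    # Show the state of the nodes in the chosen partition.
    sinfo -p "$PARTITION"
    # Submit to that partition instead of the default (general-compute).
    sbatch --partition="$PARTITION" "$JOB_SCRIPT"
else
    echo "would run: sbatch --partition=$PARTITION $JOB_SCRIPT"
fi
```

The same choice can be made inside the job script with a line of the form #SBATCH --partition=debug.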

SLURM provides scalability and performance. It can manage and allocate the compute nodes for large clusters. SLURM can accept up to 1,000 jobs a second.  

SLURM Commands

The following is a list of useful commands available for SLURM.  Some of these were built by CCR to allow easier reporting for users.

For usage information for these commands, use --help (example: sinfo --help)

Use the Linux command 'man' for more information about most of these commands (example:  man sinfo)

Bold-italicized font on the commands below indicates user-supplied information.  Brackets indicate optional flags.

List SLURM commands

View information about SLURM nodes & partitions

sinfo [-p partition_name or -M cluster_name]

List example SLURM scripts

ls -p /util/slurm-scripts | less

Submit a job script for later execution

sbatch script-file

Cancel a pending or running job

scancel jobid

Check the state of a user’s jobs

squeue --user=username

Allocate compute nodes for interactive use

Run a command on allocated compute nodes

srun

Display node information

snodes [node cluster/partition state]

Launch an interactive job

fisbatch [various sbatch options]
See fisbatch KB article

List priorities of queued jobs

See job priority KB article

Get the efficiency of a running job

sueff user-name

Get SLURM accounting information for a user’s jobs from start date to now

suacct start-date user-name
See job history & accounting KB article

Get SLURM accounting and node information for a job

slist jobid
See job history & accounting KB article

Get resource usage and accounting information for a user’s jobs from start date to now

slogs start-date user-list

Get estimated starting times for queued jobs

stimes [various squeue options]
See job status KB article

Monitor performance of a SLURM job

/util/ccrjobvis/slurmjobvis jobid
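
The three core commands (sbatch, squeue, scancel) are often used together.  The sketch below wraps them in a hypothetical helper, submit_and_watch, that submits a script, captures the job ID, shows the job's state, and then cancels it.  It is meant as a dry run of the commands, not as a recommended workflow.

```shell
#!/bin/bash
# Submit a script, capture its job ID, check its state, then cancel it.
# submit_and_watch is a hypothetical helper name.
# Usage: submit_and_watch job.sh   (job.sh is a hypothetical script)
submit_and_watch() {
    local script=$1 jobid
    # --parsable prints only the job ID, optionally followed by ";cluster".
    jobid=$(sbatch --parsable "$script") || return 1
    # Drop the ";cluster" suffix if one is present.
    jobid=${jobid%%;*}
    echo "submitted job $jobid"
    # Show the state of this one job.
    squeue --jobs="$jobid"
    # Remove the job from the queue (or stop it if it already started).
    scancel "$jobid"
}
```

Capturing the ID with --parsable avoids having to parse the usual "Submitted batch job" message.  In normal use you would leave out the scancel line and let the job run.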

Submitting SLURM jobs