In an effort to maximize the use of all cores within our center, we are providing access to nearly all cluster nodes whenever they are idle.  This means that academic users will have access to idle nodes on the industry cluster as well as most of the faculty clusters.



How does this work?

All clusters are broken up into several partitions on which users can request to run their jobs.  All academic users have access to all partitions in the UB-HPC cluster; however, they do not have access to the industry cluster or to the private clusters purchased by faculty groups or departments.  A scavenger partition has been created on all of these clusters, which allows jobs to run when there are no other pending jobs in the compute partitions on those clusters.


Once a user with access to a specific cluster submits a job requesting resources that are in use by scavenger jobs, those scavenger jobs are stopped and re-queued.  This means that if you're running a job in the scavenger partition on the industry cluster and an industry user submits a job requiring the resources you're consuming, your job will be stopped.
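
If you want to inspect the scavenger partitions with standard SLURM tools as well, the commands below are one way to do it; they assume the multi-cluster setup and the cluster names (industry, chemistry, mae, physics) used in the examples later on this page.

# Show the current state of the scavenger partition on one of the private clusters:
sinfo -M chemistry -p scavenger

# List your own jobs in the scavenger partitions across all of the private clusters:
squeue -M industry,chemistry,mae,physics -p scavenger -u $USER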



Requirements:

  • Your jobs MUST be able to checkpoint; otherwise, you'll lose any work you've done when your jobs are stopped and re-queued.  Please see our documentation about checkpointing.  A minimal sketch of a requeue-tolerant batch script appears after this list.
  • You must be an advanced user who understands the queuing system, runs efficient jobs, and can get checkpointing working on your jobs independently.  CCR staff cannot devote the time to helping you write your code.
  • If your jobs are determined to cause problems on any of the private cluster nodes and we receive complaints from the owners of those nodes, your access to the scavenger partitions will be removed.
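
For illustration only (this is a rough sketch, not CCR's prescribed checkpointing method): a requeue-tolerant batch script typically uses --requeue and, each time it starts, resumes from the newest checkpoint if one exists.  The application name (my_app) and its --resume/--checkpoint-every options below are hypothetical placeholders; see the checkpointing documentation for the tools supported at CCR.

#!/bin/bash
#SBATCH --time=01:00:00
#SBATCH --nodes=1
#SBATCH --ntasks-per-node=4
#SBATCH --requeue                # allow SLURM to requeue this job after preemption

# Hypothetical application and checkpoint file; substitute your own.
CHECKPOINT=checkpoint.dat

if [ -f "$CHECKPOINT" ]; then
    # Requeued run: resume from the last checkpoint written before preemption.
    ./my_app --resume "$CHECKPOINT"
else
    # First run: start from scratch and write checkpoints periodically so
    # little work is lost if the job is preempted.
    ./my_app --checkpoint-every 15m --checkpoint-file "$CHECKPOINT"
fi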



How to Run Preemptible Scavenger Jobs on Private Clusters: 


There are several commands available for viewing and using the scavenger partitions.  These commands will help determine which nodes are idle and match the user's desired criteria, submit an appropriate SLURM batch script to request these nodes, and display the scavenger partition queues.  The commands include:



scavenger-profiler and scavenger-profiler-all


The scavenger-profiler command displays the types of nodes that are currently idle. It shows the number of nodes, the number of cores, the amount of memory, and any associated node constraints (e.g. Infiniband [IB] or CPU type).  Node types are displayed in order from best to worst CPU.  The scavenger-profiler-all command shows the same information but for all nodes in the scavenger partition, not just the idle ones.  For example:


[ccruser@vortex:]$ scavenger-profiler-all
Available node types, ordered from best to worst CPU:

 # OF NODES | # CORES | MEMORY (mb) | TAGS
======================================================
    7 | 20 | 128000 | CHEMISTRY,IB,CPU-E5-2650v3,INT
    27 | 20 | 64000 | MAE,IB,CPU-E5-2650v3,INTEL
    16 | 16 | 64000 | CHEMISTRY,CPU-E5-2630v3,INTEL
    37 | 12 | 128000 | CHEMISTRY,CPU-E5-2620v3,INTEL
    12 | 12 | 128000 | MAE,CPU-E5-2620v3,INTEL
    5 | 4 | 31000 | PHYSICS,CPU-E3-1220v2,INTEL
    19 | 16 | 128000 | CHEMISTRY,IB,CPU-E5-2660,INTEL
    14 | 12 | 48000 | CHEMISTRY,CPU-E5-2430,INTEL
    3 | 12 | 48000 | CHEMISTRY,IB,CPU-E5645,INTEL
    19 | 8 | 7000+ | PHYSICS,IB,CPU-E5420,INTEL
    25 | 8 | 23000+ | CHEMISTRY,IB,CPU-L5520,INTEL
    14 | 16 | 64000 | MAE,CPU-4386,AMD
    14 | 32 | 128000 | MAE,IB,CPU-6320,AMD
    6 | 12 | 31000 | MAE,CPU-4162EE,AMD
    24 | 8 | 23000 | PHYSICS,IB,CPU-E5620,INTEL

To use a specific set or subset of nodes, include one of the following in the submitted SLURM script:

 --constraint=CPU-E5-2650v3

 or 

 --constraint=CPU-E5-2650v3|CPU-E5-2630v3|CPU-E5-2620v3        

  

  


scabatch

The scabatch command is very similar to sbatch and should be used accordingly. However, SLURM scripts submitted using scabatch should not specify a partition, qos, or cluster argument; these fields will be set by scabatch.



For example, create a SLURM script with your requirements (called slurm.script in this example):


#!/bin/sh
#SBATCH --time=01:15:00
#SBATCH --nodes=1
#SBATCH --ntasks-per-node=4
#SBATCH --mem=6000
#SBATCH --job-name="scavenger_job_1"
#SBATCH --output=scavenger_job_1.out
#SBATCH --mail-type=ALL
#SBATCH --constraint=CPU-4386
#SBATCH --requeue

echo "Start of Job"
do_my_work.sh
echo "End of Job"


Submit the job using:

scabatch slurm.script


The scabatch script will look for nodes that match your request, searching the industry, chemistry, MAE, and physics clusters.


[ccruser@vortex:~]$ scabatch slurm.script
 You asked for the following:

 #SBATCH --nodes=1
 #SBATCH --mem=6000
 #SBATCH --ntasks-per-node=4
 #SBATCH --constraint=CPU-4386

 Checking the industry cluster for appropriate node type....

 SORRY - No appropriate nodes available on the industry cluster, but this is what is available:

  # OF NODES | # CORES | MEMORY (mb) | TAGS
 ======================================================
     1 | 8 | 64000 | IB,CPU-E5-2650v2

 Checking the chemistry cluster for appropriate node type....

 SORRY - No appropriate nodes available on the chemistry cluster, but this is what is available:

  # OF NODES | # CORES | MEMORY (mb) | TAGS
 ======================================================
     2 | 4 | 23000 | IB,CPU-L5520
     4 | 6 | 48000 | CPU-E5-2430
     16 | 6 | 128000 | CPU-E5-2620v3
     9 | 8 | 64000 | CPU-E5-2630v3

 Checking the mae cluster for appropriate node type....

 Submitted batch job 267636 on cluster mae



scavq

The scavq command is used to view your scavenger jobs. This allows you to see their status and delete them if needed.


[ccruser@vortex:~]# scavq
Jobs for everyone:
CLUSTER: industry
             JOBID PARTITION NAME USER ST TIME NODES NODELIST(REASON)
           1144149 scavenger FISBATCH ccrtest1 PD 0:00 4 (Resources)
           1146941 scavenger CsH3 ccrtest2 R 1:00:06 1 cpn-m28-38-02

To delete your jobs above use: scancel -M industry <jobid>

Jobs for everyone:
CLUSTER: chemistry
             JOBID PARTITION NAME USER ST TIME NODES NODELIST(REASON)
            363923 scavenger FSEhcp ccruser R 12:14:28 4 cpn-f15-[04-05,11,13]
            363257 scavenger bcc ccruser R 1-23:18:23 1 cpn-p27-09-02
            363262 scavenger bcc ccruser R 1-23:18:23 1 cpn-p27-12-01
            363267 scavenger bcc ccruser R 1-23:18:23 1 cpn-p27-14-02
            363219 scavenger bcc ccruser R 2-09:34:26 1 cpn-p27-08-01
            363221 scavenger bcc ccruser R 2-09:34:26 1 cpn-p27-09-01

To delete your jobs above use: scancel -M chemistry <jobid>

Jobs for everyone:
CLUSTER: mae
             JOBID PARTITION NAME USER ST TIME NODES NODELIST(REASON)
            299022 scavenger FSEhcp ccruser R 6:24:49 2 cpn-f14-[23-24]
            299020 scavenger FSEhcp ccruser R 8:14:31 2 cpn-f14-[07-08]
            299016 scavenger FSEhcp ccruser R 9:21:41 2 cpn-f14-[14,16]
            299013 scavenger FSEfcc ccruser R 12:17:53 2 cpn-f14-[06,13]
            299024 scavenger bcc ccruser R 2:12:32 2 cpn-p28-[07-08]
            299030 scavenger bcc ccruser R 1:57:31 2 cpn-p28-[09-10]
            299009 scavenger convHarm ccruser R 11:37:37 1 cpn-p28-30

To delete your jobs above use: scancel -M mae <jobid>

Jobs for everyone:
CLUSTER: physics
             JOBID PARTITION NAME USER ST TIME NODES NODELIST(REASON)

To delete your jobs above use: scancel -M physics <jobid>



scavenger-checker and scavenger-checker-all

These commands are very similar to the scavenger-profiler and scavenger-profiler-all commands; the information is just displayed differently.  The scavenger-checker command displays a list of all available nodes on each cluster and includes the number of cores, memory, SLURM tags (e.g. CPU type), and number of nodes per type.  The scavenger-checker-all command displays all nodes, not just the available ones, in the scavenger partitions on all the clusters.


[ccruser@vortex:]$ ./scavenger-checker
Available nodes on the industry cluster:

 # OF NODES | # CORES | MEMORY (mb) | TAGS
======================================================

Available nodes on the chemistry cluster:

 # OF NODES | # CORES | MEMORY (mb) | TAGS
======================================================
    25 | 8 | 23000+ | CHEMISTRY,IB,CPU-L5520,INTEL
    1 | 8 | 48000 | CHEMISTRY,IB,CPU-E5520,INTEL
    3 | 12 | 48000 | CHEMISTRY,IB,CPU-E5645,INTEL
    19 | 16 | 128000 | CHEMISTRY,IB,CPU-E5-2660,INTEL
    14 | 12 | 48000 | CHEMISTRY,CPU-E5-2430,INTEL
    37 | 12 | 128000 | CHEMISTRY,CPU-E5-2620v3,INTEL
    16 | 16 | 64000 | CHEMISTRY,CPU-E5-2630v3,INTEL
    7 | 20 | 128000 | CHEMISTRY,IB,CPU-E5-2650v3,INT

Available nodes on the mae cluster:

 # OF NODES | # CORES | MEMORY (mb) | TAGS
======================================================
    14 | 16 | 64000 | MAE,CPU-4386,AMD
    14 | 32 | 128000 | MAE,IB,CPU-6320,AMD
    6 | 12 | 31000 | MAE,CPU-4162EE,AMD
    12 | 12 | 128000 | MAE,CPU-E5-2620v3,INTEL
    27 | 20 | 64000 | MAE,IB,CPU-E5-2650v3,INTEL

Available nodes on the physics cluster:

 # OF NODES | # CORES | MEMORY (mb) | TAGS
======================================================
    5 | 4 | 31000 | PHYSICS,CPU-E3-1220v2,INTEL
    24 | 8 | 23000 | PHYSICS,IB,CPU-E5620,INTEL
    19 | 8 | 7000+ | PHYSICS,IB,CPU-E5420,INTEL


 


Remember, your jobs are running on idle nodes purchased by another faculty member or group, and your job can be interrupted at any point!




Why are faculty groups allowing this access on their nodes?

We have a great group of faculty members who understand the importance of maximizing cycles and enabling research.  Often they only need their nodes for specific chunks of time, such as when they're working on grant proposals or preparing presentations or publications.  During idle times, they're more than happy to share their equipment with other researchers.  We'd like to thank the following faculty for their generous donation of cycles:


Alexey Akimov

Jochen Autschbach

Rajan Batta

Paul Bauman

Sara Behdad

Jason Benedict

Timothy Cook

Paul Desjardin

Michel Dupuis

Johannes Hachmann

John Hall

Margarete Jadamec

Jee Kang

Matthew Knepley

Andrew Murkin

Salvatore Rappoccio

David Salac

Vaikuntanath Samudrala

Puneet Singla

Konstantinos Slavakas

Jose Walteros

Olga Wodo

Peihong Zhang

Eva Zurek

Jaroslaw Zola

UB Genomics & Bioinformatics Core


We really appreciate your willingness to be good citizens!