In an effort to maximize the use of all available cores within our center, we are providing access to all faculty cluster nodes whenever they are idle. This means that academic users will have access to idle nodes on the industry cluster as well as nodes on the faculty cluster.



How does this work?

Clusters are broken up into several partitions that users can request to have their jobs run on. All academic users have access to all partitions in the UB-HPC cluster; however, they do not normally have access to the industry cluster or to partitions purchased by faculty groups or departments. A scavenger partition has been created on each cluster; it allows jobs to run when there are no other pending jobs in the compute partitions on that cluster.


Once a user with access to a specific cluster submits a job requesting resources, jobs in the scavenger partition are stopped and re-queued.  This means if you're running a job in the scavenger partition on the industry cluster and an industry user submits a job requiring the resources you're consuming, your job will be stopped.
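Because a re-queued job is later started again from the top of its batch script, it can help for the script to know whether it is on its first run or a restart. A minimal sketch, assuming SLURM exports the SLURM_RESTART_COUNT environment variable for re-queued jobs on these clusters (the function name is illustrative only):

```shell
#!/bin/sh
# Returns success (0) when this job has been restarted at least once.
# Assumes SLURM exports SLURM_RESTART_COUNT for re-queued jobs; when the
# variable is unset (first run, or outside SLURM) the count is treated as 0.
is_requeued() {
    [ "${SLURM_RESTART_COUNT:-0}" -gt 0 ]
}

if is_requeued; then
    echo "Restarted ${SLURM_RESTART_COUNT} time(s); resuming from saved state"
else
    echo "First run; starting from scratch"
fi
```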



Requirements:

  • Your jobs MUST be able to checkpoint; otherwise you'll lose any work you've done when your jobs are stopped and re-queued. Please see our documentation about checkpointing here.
  • You must be an advanced user who understands the queuing system, runs efficient jobs, and can get checkpointing working on your jobs independently. CCR staff cannot devote the time to helping you write your code.
  • If your jobs cause problems on any of the private cluster nodes and we receive complaints from the owners of those nodes, your access to the scavenger partitions will be removed.
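To make the checkpointing requirement concrete, here is a minimal sketch of application-level checkpointing in a shell loop. It is only an illustration, not CCR's recommended method: the file name checkpoint.txt, the variables CKPT and TOTAL_STEPS, and the function name are all hypothetical, and the "work" is a placeholder comment.

```shell
#!/bin/sh
# Hypothetical checkpoint location and amount of work; adjust for a real job.
CKPT="${CKPT:-checkpoint.txt}"
TOTAL_STEPS="${TOTAL_STEPS:-5}"

# Resume from the last completed step if a checkpoint exists, else start at 0.
run_with_checkpoint() {
    step=$(cat "$CKPT" 2>/dev/null)
    step=${step:-0}
    while [ "$step" -lt "$TOTAL_STEPS" ]; do
        # ... one unit of real work would go here ...
        step=$((step + 1))
        # Write to a temporary file and rename it, so a job stopped
        # mid-write cannot leave a truncated checkpoint behind.
        echo "$step" > "$CKPT.tmp" && mv "$CKPT.tmp" "$CKPT"
    done
    echo "finished at step $step"
}
```

Calling run_with_checkpoint from the batch script means a re-queued job repeats at most the one step that was in progress when it was stopped.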



How to Run Preemptible Scavenger Jobs on Private Clusters:


There are several commands available for viewing and using the scavenger partitions.  These commands will help determine which nodes are idle and match the user's desired criteria, submit an appropriate SLURM batch script to request these nodes, and display the scavenger partition queues.  

The commands include:



scavenger-profiler and scavenger-profiler-all


The scavenger-profiler command will display the types of nodes which are currently IDLE and available to run jobs. It shows the number of nodes, the number of cores, the amount of memory and any associated node constraints (e.g. Infiniband [IB] or CPU type). The scavenger-profiler-all command shows the same information but for all nodes in the scavenger partition, not just the idle ones. For example:


[ccruser@vortex1:~]$ scavenger-profiler-all

Available node types:

# OF NODES | # CORES | MEMORY (mb) | TAGS
======================================================
4 | 32 | 187000 | UBHPC&OPA&CPU-Gold-6130&INTEL&V100&u25
1 | 32 | 365000 | UBHPC&OPA&CPU-Gold-6130&INTEL&P100&u22
8 | 32 | 256000+ | UBHPC&IB&CPU-E7-4830&INTEL
1 | 32 | 256000 | UBHPC&IB&CPU-X7550&INTEL
4 | 32 | 187000 | UBHPC&OPA&CPU-Gold-6130&INTEL&V100&u22
1 | 32 | 365000 | UBHPC&OPA&CPU-Gold-6130&INTEL&P100&u25
2 | 32 | 256000 | UBHPC&CPU-E7-4830&INTEL
8 | 32 | 256000 | UBHPC&CPU-6132HE&AMD
26 | 32 | 187000+ | UBHPC&OPA&CPU-Gold-6130&INTEL&u22
4 | 32 | 187000 | UBHPC&OPA&CPU-Gold-6130&INTEL&V100&u23
25 | 32 | 187000+ | UBHPC&OPA&CPU-Gold-6130&INTEL&u23
4 | 32 | 187000 | UBHPC&OPA&CPU-Gold-6130&INTEL&V100&u24
25 | 32 | 187000+ | UBHPC&OPA&CPU-Gold-6130&INTEL&u24
26 | 32 | 187000+ | UBHPC&OPA&CPU-Gold-6130&INTEL&u25
1 | 32 | 187000 | UBHPC&CPU-Gold-6130&INTEL&V100
12 | 12 | 31000 | FACULTY&CPU-E5-2640&INTEL
10 | 40 | 384000 | FACULTY&CPU-Gold-6138&INTEL
2 | 8 | 48000 | FACULTY&IB&CPU-E5520&INTEL
74 | 24 | 64000+ | FACULTY&CPU-E5-2650v4&INTEL
5 | 64 | 256000 | FACULTY&CPU-6272&AMD
21 | 32 | 187000+ | FACULTY&CPU-Gold-6130&INTEL
3 | 64 | 256000 | FACULTY&CPU-6274&AMD
16 | 12 | 128000+ | FACULTY&CPU-Gold-6126&INTEL
3 | 36 | 250000 | FACULTY&CPU-E5-2695v4&INTEL
23 | 8 | 23000 | FACULTY&IB&CPU-L5520&INTEL
16 | 32 | 128000 | FACULTY&IB&CPU-6320&AMD
18 | 16 | 128000 | FACULTY&IB&CPU-E5-2660&INTEL
1 | 32 | 256000 | FACULTY&IB&CPU-E7-4830&INTEL
2 | 24 | 128000 | FACULTY&IB&CPU-E5-2650v4&INTEL
34 | 20 | 64000+ | FACULTY&IB&CPU-E5-2650v3&INTEL
1 | 12 | 128000 | FACULTY&CPU-E5-2430v2&INTEL
1 | 32 | 187000 | FACULTY&CPU-Gold-6130&INTEL&P4000
50 | 24 | 183000 | FACULTY&CPU-Gold-6126&IB&OPA&INTEL
2 | 32 | 92000 | FACULTY&CPU-Gold-5218&INTEL
2 | 24 | 92000 | FACULTY&CPU-Gold-5118&INTEL
4 | 12 | 48000 | FACULTY&CPU-E5645&INTEL
4 | 8 | 32000 | FACULTY&CPU-E5620&INTEL
2 | 8 | 32000 | FACULTY&CPU-E5520&INTEL
2 | 8 | 15500 | FACULTY&CPU-E5420&INTEL
3 | 4 | 31000 | FACULTY&CPU-E3-1220v2&INTEL
31 | 8 | 10500 | FACULTY&CPU-L5520&INTEL&IB
23 | 8 | 23000 | FACULTY&IB&CPU-E5620&INTEL
14 | 16 | 64000 | FACULTY&CPU-4386&AMD
4 | 12 | 31000 | FACULTY&CPU-4162EE&AMD
13 | 12 | 48000 | FACULTY&CPU-E5-2430&INTEL
8 | 20 | 124000 | FACULTY&CPU-E5-2630v4&INTEL
9 | 32 | 187000 | FACULTY&CPU-Gold-6130&INTEL&V100
3 | 24 | 42000 | FACULTY&CPU-6176SE&AMD
43 | 12 | 128000 | FACULTY&CPU-E5-2620v3&INTEL
16 | 16 | 64000 | FACULTY&CPU-E5-2630v3&INTEL
4 | 40 | 187000 | FACULTY&CPU-Gold-6230&INTEL
1 | 16 | 30000 | FACULTY&CPU-E5-2620v4&INTEL
72 | 16 | 64000 | INDUSTRY&CPU-E5-2650v2&INTEL
144 | 16 | 64000 | INDUSTRY&IB&CPU-E5-2650v2&INTEL
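With this many node types, it can be handy to narrow the list to those that satisfy a job's minimum requirements. The following is a rough sketch that assumes the profiler output keeps the four-column "nodes | cores | memory | tags" layout shown above; filter_node_types is a hypothetical helper, not a CCR-provided command.

```shell
#!/bin/sh
# Keep only node-type rows with at least $1 cores and $2 MB of memory.
# Header and separator lines are skipped because their first column is
# not a positive number. A trailing "+" on the memory column is ignored
# by awk's numeric conversion.
filter_node_types() {
    awk -F'|' -v min_cores="$1" -v min_mem="$2" '
        NF == 4 && $1 + 0 > 0 && $2 + 0 >= min_cores && $3 + 0 >= min_mem { print }
    '
}

# Possible usage (requires the CCR command to be on the PATH):
#   scavenger-profiler | filter_node_types 16 64000
```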


To use a specific set or subset of nodes, use the following in the submitted SLURM script:

--constraint=CPU-E5-2650v3
or
--constraint=CPU-E5-2650v3|CPU-E5-2630v3|CPU-E5-2620v3
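The second form uses SLURM's "|" operator, which means "any one of these features". If the list of acceptable CPU types is long, it can be generated rather than typed. A small sketch follows; build_constraint is a hypothetical helper name.

```shell
#!/bin/sh
# Join all arguments with "|" to form an OR-style SLURM constraint.
# The body runs in a subshell so the IFS change does not leak to the caller.
build_constraint() (
    IFS='|'
    printf '%s%s\n' '--constraint=' "$*"
)

build_constraint CPU-E5-2650v3 CPU-E5-2630v3 CPU-E5-2620v3
```

Inside a batch script the resulting text goes after "#SBATCH" unchanged. If it is ever passed on a command line instead, it must be quoted, because an unquoted "|" is interpreted by the shell as a pipe.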


scabatch

The scabatch command is very similar to sbatch and should be used accordingly. However, SLURM scripts submitted using scabatch should not specify a partition, qos, or cluster argument; these fields will be set by scabatch.
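An existing sbatch script can be checked for these fields before it is handed to scabatch. The sketch below is only an assumption about what such a check might look like: check_scabatch_script is a hypothetical helper, and it looks only at "#SBATCH" lines using the standard sbatch option names (--partition/-p, --qos/-q, --clusters/-M).

```shell
#!/bin/sh
# Succeeds (0) if the script in $1 has no partition, qos, or cluster
# directive; otherwise prints a warning to stderr and returns 1.
check_scabatch_script() {
    if grep -Eq '^#SBATCH[[:space:]]+(--partition|--qos|--clusters?|-p|-q|-M)([=[:space:]]|$)' "$1"; then
        echo "$1: remove partition/qos/cluster directives before using scabatch" >&2
        return 1
    fi
    return 0
}

# Possible usage:
#   check_scabatch_script slurm.script && scabatch slurm.script
```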



For example, create a SLURM script with your requirements (called slurm.script in this example):


#!/bin/sh
#SBATCH --time=01:15:00
#SBATCH --nodes=1
#SBATCH --ntasks-per-node=4
#SBATCH --mem=6000
#SBATCH --job-name="scavenger_job_1"
#SBATCH --output=scavenger_job_1.out
#SBATCH --mail-type=ALL
#SBATCH --constraint=CPU-4386
#SBATCH --requeue

echo "Start of Job"

do_my_work.sh

echo "End of Job"


Submit the job using:

scabatch slurm.script


The scabatch command will look for nodes which match your request. It will search the ub-hpc, industry and faculty clusters.


[ccruser@vortex:~]$ scabatch slurm.script

You asked for the following:

#SBATCH --nodes=1
#SBATCH --mem=6000
#SBATCH --ntasks-per-node=4
#SBATCH --constraint=CPU-4386

Checking the ub-hpc cluster for appropriate node type....

SORRY - No appropriate nodes available on the ub-hpc cluster, but this is what is available:

# OF NODES | # CORES | MEMORY (mb) | TAGS
======================================================
3 | 32 | 754000 | UBHPC,OPA,CPU-Gold-6130,INTEL,u22
1 | 32 | 754000 | UBHPC,OPA,CPU-Gold-6130,INTEL,u23
1 | 32 | 754000 | UBHPC,OPA,CPU-Gold-6130,INTEL,u24
2 | 32 | 754000 | UBHPC,OPA,CPU-Gold-6130,INTEL,u25
1 | 32 | 187000 | UBHPC,CPU-Gold-6130,INTEL,V100

Checking the industry cluster for appropriate node type....

SORRY - No appropriate nodes available on the industry cluster, but this is what is available:

# OF NODES | # CORES | MEMORY (mb) | TAGS
======================================================
24 | 16 | 64000 | INDUSTRY,CPU-E5-2650v2,INTEL
47 | 16 | 64000 | INDUSTRY,IB,CPU-E5-2650v2,INTEL


Checking the faculty cluster for appropriate node type....
Submitted batch job 484442 on cluster faculty
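Since the job may land on any of the three clusters, a wrapper script will usually want to record both the job ID and the cluster name from the last line of this output. A minimal sketch, assuming the "Submitted batch job <id> on cluster <name>" wording shown above; parse_submission is a hypothetical helper.

```shell
#!/bin/sh
# Read scabatch output on stdin and print "<jobid> <cluster>" for the
# submission line; all other lines are ignored.
parse_submission() {
    sed -n 's/^Submitted batch job \([0-9][0-9]*\) on cluster \([^ ]*\)$/\1 \2/p'
}

# Possible usage:
#   set -- $(scabatch slurm.script | parse_submission)
#   echo "job $1 is on cluster $2"    # later: scancel -M "$2" "$1"
```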


scavq

The scavq command is used to view your jobs. It shows their status on each cluster and prints the scancel command to use if you need to delete them.


[ccruser@vortex1:~]$ scavq all
Jobs for everyone:
CLUSTER: ub-hpc
JOBID PARTITION NAME USER ST TIME NODES NODELIST(REASON)

To delete your jobs above use: scancel -M ub-hpc <jobid>

Jobs for everyone:
CLUSTER: industry
JOBID PARTITION NAME USER ST TIME NODES NODELIST(REASON)
2921756 scavenger TD10 sghaneei PD 0:00 3 (JobHeldUser)
4791410 scavenger CaNaH14-5x36 nishagen R 1:22:54 1 cpn-m25-03-02
4791409 scavenger CaNaH14-7x45 nishagen R 1:23:45 1 cpn-m25-03-01
4791408 scavenger CaNaH14-7x50 nishagen R 1:24:42 1 cpn-m25-02-02
4791407 scavenger CaNaH14-7x50 nishagen R 1:25:22 1 cpn-m25-02-01
4791406 scavenger CaNaH14-9x99 nishagen R 1:26:19 1 cpn-m25-01-02
4791405 scavenger CaNaH14-9x99 nishagen R 1:26:36 1 cpn-m25-01-01
4791424 scavenger CaNaH23-150- nishagen R 1:06:58 1 cpn-m25-12-02
4791419 scavenger CaNaH21-150- nishagen R 1:12:33 1 cpn-m25-10-01

To delete your jobs above use: scancel -M industry <jobid>

Jobs for everyone:
CLUSTER: faculty
JOBID PARTITION NAME USER ST TIME NODES NODELIST(REASON)
2724236 scavenger CaNaH12-7x43 nishagen R 1:37:23 4 cpn-p25-[11-14]
2724235 scavenger CaNaH12-7x43 nishagen R 1:38:18 4 cpn-p25-[04,08-10]
2724233 scavenger CaNaH10-Cmm nishagen R 1:47:23 3 cpn-f11-[03-05]
2724232 scavenger CaNaH10-Cm nishagen R 1:52:26 3 cpn-p25-[05-07]
2724244 scavenger stucot.slurm xyu36 R 54:05 1 cpn-f15-33

To delete your jobs above use: scancel -M faculty <jobid>
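The per-cluster listing can be condensed into a quick count of jobs by state. The sketch below assumes the column layout shown above (cluster named on a "CLUSTER:" line, partition in the second column, state in the fifth) and job names without spaces; count_scavenger_jobs is a hypothetical helper.

```shell
#!/bin/sh
# Read scavq output on stdin and print "<cluster> <state> <count>" lines.
count_scavenger_jobs() {
    awk '
        $1 == "CLUSTER:"   { cluster = $2; next }          # remember current cluster
        $2 == "scavenger"  { count[cluster " " $5]++ }     # tally by cluster and state
        END { for (k in count) print k, count[k] }
    ' | sort
}

# Possible usage:
#   scavq all | count_scavenger_jobs
```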


scavenger-checker and scavenger-checker-all

These commands are very similar to the scavenger-profiler and scavenger-profiler-all commands. The information is just displayed differently. The scavenger-checker command will display a list of all available nodes on each cluster and includes the number of cores, memory, SLURM tags (e.g. CPU type), and number of nodes per type.

The scavenger-checker-all command displays all nodes, not just the available ones, in the scavenger partitions on all the clusters.


[ccruser@vortex1:~]$ scavenger-checker

Available nodes on the ub-hpc cluster:

# OF NODES | # CORES | MEMORY (mb) | TAGS
======================================================
3 | 32 | 754000 | UBHPC&OPA&CPU-Gold-6130&INTEL&u22
1 | 32 | 754000 | UBHPC&OPA&CPU-Gold-6130&INTEL&u23
1 | 32 | 754000 | UBHPC&OPA&CPU-Gold-6130&INTEL&u24
2 | 32 | 754000 | UBHPC&OPA&CPU-Gold-6130&INTEL&u25
1 | 32 | 187000 | UBHPC&CPU-Gold-6130&INTEL&V100

Available nodes on the industry cluster:

# OF NODES | # CORES | MEMORY (mb) | TAGS
======================================================
24 | 16 | 64000 | INDUSTRY&CPU-E5-2650v2&INTEL
47 | 16 | 64000 | INDUSTRY&IB&CPU-E5-2650v2&INTEL

Available nodes on the faculty cluster:

# OF NODES | # CORES | MEMORY (mb) | TAGS
======================================================
7 | 8 | 23000 | FACULTY&IB&CPU-L5520&INTEL
36 | 24 | 64000+ | FACULTY&CPU-E5-2650v4&INTEL
4 | 12 | 48000 | FACULTY&CPU-E5645&INTEL
4 | 8 | 32000 | FACULTY&CPU-E5620&INTEL
2 | 8 | 32000 | FACULTY&CPU-E5520&INTEL
2 | 8 | 15500 | FACULTY&CPU-E5420&INTEL
3 | 4 | 31000 | FACULTY&CPU-E3-1220v2&INTEL
11 | 12 | 31000 | FACULTY&CPU-E5-2640&INTEL
31 | 8 | 10500 | FACULTY&CPU-L5520&INTEL&IB
16 | 8 | 23000 | FACULTY&IB&CPU-E5620&INTEL
14 | 16 | 64000 | FACULTY&CPU-4386&AMD
12 | 32 | 128000 | FACULTY&IB&CPU-6320&AMD
4 | 12 | 31000 | FACULTY&CPU-4162EE&AMD
12 | 12 | 48000 | FACULTY&CPU-E5-2430&INTEL
4 | 20 | 124000 | FACULTY&CPU-E5-2630v4&INTEL
11 | 32 | 187000 | FACULTY&CPU-Gold-6130&INTEL
4 | 32 | 187000 | FACULTY&CPU-Gold-6130&INTEL&V100
2 | 64 | 256000 | FACULTY&CPU-6274&AMD
3 | 24 | 42000 | FACULTY&CPU-6176SE&AMD
41 | 12 | 128000 | FACULTY&CPU-E5-2620v3&INTEL
15 | 16 | 64000 | FACULTY&CPU-E5-2630v3&INTEL
3 | 20 | 64000 | FACULTY&IB&CPU-E5-2650v3&INTEL
4 | 40 | 187000 | FACULTY&CPU-Gold-6230&INTEL
1 | 16 | 30000 | FACULTY&CPU-E5-2620v4&INTEL
3 | 24 | 187000 | FACULTY&CPU-Gold-6126&INTEL
1 | 24 | 92000 | FACULTY&CPU-Gold-5118&INTEL
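Because this output is grouped by cluster, it is straightforward to total the idle cores on each one (nodes multiplied by cores per node). A rough sketch, assuming the "Available nodes on the <name> cluster:" headings and four-column rows shown above; total_cores_per_cluster is a hypothetical helper.

```shell
#!/bin/sh
# Read scavenger-checker output on stdin and print "<cluster> <cores>".
total_cores_per_cluster() {
    awk -F'|' '
        /^Available nodes on the / {
            split($0, words, " ")      # fifth word is the cluster name
            cluster = words[5]
            next
        }
        NF == 4 && $1 + 0 > 0 { cores[cluster] += $1 * $2 }
        END { for (c in cores) print c, cores[c] }
    ' | sort
}

# Possible usage:
#   scavenger-checker | total_cores_per_cluster
```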



Remember, your jobs are running on idle nodes purchased by another faculty member or group and your job can be interrupted at any point!