In an effort to maximize the use of all cores within our center, we are providing access to nearly all cluster nodes whenever they are idle.  This means that academic users will have access to idle nodes on the industry cluster as well as most of the faculty clusters.



How does this work?

All clusters are divided into several partitions on which users can request their jobs to run.  All academic users have access to all partitions in the UB-HPC cluster; however, they do not have access to the industry cluster or to the private clusters purchased by faculty groups or departments.  A scavenger partition has been created on each of these clusters, which allows jobs to run whenever there are no pending jobs in that cluster's compute partitions.


Once a user with access to a specific cluster submits a job requesting resources, any scavenger-partition jobs occupying those resources are stopped and re-queued.  This means that if you're running a job in the scavenger partition on the industry cluster and an industry user submits a job requiring the resources you're consuming, your job will be stopped and re-queued.



Requirements:

  • Your jobs MUST be able to checkpoint; otherwise, you'll lose any work you've done when your jobs are stopped and re-queued.  Please see our documentation about checkpointing here.  A minimal requeue-friendly script is sketched just after this list.
  • You must be an advanced user who understands the queuing system, runs efficient jobs, and can get checkpointing working in your jobs independently.  CCR staff cannot devote the time to helping you write your code.
  • If your jobs are determined to cause problems on any of the private cluster nodes and we receive complaints from the owners of those nodes, your access to the scavenger partitions will be removed.
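
If you have not run preemptible work before, here is a minimal sketch of a requeue-friendly batch script.  It is not CCR-supported code: do_my_work.sh and its options are hypothetical stand-ins for an application that writes its own periodic checkpoints and resumes from the newest one when it is restarted.  With --requeue set, a preempted job is placed back in the queue and the script simply runs again from the top once idle nodes become available.

#!/bin/bash
#SBATCH --time=01:00:00
#SBATCH --nodes=1
#SBATCH --ntasks-per-node=4
#SBATCH --requeue                 # let SLURM requeue this job instead of cancelling it

# Hypothetical layout: the application writes periodic checkpoints into this
# directory and, when the requeued job starts over from the top of this
# script, resumes from the newest checkpoint it finds there.
CKPT_DIR="$SLURM_SUBMIT_DIR/checkpoints"
mkdir -p "$CKPT_DIR"

./do_my_work.sh --checkpoint-dir "$CKPT_DIR" --resume-latest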



How to Run Preemptible Scavenger Jobs on Private Clusters: 


There are several commands available for viewing and using the scavenger partitions.  These commands will help determine which nodes are idle and match the user's desired criteria, submit an appropriate SLURM batch script to request these nodes, and display the scavenger partition queues.  The commands include:



scavenger-profiler and scavenger-profiler-all


The scavenger-profiler command will display the types of nodes which are currently IDLE and available to run jobs.  It shows the number of nodes, the number of cores, the amount of memory, and any associated node constraints (e.g. Infiniband [IB] or CPU type).  The scavenger-profiler-all command shows the same information but for all nodes in the scavenger partition, not just the idle ones.  For example:


[ccruser@vortex1:~]$ scavenger-profiler-all

Available node types

 # OF NODES | # CORES |  MEMORY (mb)  | TAGS
======================================================

    8       |   32    |    256000+    | UBHPC&IB&CPU-E7-4830&INTEL    
    2       |   32    |    365000     | UBHPC&OPA&CPU-Gold-6130&INTEL&
    1       |   32    |    256000     | UBHPC&CPU-E7-4830&INTEL      
    16      |   32    |    187000     | UBHPC&OPA&CPU-Gold-6130&INTEL&
    102     |   32    |    187000+    | UBHPC&OPA&CPU-Gold-6130&INTEL
    1       |   32    |    256000     | UBHPC&IB&CPU-X7550&INTEL      
    8       |   32    |    256000     | UBHPC&CPU-6132HE&AMD          
    1       |   32    |    187000     | UBHPC&CPU-Gold-6130&INTEL&V100
    10      |   40    |    384000     | MAE&CPU-Gold-6138&INTEL      
    3       |   36    |    250000     | MAE&CPU-E5-2695v4&INTEL      
    30      |   24    |    128000+    | MAE&CPU-E5-2650v4&INTEL      
    14      |   16    |    64000      | MAE&CPU-4386&AMD              
    16      |   32    |    128000     | MAE&IB&CPU-6320&AMD          
    7       |   64    |    256000     | MAE&CPU-6272&AMD              
    19      |   32    |    187000+    | MAE&CPU-Gold-6130&INTEL      
    8       |   32    |    187000     | MAE&CPU-Gold-6130&INTEL&V100  
    3       |   64    |    256000     | MAE&CPU-6274&AMD              
    4       |   24    |    42000+     | MAE&CPU-6176SE&AMD            
    27      |   20    |    64000      | MAE&IB&CPU-E5-2650v3&INTEL    
    1       |   12    |    128000     | MAE&CPU-E5-2430v2&INTEL      
    50      |   24    |    183000     | MAE&CPU-Gold-6126&IB&OPA&INTEL
    2       |   24    |    190000     | MAE&CPU-Gold-6126&INTEL      
    6       |   12    |    31000      | MAE&CPU-4162EE&AMD            
    12      |   12    |    128000     | MAE&CPU-E5-2620v3&INTEL      
    1       |   16    |    30000      | MAE&CPU-E5-2620v4&INTEL      
    37      |   12    |    128000     | CHEMISTRY&CPU-E5-2620v3&INTEL
    23      |   8     |    19000+     | CHEMISTRY&IB&CPU-L5520&INTEL  
    1       |   8     |    48000      | CHEMISTRY&IB&CPU-E5520&INTEL  
    38      |   24    |    64000+     | CHEMISTRY&CPU-E5-2650v4&INTEL
    14      |   12    |    48000      | CHEMISTRY&CPU-E5-2430&INTEL  
    2       |   12    |    48000      | CHEMISTRY&IB&CPU-E5645&INTEL  
    4       |   12    |    48000      | CHEMISTRY&CPU-E5645&INTEL    
    4       |   8     |    32000      | CHEMISTRY&CPU-E5620&INTEL    
    2       |   8     |    32000      | CHEMISTRY&CPU-E5520&INTEL    
    2       |   8     |    15500      | CHEMISTRY&CPU-E5420&INTEL    
    19      |   16    |    128000     | CHEMISTRY&IB&CPU-E5-2660&INTEL
    16      |   16    |    64000      | CHEMISTRY&CPU-E5-2630v3&INTEL
    2       |   24    |    128000     | CHEMISTRY&IB&CPU-E5-2650v4&INT
    7       |   20    |    128000     | CHEMISTRY&IB&CPU-E5-2650v3&INT
    8       |   24    |    187000     | CHEMISTRY&CPU-Gold-6126&INTEL
    2       |   24    |    92000      | CHEMISTRY&CPU-Gold-5118&INTEL
    72      |   16    |    64000      | INDUSTRY&CPU-E5-2650v2&INTEL  
    144     |   16    |    64000      | INDUSTRY&IB&CPU-E5-2650v2&INTE
    8       |   20    |    124000     | PHYSICS&CPU-E5-2630v4&INTEL  
    3       |   4     |    31000      | PHYSICS&CPU-E3-1220v2&INTEL  
    14      |   12    |    31000      | PHYSICS&CPU-E5-2640&INTEL    
    34      |   8     |    8500+      | PHYSICS&CPU-L5520&INTEL&IB    
    23      |   8     |    23000      | PHYSICS&IB&CPU-E5620&INTEL    
    7       |   24    |    128000     | PHYSICS&CPU-E5-2650v4&INTEL  

To use a specific set or subset of nodes, add one of the following to the submitted SLURM script:

    --constraint=CPU-E5-2650v3
    or
    --constraint=CPU-E5-2650v3|CPU-E5-2630v3|CPU-E5-2620v3


  


scabatch

The scabatch command is very similar to sbatch and should be used accordingly.  However, SLURM scripts submitted using scabatch should not specify a partition, QOS, or cluster argument; these fields will be set by scabatch.



For example, create a SLURM script with your requirements (called slurm.script in this example):


#!/bin/sh
#SBATCH --time=01:15:00
#SBATCH --nodes=1
#SBATCH --ntasks-per-node=4
#SBATCH --mem=6000
#SBATCH --job-name="scavenger_job_1"
#SBATCH --output=scavenger_job_1.out
#SBATCH --mail-type=ALL
#SBATCH --constraint=CPU-4386
#SBATCH --requeue

echo "Start of Job"
do_my_work.sh
echo "End of Job"


Submit the job using:

scabatch slurm.script


The scabatch script will look for nodes that match your request, searching the industry, chemistry, MAE, and physics clusters in turn.


[ccruser@vortex:~]$ scabatch slurm.script
You asked for the following:

#SBATCH --nodes=1
#SBATCH --mem=6000
#SBATCH --ntasks-per-node=4
#SBATCH --constraint=CPU-4386

Checking the industry cluster for appropriate node type....
SORRY - No appropriate nodes available on the industry cluster, but this is what is available:

 # OF NODES | # CORES | MEMORY (mb) | TAGS
======================================================
    1 | 8 | 64000 | IB,CPU-E5-2650v2

Checking the chemistry cluster for appropriate node type....
SORRY - No appropriate nodes available on the chemistry cluster, but this is what is available:

 # OF NODES | # CORES | MEMORY (mb) | TAGS
======================================================
    2 | 4 | 23000 | IB,CPU-L5520
    4 | 6 | 48000 | CPU-E5-2430
    16 | 6 | 128000 | CPU-E5-2620v3
    9 | 8 | 64000 | CPU-E5-2630v3

Checking the mae cluster for appropriate node type....
Submitted batch job 267636 on cluster mae


scavq

The scavq command is used to view your jobs.  This will allow you to see their status and delete them if needed (see the example after the listing below).


[ccruser@vortex1:~]$ scavq all
Jobs for everyone:
CLUSTER: ub-hpc
             JOBID  PARTITION            NAME     USER ST       TIME  NODES NODELIST(REASON)

To delete your jobs above use: scancel -M ub-hpc <jobid>

Jobs for everyone:
CLUSTER: industry
             JOBID  PARTITION            NAME     USER ST       TIME  NODES NODELIST(REASON)
           2921756  scavenger            TD10 sghaneei PD       0:00      3 (JobHeldUser)
           4791410  scavenger    CaNaH14-5x36 nishagen  R    1:22:54      1 cpn-m25-03-02
           4791409  scavenger    CaNaH14-7x45 nishagen  R    1:23:45      1 cpn-m25-03-01
           4791408  scavenger    CaNaH14-7x50 nishagen  R    1:24:42      1 cpn-m25-02-02
           4791407  scavenger    CaNaH14-7x50 nishagen  R    1:25:22      1 cpn-m25-02-01
           4791406  scavenger    CaNaH14-9x99 nishagen  R    1:26:19      1 cpn-m25-01-02
           4791405  scavenger    CaNaH14-9x99 nishagen  R    1:26:36      1 cpn-m25-01-01
           4791424  scavenger    CaNaH23-150- nishagen  R    1:06:58      1 cpn-m25-12-02
           4791419  scavenger    CaNaH21-150- nishagen  R    1:12:33      1 cpn-m25-10-01

To delete your jobs above use: scancel -M industry <jobid>

Jobs for everyone:
CLUSTER: chemistry
             JOBID  PARTITION            NAME     USER ST       TIME  NODES NODELIST(REASON)
           2724236  scavenger    CaNaH12-7x43 nishagen  R    1:37:23      4 cpn-p25-[11-14]
           2724235  scavenger    CaNaH12-7x43 nishagen  R    1:38:18      4 cpn-p25-[04,08-10]
           2724233  scavenger     CaNaH10-Cmm nishagen  R    1:47:23      3 cpn-f11-[03-05]
           2724232  scavenger      CaNaH10-Cm nishagen  R    1:52:26      3 cpn-p25-[05-07]
           2724244  scavenger    stucot.slurm    xyu36  R      54:05      1 cpn-f15-33

To delete your jobs above use: scancel -M chemistry <jobid>

Jobs for everyone:
CLUSTER: mae
             JOBID  PARTITION            NAME     USER ST       TIME  NODES NODELIST(REASON)
           1367422  scavenger Na-cryp/03/ADF/ lauraabe PD       0:00      1 (Priority)
           1367423  scavenger Na-cryp/03/ADF/ lauraabe PD       0:00      1 (Priority)
           1367692  scavenger             AgF   ezurek PD       0:00      1 (Priority)
           1367693  scavenger             AgF   ezurek PD       0:00      1 (Priority)
           1367694  scavenger             AgF   ezurek PD       0:00      1 (Priority)
           1367695  scavenger             AgF   ezurek PD       0:00      1 (Priority)
           1362061  scavenger             AgF   ezurek  R    3:12:45      1 cpn-f13-20
           1362062  scavenger             AgF   ezurek  R    3:12:45      1 cpn-f14-31
           1362063  scavenger             AgF   ezurek  R    3:12:45      1 cpn-f14-32
           1362064  scavenger             AgF   ezurek  R    3:12:45      1 cpn-f14-33
           1362067  scavenger             AgF   ezurek  R    3:12:45      1 cpn-f14-36
           1362068  scavenger             AgF   ezurek  R    3:12:45      1 cpn-f14-37
           1362110  scavenger             AgF   ezurek  R    3:08:44      1 cpn-f14-14
           1362069  scavenger             AgF   ezurek  R    3:12:45      1 cpn-f14-38
           1362070  scavenger             AgF   ezurek  R    3:12:45      1 cpn-f14-39
           1362113  scavenger             AgF   ezurek  R    3:08:44      1 cpn-f14-17
           1362114  scavenger             AgF   ezurek  R    3:08:44      1 cpn-f14-18
           1362116  scavenger             AgF   ezurek  R    3:08:44      1 cpn-f14-23
           1362117  scavenger             AgF   ezurek  R    3:08:44      1 cpn-f14-24
           1367108  scavenger Na-cryp/01/ADF/ lauraabe  R      46:34      1 cpn-u28-12
           1367106  scavenger Na-cryp/02/ADF/ lauraabe  R      46:57      1 cpn-u12-17
           1367105  scavenger Na-cryp/02/ADF/ lauraabe  R      47:11      1 cpn-u12-03
           1367104  scavenger Na-cryp/02/ADF/ lauraabe  R      47:26      1 cpn-u27-10
           1367103  scavenger Na-cryp/02/ADF/ lauraabe  R      47:27      1 cpn-u12-13
           1364891  scavenger             AgF   ezurek  R    2:42:40      1 cpn-f13-18
           1367129  scavenger             AgF   ezurek  R    1:16:23      1 cpn-p28-16
           1367309  scavenger             AgF   ezurek  R      56:19      1 cpn-f14-41
           1367314  scavenger             AgF   ezurek  R      46:34      1 cpn-f11-38
           
To delete your jobs above use: scancel -M mae <jobid>

Jobs for everyone:
CLUSTER: physics
             JOBID  PARTITION            NAME     USER ST       TIME  NODES NODELIST(REASON)

To delete your jobs above use: scancel -M physics <jobid>
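
For example, if the pending MAE job 1367422 in the mae listing above were yours, it could be removed with:

scancel -M mae 1367422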



scavenger-checker and scavenger-checker-all

These commands are very similar to the scavenger-profiler and scavenger-profiler-all commands; the information is just displayed differently.  The scavenger-checker command displays a list of the available nodes on each cluster, including the number of cores, memory, SLURM tags (e.g. CPU type), and number of nodes per type.  The scavenger-checker-all command displays not just the available nodes, but all nodes in the scavenger partitions on all the clusters.


[ccruser@vortex1:~]$ scavenger-checker
Available nodes on the ub-hpc cluster:

 # OF NODES | # CORES |  MEMORY (mb)  | TAGS
======================================================
    1       |   32    |    512000     | UBHPC&IB&CPU-E7-4830&INTEL                        
    1       |   32    |    256000     | UBHPC&IB&CPU-X7550&INTEL                          
    8       |   32    |    256000     | UBHPC&CPU-6132HE&AMD                              
    13      |   32    |    187000     | UBHPC&OPA&CPU-Gold-6130&INTEL&V100                
    14      |   32    |    754000     | UBHPC&OPA&CPU-Gold-6130&INTEL                    
    1       |   32    |    365000     | UBHPC&OPA&CPU-Gold-6130&INTEL&P100                
    1       |   32    |    187000     | UBHPC&CPU-Gold-6130&INTEL&V100                    

Available nodes on the industry cluster:

 # OF NODES | # CORES |  MEMORY (mb)  | TAGS
======================================================
    47      |   16    |    64000      | INDUSTRY&CPU-E5-2650v2&INTEL                      
    143     |   16    |    64000      | INDUSTRY&IB&CPU-E5-2650v2&INTEL                  

Available nodes on the mae cluster:

 # OF NODES | # CORES |  MEMORY (mb)  | TAGS
======================================================
    5       |   12    |    31000      | MAE&CPU-4162EE&AMD                                
    12      |   12    |    128000     | MAE&CPU-E5-2620v3&INTEL                          
    1       |   16    |    30000      | MAE&CPU-E5-2620v4&INTEL                          

Available nodes on the chemistry cluster:

 # OF NODES | # CORES |  MEMORY (mb)  | TAGS
======================================================
    21      |   8     |    19000+     | CHEMISTRY&IB&CPU-L5520&INTEL                      
    2       |   12    |    48000      | CHEMISTRY&IB&CPU-E5645&INTEL                      
    24      |   24    |    64000+     | CHEMISTRY&CPU-E5-2650v4&INTEL                    
    4       |   12    |    48000      | CHEMISTRY&CPU-E5645&INTEL                        
    4       |   8     |    32000      | CHEMISTRY&CPU-E5620&INTEL                        
    2       |   8     |    32000      | CHEMISTRY&CPU-E5520&INTEL                        
    2       |   8     |    15500      | CHEMISTRY&CPU-E5420&INTEL                        
    19      |   16    |    128000     | CHEMISTRY&IB&CPU-E5-2660&INTEL                    
    7       |   12    |    48000      | CHEMISTRY&CPU-E5-2430&INTEL                      
    35      |   12    |    128000     | CHEMISTRY&CPU-E5-2620v3&INTEL                    
    15      |   16    |    64000      | CHEMISTRY&CPU-E5-2630v3&INTEL                    
    2       |   24    |    128000     | CHEMISTRY&IB&CPU-E5-2650v4&INTEL                  
    7       |   20    |    128000     | CHEMISTRY&IB&CPU-E5-2650v3&INTEL                  
    3       |   24    |    187000     | CHEMISTRY&CPU-Gold-6126&INTEL                    
    2       |   24    |    92000      | CHEMISTRY&CPU-Gold-5118&INTEL                    

Available nodes on the physics cluster:

 # OF NODES | # CORES |  MEMORY (mb)  | TAGS
======================================================
    3       |   4     |    31000      | PHYSICS&CPU-E3-1220v2&INTEL                      
    12      |   12    |    31000      | PHYSICS&CPU-E5-2640&INTEL                        
    32      |   8     |    8500+      | PHYSICS&CPU-L5520&INTEL&IB                        
    23      |   8     |    23000      | PHYSICS&IB&CPU-E5620&INTEL                        
    5       |   20    |    124000     | PHYSICS&CPU-E5-2630v4&INTEL                      
    7       |   24    |    128000     | PHYSICS&CPU-E5-2650v4&INTEL  


Remember, your jobs run on idle nodes purchased by another faculty member or group and can be interrupted at any point!


Why are faculty groups allowing this access on their nodes?

We have a great group of faculty members who understand the importance of maximizing cycles and enabling research.  Often they only need their nodes at specific times, such as when they're working on grant proposals or preparing for presentations or publications.  During their idle times, they're more than happy to share their equipment with other researchers.  We'd like to thank the following faculty for their generous donation of cycles:


Alexey Akimov

Jochen Autschbach

Rajan Batta

Paul Bauman

Sara Behdad

Jason Benedict

Timothy Cook

Paul Desjardin

Michel Dupuis

Johannes Hachmann

John Hall

Margarete Jadamec

Jee Kang

Matthew Knepley

Andrew Murkin

Salvatore Rappoccio

David Salac

Vaikuntanath Samudrala

Puneet Singla

Konstantinos Slavakas

Jose Walteros

Olga Wodo

Peihong Zhang

Eva Zurek

Jaroslaw Zola

UB Genomics & Bioinformatics Core


We really appreciate your willingness to be good citizens!