During the June 30, 2020 maintenance downtime, the academic (UB-HPC) cluster partition layout will change.



What's happening?

Currently, the cluster nodes are separated into many partitions: debug, viz, general-compute, skylake, cascade, gpu, and largemem.  After the downtime, the general-compute partition will contain all of the nodes from the skylake, cascade, gpu, and largemem partitions, and those partitions and their associated QOS values will be deleted.  The debug and viz partitions will remain the same.
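
If you want to confirm the new layout for yourself after the downtime, standard Slurm tooling on the login nodes can show it.  This is just a quick sketch; the output format string is only an example:

    # Show the general-compute partition's time limit, node count, and the
    # feature tags attached to each group of nodes:
    sinfo -p general-compute -o "%P %l %D %f"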



Why are we doing this?

There are historical reasons for the separation into many partitions.  However, thanks to improvements in the scheduling software over time, our ability to tag resources so that users can request specific hardware to run on, and our ability to monitor cluster usage in minute detail, we've decided to merge most of these partitions into one.  This should improve the efficiency of the scheduler, encourage users to request only the resources they need, and decrease wait times for some jobs.



How does this affect me?

  1. Any jobs pending in the queue at the start of the downtime that are directed to any of the partitions being deleted will be removed.  You will need to resubmit them.
  2. You will need to update your scripts to remove any partitions and QOS values that we're deleting.  Most jobs should now be directed to the general-compute partition.  Since this is the default partition on the academic cluster, you do not need to specify it.
  3. If you have access to any of the priority boost QOS values (nih, mri, supporters), you may use them on the general-compute partition.
  4. If you care what type of CPU your jobs run on, you will need to update your scripts to include the --constraint Slurm directive and specify the CPU type, as shown in the sketch after this list.  More details are linked below.
  5. Slurm features are a way to specify exact hardware requests for your jobs.  This is especially useful for CPU and GPU types, fast network interconnects, and even for singling out hardware purchased under specific grants (NIH, MRI).  More details are linked below.
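
Here is a minimal sketch of what a post-downtime batch script might look like.  The resource numbers, QOS name, feature tag, and program name are placeholders only; substitute your own values and the tags documented on the Slurm Features page:

    #!/bin/bash
    #SBATCH --time=24:00:00
    #SBATCH --nodes=1
    #SBATCH --ntasks-per-node=8
    #SBATCH --mem=32G
    # No --partition line is needed; general-compute is the default on the
    # academic cluster after the downtime.
    # Optional priority boost, only if your group has access to one:
    ##SBATCH --qos=supporters
    # Pin the job to a CPU type with a Slurm feature; the tag below is a
    # made-up example, so check the Slurm Features page for the real tags:
    #SBATCH --constraint=skylake

    # Replace with your own commands:
    srun ./my_program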


More details on the academic cluster partitions


Detailed Hardware Specs by Node Type


Slurm Features


How To Request Specific Hardware When Running Slurm Jobs


Using snodes command to see what's available




PRO TIP:
You can start using Slurm features right NOW!  Partition and QOS will still be required for the skylake, cascade, gpu, and largemem partitions until after the June 30th downtime.  See the example below.
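
For example, until the downtime a job that needs to land on specific hardware keeps its existing partition and QOS lines and simply adds a feature constraint on top.  The QOS and feature names here are illustrative; use the ones that apply to your allocation and hardware:

    # Until June 30th: partition and QOS are still required for these nodes.
    #SBATCH --partition=skylake
    #SBATCH --qos=skylake
    # You can add a feature constraint now; after the downtime, remove the two
    # lines above and the constraint alone will target the same hardware.
    # (The tag is an example; check the Slurm Features page for the real tags.)
    #SBATCH --constraint=skylake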




My jobs are waiting longer now that you did this!

Remember that after any downtime, the scheduler needs time to work through the backlog of jobs, and this can take up to the full 3 days of cycle time.  This downtime we're changing a lot, so wait times may be "off" for the first week or two.  We have set up extensive monitoring on the scheduler, so we know exactly what the average wait times on all the partitions were BEFORE the downtime, and we will actively monitor the changes AFTER the downtime.  Cluster usage is never exactly the same from week to week, so it won't be a perfect comparison, but we will tweak the scheduler if we find the wait times for certain jobs are unacceptable.

You are more than welcome to contact CCR Help if you think there is a problem with your jobs, and we can investigate whether there are ways to tweak them to get them started faster.  Keep in mind the #1 mistake users make when submitting jobs is requesting the full 72 hours of wall time and only using a fraction of it.  If your jobs only run for 1 hour or 1 day, do yourself a favor and only request that amount in your job script!
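
As a concrete example, if a job reliably finishes in about a day, requesting just that much wall time gives the scheduler far more room to start it early than asking for the full 72 hours would (the time value here is only an example):

    # Request only the wall time the job actually needs:
    #SBATCH --time=24:00:00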


We encourage you to check out recordings from previous workshops that explain in detail about these topics:

1. Introduction to submitting jobs on the CCR clusters (Password: CCR-HPC2020) 

2. Job Submission Strategies & Utilizing Idle Nodes (Password: CCR-HPC2020) 



I used to be able to submit 1000 jobs per partition; now I can only submit to one partition

We are considering increasing the maximum job limit for users on the general-compute partition.  However, with all the other changes being made to the cluster this month, we've decided to hold off on doing so.  If this is causing problems for your research, please contact CCR Help and we'll discuss whether raising the limit sooner is possible.  Thank you for your understanding!