September 2021: Monthly Maintenance Downtime (9/28/21)

Date of downtime: Tuesday, September 28, 2021

Approximate time of outage: 7am-5pm

Resources affected by downtime:

UB-HPC cluster (all partitions) 

Industry cluster (all partitions)

Faculty cluster (all partitions)

Portals: WebMO, OnDemand, ColdFront

What will be done:  

  • Operating system updates and reboot of all cluster nodes, front-end login nodes (vortex1/2, transfer) and OnDemand
  • Slurm update 
  • OnDemand job monitoring integration with Grafana
  • CUDA update
  • Mellanox and OPA updates
  • Several services running in the Lake Effect research cloud will be migrated to the new cloud and will be offline for much of the day. These include OnDemand, ColdFront, and the Industry Slurm controller.

Jobs will be held in queue during the maintenance downtime and will run after the updates are complete.  

You may get this error when submitting a Slurm script after this update:

"sbatch: error: Batch job submission failed: Pathname of a file, directory or other parameter too long"

This means your Slurm batch script is larger than 1 MB. You must reduce the size of the file before you can submit the job.
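
To check a script before submitting it, a small helper along these lines may be useful. This is only a sketch: the function name, the constant, and the file name in the usage note are hypothetical, and it assumes "1 MB" means 1,000,000 bytes, since the exact threshold Slurm enforces is not stated in this notice.

```python
import os

# Assumption: "1 MB" is treated as 1,000,000 bytes. The exact limit
# enforced by Slurm is not given in this notice, so adjust if needed.
ASSUMED_LIMIT_BYTES = 1_000_000


def script_too_large(path, limit_bytes=ASSUMED_LIMIT_BYTES):
    """Return True if the file at `path` is larger than `limit_bytes`."""
    return os.path.getsize(path) > limit_bytes
```

For example, script_too_large("myjob.sh") returns True when the hypothetical file myjob.sh exceeds the assumed limit. Scripts usually grow this large because data is embedded directly in them; moving that data into a separate file that the script reads is one way to shrink them.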

If you have any questions or concerns, please e-mail