September 2021: Monthly Maintenance Downtime (9/28/21)
Dori Sajdak started a topic about 3 years ago
Date of downtime: Tuesday, September 28, 2021
Approximate time of outage: 7am-5pm
Resources affected by downtime:
UB-HPC cluster (all partitions)
Industry cluster (all partitions)
Faculty cluster (all partitions)
Portals: WebMO, OnDemand, ColdFront
What will be done:
Operating system updates and reboot of all cluster nodes, front-end login nodes (vortex1/2, transfer) and OnDemand
Slurm update
OnDemand job monitoring integration with Grafana
CUDA update
Mellanox and OPA updates
Several services running in the Lake Effect research cloud will be migrated to the new cloud and will be offline for much of the day. These include OnDemand, ColdFront, and the Industry Slurm controller.
Jobs will be held in queue during the maintenance downtime and will run after the updates are complete.
You may get this error when submitting a Slurm script after this update:
"sbatch: error: Batch job submission failed: Pathname of a file, directory or other parameter too long"
This means your Slurm batch script is larger than 1 MB. You must reduce the size of the file before you can submit the job.
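If you want to check a script before submitting it, a quick size test along these lines may help. This is only a sketch: the function name and the constant are made up for illustration, and it assumes the 1 MB limit means 1024 * 1024 bytes, which may not match the exact threshold the scheduler enforces.

```python
import os

# Assumption: the "1 MB" limit is read as 1 MiB (1024 * 1024 bytes).
# The exact threshold enforced by the scheduler may differ.
ASSUMED_LIMIT_BYTES = 1024 * 1024


def script_too_large(path, limit_bytes=ASSUMED_LIMIT_BYTES):
    """Return True if the file at `path` is larger than `limit_bytes`."""
    return os.path.getsize(path) > limit_bytes
```

For example, script_too_large("job.sh") returns True when job.sh (a placeholder file name) is over the assumed limit, in which case the script would need to be trimmed before running sbatch.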
If you have any questions or concerns, please e-mail ccr-help_at_buffalo.edu