RESCHEDULED! July 2024: Monthly Maintenance Downtime NOW ON 8/6/24
Dori Sajdak
started a topic
6 months ago
Date of downtime: Tuesday, August 6, 2024
This has been rescheduled to accommodate additional testing needed for the Slurm upgrade
Approximate time of outage: 7am-5pm
Resources affected by downtime:
UB-HPC cluster (all partitions)
Faculty cluster (all partitions)
Portals: OnDemand, ColdFront, IDM
What will be done:
Operating system updates and reboot of all cluster nodes
Updates of front-end login nodes (login1/2, vortex-future) and OnDemand
Infrastructure services updated
Slurm upgrade
Due to the Slurm upgrade, ALL jobs in the queues of the UB-HPC and Faculty clusters will be deleted
Network infrastructure updates
Remove OmniPath (OPA) network from production
Effects on users:
Jobs in the queues will be deleted and will need to be resubmitted after the downtime.
Slurm accounts and associations will be recreated. If you have access to multiple Slurm accounts, this may change which account is set as your default, and it may be different than it was before the downtime. Not all Slurm accounts have access to the same resources, so you may submit a Slurm job and see an "invalid QOS" error. If you have access to multiple Slurm accounts, always use the --account=XXXX Slurm directive in your batch scripts and interactive job requests to specify which account the job should run under. An example is sketched below.
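For example, a batch script header that sets the account explicitly might look like this sketch (the account name "mylab" and the time/node values are placeholders only):

#!/bin/bash -l
#SBATCH --account=mylab
#SBATCH --time=01:00:00
#SBATCH --nodes=1

The same directive works on the command line for interactive requests, for example: salloc --account=mylab --time=01:00:00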
If you use the OPA tag (i.e. --constraint=OPA) in your Slurm scripts, interactive job requests, or OnDemand jobs, it will no longer work. Please switch to the IB tag for the InfiniBand network.
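For example, a job that previously requested the OmniPath network with:

#SBATCH --constraint=OPA

should now request InfiniBand instead:

#SBATCH --constraint=IB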
Please see the updated documentation on interactive jobs to learn how to properly request a node and have your job environment set up correctly.
Earlier this year we began recommending that users change their Slurm scripts to prepare for changes we'd be making to the login and job environments and for a large Slurm upgrade this summer. Those changes will be fully implemented with the completion of this maintenance downtime. If you haven't done so already, please update your Slurm scripts so the first line reads:
#!/bin/bash -l
If you have anything other than this in your scripts, you will see errors like:
- /var/spool/slurmd/jobxxxxxx/slurm_script: /etc/profile.d/99-ccr.sh: [[: not found
- /var/spool/slurmd/jobXXXXX/slurm_script: module: not found
- module command not found
- and other environment issues
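For reference, a minimal sketch of a corrected script header (everything after the first line is an illustrative placeholder, not a required setting):

#!/bin/bash -l
#SBATCH --account=mylab
#SBATCH --time=01:00:00

# module commands work once the login environment is loaded by the -l flag
module load gcc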