RESOLVED: January 8, 2021: Issues with Slurm slowness and job failures on UB-HPC cluster

UPDATE - 1/14/21 1pm:  It has been more than 24 hours since the patch was applied, and the system remains stable.  In fact, at the time of this update the UB-HPC cluster is fully utilized with 0 idle processors.  Please report any problems you encounter to ccr-help.


UPDATE - 1/13/21 11am:  We have applied a patch from the vendor and have allowed the queued jobs to begin running.  We are actively monitoring the cluster for problems and providing data to the vendor.  If any problems are encountered, we will put a reservation in place again to block new jobs from starting.


UPDATE - 1/13/21 8am:  There was some unresponsiveness on the faculty cluster overnight; it is working again.  We have applied a patch provided by the vendor to the UB-HPC Slurm controller, and the controller is now more responsive.  However, fewer than 1,000 jobs are currently running, so the load is low.  We are keeping the block in place to prevent new jobs from starting.  We will continue testing with the vendor before releasing the jobs in the queue.


UPDATE - 1/12/21 8pm:  We are seeing increasing problems with the Slurm job scheduler for the UB-HPC (academic) cluster.  The scheduler is currently crashing and will not stay running for long.  We have put a reservation in place to prevent any new jobs from starting.  We are working with the vendor to determine the cause of the problems and hope to have a resolution as soon as possible.
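
If you want to confirm whether the block is still in place, the standard Slurm client commands can show it.  This is a general illustration only, assuming the usual Slurm tools (scontrol, squeue) are in your path; the reservation name reported by Slurm will vary:

scontrol show reservation          # lists active reservations, including any block on new jobs
squeue -u $USER --states=PD        # lists your own jobs that are pending while the block is in place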


UPDATE - 1/12/21 4pm:  We are continuing to work with the Slurm vendor to troubleshoot the problems we're seeing on the UB-HPC (academic) cluster.  We appreciate that this is inconveniencing many users, and we are working as quickly as possible to resolve the problems.  Thank you for your patience!


UPDATE - 1/11/21:  We investigated the network, storage, nodes, and job scheduler over the weekend and believe the cause of these problems is the Slurm upgrade.  A Slurm bug is the most likely cause.  We are communicating with the Slurm vendor to get a workaround, patch, and/or suggestions for mitigating the problem.  Thank you for your patience.


Types of errors you may be seeing:


srun: Job xxxxx step creation temporarily disabled, retrying (Socket timed out on send/recv operation)

srun: Job xxxxx step creation still disabled, retrying (Socket timed out on send/recv operation)

srun: Job xxxxx step creation still disabled, retrying (Requested nodes are busy)

slurmstepd: error: *** JOB xxxxx  ON cpn-xxx-xx CANCELLED AT 2021-01-08T16:26:25 DUE TO TIME LIMIT ***
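
If your job reported one of these errors and you want to check its current status, the standard Slurm queue and accounting commands can help.  This is a general sketch, not a CCR-specific procedure; the job ID below is a placeholder, and the --format fields are just one reasonable selection:

squeue -u $USER                                                   # jobs still pending or running under your user
sacct -j xxxxx --format=JobID,JobName,State,ExitCode,Elapsed      # final state of a completed, failed, or cancelled job

Jobs that show up in sacct as failed or cancelled will generally need to be resubmitted once the scheduler is stable again.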


1/8/21:

We're aware of general issues with Slurm job scheduler commands being slow to respond.  We have also had reports of random job failures.  We're investigating these problems but have not yet determined the cause.  We apologize for the inconvenience.