June 2018 CENTERWIDE Downtime - week of 6/25

UPDATE: 6/27 12PM - The GPFS updates are going well but will take longer than expected.  We have pushed back the GPFS testing to Friday morning.  Users will now be able to run jobs through Wednesday and Thursday; the clusters will then be offline on Friday for GPFS testing.  Use the --time option in your SLURM script to request a walltime shorter than the default of 72 hours, or your job will remain pending until after Friday's testing is complete.
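
For example, a shorter time limit can also be supplied when you submit, in which case it overrides whatever limit is set inside the script (the 4-hour value and script name below are placeholders only):

    sbatch --time=04:00:00 myscript.sh   # request a 4-hour walltime instead of the 72-hour default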



UPDATE: 6/26/18 10:30PM - Our maintenance is on schedule and we have opened up large portions of the clusters for short jobs that will complete before 7am Thursday, 6/28.  Use the --time option in your SLURM script to specify a walltime shorter than the default of 72 hours or your job will remain pending until after Thursday.  Nodes currently offline will be worked on and put back into production by Thursday afternoon or sooner.


The June downtime will take place over several days during the week of June 25-29, 2018.  There are many updates being done to ensure the continued successful operation of all CCR infrastructure.  The basic schedule will be as follows (barring any unforeseen issues):

Monday, June 25th - All Clusters & Portals Down and GPFS unavailable - DONE

Tuesday, June 26th - All Clusters & Portals Down and GPFS unavailable - DONE

Wednesday, June 27th - Clusters & Portals Up for short (<48 hour) Jobs - AVAILABLE

Thursday, June 28th - Clusters & Portals Up for short (<48 hour) Jobs - AVAILABLE

Friday, June 29th - All Clusters & Portals Offline for GPFS testing

During the June downtime we will be upgrading the core software stack on the GPFS storage subsystem.  This is a disruptive upgrade which will take up to 4 days to complete and requires us to take GPFS offline at several points during the upgrade process.  This upgrade is currently scheduled for June 25-28.  We realize that CCR compute resources being completely unavailable for this amount of time may be a considerable inconvenience to some users, so we are working with the vendor to make GPFS available for some short time periods during the upgrade process.  Although we cannot guarantee the amount of continuous time it will be up, we encourage users who need to run jobs during this time frame to submit jobs with the shortest walltime possible.  This will ensure the scheduler has adequate time to complete the jobs in these short windows of availability.  NOTE: the default walltime on the academic & industry clusters is 72 hours and on the faculty clusters it is up to 30 days, so users must specify --time in their SLURM scripts to request a shorter limit; otherwise jobs will not run during these windows.
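
As a rough sketch, a batch script requesting a 6-hour walltime might look like the following; the job name, partition, qos, and program shown here are placeholders only, so substitute the values you normally use:

    #!/bin/bash
    #SBATCH --job-name=short_job            # placeholder job name
    #SBATCH --partition=general-compute     # example partition; use your usual one
    #SBATCH --qos=general-compute           # matching qos (see SLURM CHANGES below)
    #SBATCH --time=06:00:00                 # 6-hour walltime instead of the 72-hour default
    #SBATCH --nodes=1
    #SBATCH --ntasks-per-node=1

    # Replace with your actual workload
    srun ./my_program

Submit the script with sbatch as usual; assuming a maintenance reservation is in place, the scheduler will only start jobs whose requested time fits before the next offline period.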


Beginning Monday, June 25th, ALL clusters will be offline and GPFS will be unavailable on all servers for a few hours.  A performance benchmark will be run on the GPFS system before it is upgraded.  Once this is done, CCR system administrators will upgrade the SLURM scheduling software and update all nodes in all clusters to CentOS 7.5.  This process will most likely take the rest of Monday and all day Tuesday.  As nodes complete the update process, they will be brought back online for use.  However, on Friday, the clusters will be offline again for GPFS performance testing.  If all goes well with the GPFS upgrade, we anticipate having the clusters back online by Friday afternoon.  Of course, unanticipated issues with these updates could occur, so we recommend you plan for a full week of the clusters being offline.

SLURM CHANGES:  Users should be specifying both --partition and --qos in their batch scripts.  Currently, if you do not specify a qos, one is added for you before your job gets submitted to the scheduler.  Once we upgrade SLURM, this will no longer happen: if you don't specify a qos, your job will be rejected by the scheduler.  Most of the time, the partition and qos names are the same, e.g. --partition=debug --qos=debug, but there are cases where they differ.  Please see this article for more details:  https://ubccr.freshdesk.com/solution/articles/13000049957-valid-qos-required
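
As an illustration, assuming the debug partition/qos pairing mentioned above, the relevant lines of a batch script would be:

    #SBATCH --partition=debug    # partition to run in
    #SBATCH --qos=debug          # qos must now be set explicitly; it will no longer be added for you

The same options can also be given at submission time, e.g. sbatch --partition=debug --qos=debug myscript.sh.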


DMTCP checkpoint/restart functionality:  Based on our testing, we've determined that DMTCP checkpointing will not work with the latest version of the scheduling software, SLURM. The DMTCP developers have been alerted to the problem and are currently investigating.  A fix, if one can be found, is not expected until the end of summer.