June 2018 CENTERWIDE Downtime - week of 6/25 : Center for Computational Research

Dori Sajdak

UPDATE: 6/27 12PM - The GPFS updates are going well but will take longer than expected, so we have pushed the GPFS testing back to Friday morning. Users can now run jobs through Wednesday and Thursday; the clusters will then go offline on Friday for GPFS testing. Use the --time option in your SLURM script to request less run time than the 72-hour default; otherwise your job will remain pending until the clusters return to service after Friday's testing.

UPDATE: 6/26/18 10:30PM - Our maintenance is on schedule, and we have opened up large portions of the clusters for short jobs that will complete before 7am Thursday, 6/27. Use the --time option in your SLURM script to request less than the 72-hour default; otherwise your job will remain pending until after Thursday. Nodes that are currently offline will be worked on and returned to production by Thursday afternoon or sooner.

The June downtime will take place over several days during the week of June 25-29, 2018. Many updates are being made to ensure the continued successful operation of all CCR infrastructure. The basic schedule is as follows (barring any unforeseen issues):

Monday, June 25th - All Clusters & Portals Down and GPFS unavailable - DONE

Tuesday, June 26th - All Clusters & Portals Down and GPFS unavailable - DONE

Wednesday, June 27th - Clusters & Portals Up for short (<48 hour) Jobs - AVAILABLE

Thursday, June 28th - Clusters & Portals Up for short (<48 hour) Jobs - AVAILABLE

Friday, June 29th - All Clusters & Portals Offline for GPFS testing

During the June downtime we will be upgrading the core software stack on the GPFS storage subsystem. This is a disruptive upgrade that will take up to 4 days to complete and requires us to take GPFS offline at several points during the process. The upgrade is currently scheduled for June 25-28. We realize that having CCR compute resources completely unavailable for this long may be a considerable inconvenience to some users, so we are working with the vendor to make GPFS available for some short periods during the upgrade. Although we cannot guarantee how long GPFS will stay up continuously, we encourage users who need to run jobs during this time frame to request the shortest wall time they can. This gives the scheduler adequate time to complete the jobs within these short windows of availability. NOTE: the default wall time is 72 hours on the academic & industry clusters and up to 30 days on the faculty clusters, so users must set a shorter limit with --time in their SLURM scripts; otherwise jobs will not run.
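As a rough illustration of picking a wall time that fits inside an availability window, the sketch below computes the largest whole-minute limit that ends before a cutoff and prints the matching submit command. It assumes GNU date (for the -d option) and SLURM's --time option, which accepts a plain number of minutes. The function name fit_walltime, the script name job.sh, the 30-minute safety margin, and the 7:00 cutoff on Friday are all hypothetical values chosen for the example, not announced times.

```shell
#!/bin/bash
# Print the largest whole-minute wall time that still ends before a cutoff.
# Arguments: now (epoch seconds), cutoff (epoch seconds), safety margin (minutes).
fit_walltime() {
    local now=$1 cutoff=$2 margin_min=${3:-30}
    local avail_min=$(( (cutoff - now) / 60 - margin_min ))
    if [ "$avail_min" -le 0 ]; then
        echo "no time left before cutoff" >&2
        return 1
    fi
    echo "$avail_min"
}

# Hypothetical example values: submitting at noon on 6/27 with an
# assumed cutoff of 7:00 on Friday 6/29 (the real start time may differ).
now=$(date -d "2018-06-27 12:00" +%s)
cutoff=$(date -d "2018-06-29 07:00" +%s)
limit=$(fit_walltime "$now" "$cutoff" 30)

# Print the command instead of running it; job.sh is a placeholder name.
echo "sbatch --time=${limit} job.sh"
```

With these example values the window is 43 hours, so the computed limit is 2550 minutes, well under the 72-hour default. In practice, replace the fixed "now" with $(date +%s) and use the cutoff from the latest update.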

Beginning Monday, June 25th, ALL clusters will be offline and GPFS will be unavailable on all servers for a few hours. A performance benchmark will be run on the GPFS system before it is upgraded. Once this is done, CCR system administrators will upgrade the SLURM scheduling software and update all nodes in all clusters to CentOS 7.5. This process will most likely take the rest of Monday and all day Tuesday. As nodes complete the update process, they will be brought back online for use. However, on Friday, the clusters will be offline again for GPFS performance testing. If all goes well with the GPFS upgrade, we anticipate having the clusters back online by Friday afternoon. Of course, unanticipated issues with these updates could occur, so we recommend that you plan for the clusters being offline for the full week.

SLURM CHANGES: Users should specify both --partition and --qos in their batch scripts. Currently, if you do not specify a qos, one is added for you before your job is submitted to the scheduler. Once we upgrade SLURM, this will no longer happen: if you don't specify a qos, your job will be rejected by the scheduler. Most of the time, the partition and qos are the same, e.g. --partition=debug --qos=debug. However, there are cases where they differ. Please see this article for more details: https://ubccr.freshdesk.com/solution/articles/13000049957-valid-qos-required
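A minimal batch script with these directives might look like the sketch below. The debug partition and qos pair comes from the example above; the 15-minute limit, node count, and job name are illustrative assumptions, so adjust them to your own job and cluster.

```shell
#!/bin/bash
#SBATCH --partition=debug
#SBATCH --qos=debug
#SBATCH --time=00:15:00
#SBATCH --nodes=1
#SBATCH --job-name=qos-example

# SLURM sets these variables inside a running job; the fallbacks
# let the script also run outside the scheduler for a quick check.
summary="job=${SLURM_JOB_ID:-none} partition=${SLURM_JOB_PARTITION:-none}"
echo "$summary"
```

Lines beginning with #SBATCH are read by sbatch at submission and are ordinary comments to the shell. The same options can also be given on the command line, where they take precedence over the directives in the script.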

DMTCP checkpoint/restart functionality: Based on our testing, we've determined that DMTCP checkpointing will not work with the latest version of the scheduling software, SLURM. The DMTCP developers have been alerted to the problem and are currently investigating. A fix, if one can be found, is not expected until the end of summer.