1/29 at 1:15pm - all clusters and servers back online
Date of downtime: Monday, January 28, 2019 beginning at 12pm, possibly lasting through Wednesday, January 30
NOTE: Change of date to accommodate filesystem issue
Approximate time of outage: see details below
Resources affected by downtime:
UB-HPC cluster (general-compute, debug, viz, largemem, and gpu partitions)
Industry cluster (compute, scavenger partitions)
Faculty clusters (all partitions in MAE, Chemistry, Physics)
Portals: WebMO, OnDemand
Storage: GPFS, Budget data
What will be done: Reboot of all cluster nodes. GPFS filesystem check and repair.
We are seeing signs of possible filesystem anomalies and/or corruption in one particular directory on GPFS. Although we do not think this has spread beyond that one instance, we want to be proactive in preventing further corruption, which could lead to data loss. As a result, we have rescheduled our monthly downtime from 2/5 to 1/28 so that we can run a proper diagnostic and repair on /gpfs. This maintenance requires that we unmount /gpfs from all CCR assets, beginning at 12pm Monday, January 28th, so that the storage subsystem is in a quiesced state; this ensures the best possible outcome. At this time, we are not sure how long the diagnostic and repair will take, as there are millions of files on GPFS and scanning the integrity of them all takes quite a while. The downtime could last up to two full days.
Jobs will be held in the queue during the maintenance downtime.
If you have any questions or concerns, please e-mail ccr-help_at_buffalo.edu