March 2016 - Maintenance Downtime

Date of downtime: Monday and Tuesday, March 14-15, 2016

Approximate time of outage: 7am-5pm

Resources affected by downtime:

UB-HPC cluster (general-compute, debug, viz, largemem, and gpu partitions, plus all PI partitions: chemistry, physics, mae)

Industry cluster (compute, scavenger partitions)


The March downtime will be pushed back to coincide with UB’s spring break and will begin on Monday, March 14th at 7am. This will be a center-wide downtime affecting all servers, clusters, and partitions. Now that we have transitioned the home and project directories to the new Isilon storage, it is time to rebuild the GPFS scratch filesystem. This rebuild has been needed for quite some time and will, we believe, alleviate the problems we’ve experienced with GPFS over the last year.

Unfortunately, the rebuild requires that all jobs be stopped while we remove the GPFS client from all servers and transition to a temporary scratch location on the Isilon storage (/ifs/scratch). Please note that the Isilon storage does not offer the same throughput as the GPFS scratch space, so users WILL notice a slowdown while using this temporary scratch directory. The GPFS re-installation will take a week, so the transition back to GPFS will take place during the April downtime.


We strongly advise all users to use the following environment variables to ease the transition:

GLOBAL_SCRATCH - currently points to /gpfs/scratch; will temporarily point to /ifs/scratch while the GPFS cluster is rebuilt

LOCAL_SCRATCH - /scratch on all the nodes

UTIL - /util

Using these environment variables instead of hard-coded paths avoids having to modify scripts each time the underlying infrastructure changes.
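
As an illustration, a job script might build its working directory from GLOBAL_SCRATCH rather than from a fixed path. The following is only a minimal sketch: the helper name make_workdir, the per-user directory layout, the run name "example_run", and the fallback to TMPDIR or /tmp when GLOBAL_SCRATCH is unset are all assumptions made for the example, not part of the center's configuration.

```shell
#!/bin/sh
# Sketch: build working directories from the environment variables above
# rather than hard-coding /gpfs/scratch or /ifs/scratch.

# The fallbacks after ":-" are assumptions so the sketch also runs off-cluster.
scratch_base="${GLOBAL_SCRATCH:-${TMPDIR:-/tmp}}"

# make_workdir BASE NAME (hypothetical helper):
# create BASE/<user>/NAME if it does not exist and print its path.
make_workdir() {
    dir="$1/${USER:-unknown}/$2"
    mkdir -p "$dir" && printf '%s\n' "$dir"
}

workdir="$(make_workdir "$scratch_base" "example_run")"
echo "Job files will be written under: $workdir"
```

Because the path comes from the variable, a script written this way should not need editing when GLOBAL_SCRATCH is switched to /ifs/scratch and later back to the rebuilt GPFS space.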


********************************************

IMPORTANT: All files in /gpfs/scratch will be deleted! Please plan accordingly and copy anything you need to your home directory, your project directory, or /ifs/scratch BEFORE March 14th! The temporary Isilon scratch space is already available, so users can begin preparing for the transition now.

********************************************
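
One possible way to copy a directory out of /gpfs/scratch ahead of the downtime is sketched below. The helper name save_dir and the example paths are hypothetical placeholders; substitute the directories you actually use, and check the copy before relying on it.

```shell
#!/bin/sh
# Sketch: copy a directory out of the old scratch space before it is deleted.

# save_dir SRC DEST_PARENT (hypothetical helper):
# copy SRC into DEST_PARENT, preserving permissions and timestamps,
# and refuse to run if SRC does not exist.
save_dir() {
    if [ ! -d "$1" ]; then
        echo "source directory not found: $1" >&2
        return 1
    fi
    mkdir -p "$2" && cp -Rp "$1" "$2/"
}

# Example call (paths are placeholders, not real locations):
# save_dir "/gpfs/scratch/$USER/results" "/ifs/scratch/$USER"
```

The copy lands in DEST_PARENT under the same directory name as the source, so the example call would produce a results directory inside the destination.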


While the center is offline, we will complete operating system updates on any nodes that have not already been updated and update our other infrastructure servers to the latest software. We anticipate that these updates will take 1.5-2 days. Although not every node in every partition will be down that long, we encourage you to plan for a two-day downtime, just to be safe.