RESOLVED: 6/4/24: Cooling outage in the center affecting many services

This issue has been resolved. Due to the volatility of the Panasas scratch filesystem, any files that are missing after this outage unfortunately cannot be recovered. As a reminder, this filesystem will be set to read-only on June 13 at 9am and removed from service on June 25. If you have not already moved your workflow to your group's new global scratch directory in /vscratch/grp-[YourGroupName], please do so as soon as possible. More details can be found here.
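For anyone who still needs to copy data to the new scratch directory, the following is a minimal sketch of one way to do it, not an official CCR procedure. The function name copy_to_new_scratch and both example paths are hypothetical; this notice does not give the Panasas scratch path, so substitute your own source directory and your group's directory under /vscratch. It assumes Python 3.8 or later.

```python
import shutil
from pathlib import Path


def copy_to_new_scratch(old_dir: str, new_dir: str) -> int:
    """Copy everything under old_dir into new_dir.

    Returns the number of regular files present in new_dir afterwards.
    """
    src = Path(old_dir)
    dst = Path(new_dir)
    # dirs_exist_ok=True (Python 3.8+) merges into new_dir if it already exists.
    shutil.copytree(src, dst, dirs_exist_ok=True)
    return sum(1 for p in dst.rglob("*") if p.is_file())


# Example call (placeholder paths, replace both before running):
# copy_to_new_scratch("/path/to/your/old/scratch/project",
#                     "/vscratch/grp-[YourGroupName]/project")
```

Check the copied files before relying on them, since the old filesystem will no longer be available after it is removed from service.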


ISSUE:

There was an unplanned chilled water outage early this morning, causing temperatures to rise in the CCR machine room. Many services at CCR are not in production at this time. We are working with UB Facilities to determine whether the cooling situation is resolved, and we are triaging the service outages. We will work as quickly as possible to bring systems back online once we have confirmation that the cooling issues have been resolved. Thank you for your patience.


Updates will be posted here when available.


KNOWN SERVICES AFFECTED:

- Panasas scratch: Service restored

- OnDemand: Service restored

- Compute nodes (both clusters): Service restored

- Login nodes: Service restored


NOTE: Some running jobs ended prematurely because the Panasas mount hung, which put the affected nodes into a bad state.


Last updated: 6/4/24 5:25pm