8/29/22: RESOLVED: Problems with storage affecting many things

8/29/22 4:30pm:  Tentatively marking this as resolved.  Please report any issues you may encounter to CCR Help so we can investigate.


8/29/22 10:30am:  The storage remains stable; however, the clusters are still under much lower loads than normal.  Feel free to submit jobs and report any issues you may encounter to CCR Help so we can investigate.  


8/28/22 8am:  The storage remains stable with no signs of the issues we've seen in the last 5 days.  However, the clusters remain under much lower loads than normal.   Please report any issues you encounter to ccr-help so we can troubleshoot. 


8/27/22 12pm:  We have not seen any issues in the last 12 hours or so.  However, the clusters are under a much lower load than normal, so the problems may simply be hidden and could crop up again as users resume running jobs.  Please report any issues you encounter to ccr-help so we can troubleshoot.


8/26/22 2:45pm:  The storage vendor support tech has spent hours with CCR's storage admin investigating the latency issues we're seeing.  No apparent cause, such as an obvious hardware failure, has been found so far, so they continue to investigate.  Please know we're doing all we can and this is our highest priority as it affects everything at CCR.  Thanks for your patience!


8/26/22 11am:  We are still experiencing slowdowns, most noticeably when loading modules.  We're investigating.


8/25/22 3pm: The storage has been stable since late this morning.  However, sys admins and vendor support are still monitoring.  Please report any issues to CCR Help as they might not be immediately apparent to us.


8/25/22 8am:  Unfortunately, the storage problems persist.  We are investigating.


8/24/22 - 3:30pm: We are tentatively marking this as resolved.  The vendor completed a storage update to fix the problem we were hitting and rebooted all the controllers.  All of the previously reported issues appear to be back to normal.  Some older nodes in the two clusters need to be rebooted to clear out lingering issues, and we're working on that now.  For the most part, you should not experience any more problems.  CCR admins and the storage vendor support admins are monitoring the systems.  Please report any further problems to CCR Help.  Thank you for your patience today!


Just to clarify - the problem described in this alert can affect pretty much everything at CCR.  The primary storage supports our home and project directories, software installations, the Slurm job scheduler, etc.  Sometimes you'll get an error just trying to log in, sometimes jobs won't start or will start but then immediately cancel, and sometimes you'll get an error that the cluster you're trying to submit to can't be reached.  We're working on it!  Apologies for the inconvenience.


8/24/22 8:30am:  We continue to see intermittent issues on nodes and servers throughout the center.  This is currently affecting the faculty cluster to the point where it is unavailable.  We're working to address the problem.


8/24/22 8am:  The storage vendor support team worked with CCR's storage admin through the night and provided a workaround for this problem.  Responsiveness has improved on the majority of our servers.  Their engineering team is working to test a fix for the problem that cropped up last night.  Once ready, they will apply it on our systems.


We can confirm there is an issue with slow response times and "lagginess" with the home and project directories.  This is causing slow logins, module issues, and problems with reading and writing files in these directories.  This problem can cause jobs to fail if they depend on modules loading.  For example, the Jupyter Notebook, Matlab, and RStudio apps in OnDemand will time out when the modules can't load.  It is unclear what is causing this issue, as all storage hardware and monitoring look fine.  We have opened a support case with the storage vendor to help troubleshoot the problem.  This alert will be updated when more information is known.

