4/23-26: RESOLVED - Problems with CCR storage systems

4/26 2:30pm:  We continue to investigate unusual performance problems on the Isilon storage. We are leaving jobs in the queue until they can be resolved.


4/26 1:15pm:  

Thank you very much for your patience this week!  All storage systems and clusters are back in production, and we have opened up logins to the academic & industry cluster front ends.  The slow logins and poor performance users experienced yesterday on the front ends were caused by users changing their jobs from running on /gpfs/scratch to running off of the Isilon home and project directories.

We are requesting that all users who've changed their job scripts to use the home (/user) or project (/project) directories DELETE any pending jobs, change their scripts to use /gpfs/scratch or local /scratch on the compute nodes, and then resubmit.  This will help keep the load on the Isilon storage (home & project directories) lower and allow the system to remain responsive for logins, as it was designed to do.  We will open up the clusters to begin running jobs again at 2pm today, once we see that job scripts have been modified.  The May 1st downtime is cancelled.  Please report any problems to ccr-help@buffalo.edu or use our help portal:  https://ubccr.freshdesk.com
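
For reference, here is a minimal sketch of the requested change, assuming a Slurm-style scheduler (the 'CG' job state mentioned further down is a Slurm state); the job ID, script name, and scratch directory layout below are placeholders, so adjust them to your own workflow:

    # From a front end: find and cancel your pending jobs that write to /user or /project
    squeue -u $USER -t PENDING        # list your queued jobs
    scancel <jobid>                   # cancel each affected job
    sbatch myjob.slurm                # resubmit after editing the script as below

Inside the job script itself, direct working files back to GPFS scratch (or node-local /scratch) rather than your home or project directory:

    #!/bin/bash
    #SBATCH --job-name=myjob
    SCRATCHDIR=/gpfs/scratch/$USER/$SLURM_JOB_ID   # hypothetical layout; use your group's convention
    mkdir -p "$SCRATCHDIR"
    cd "$SCRATCHDIR"
    # ... run your application here, writing output under $SCRATCHDIR ...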


For those who would like more details, here is what happened and what we did:


Last week we began noticing problems with the GPFS storage.  Corrective actions were taken Thursday and seemed to have resolved the problems.  On Friday, users with files on GPFS created before Thursday reported they could not access them.  Additional actions were taken to bring the metadata servers back online, which made those files accessible again.  Friday night we began seeing problems with jobs not completing properly: they were leaving the compute nodes in a 'CG' state and not clearing out, so new jobs could not start.  The only way to clear them was to reboot the nodes, but the problem persisted.  We then started seeing problems with GPFS again and engaged the vendor's support team to work on them.

Saturday night we began seeing problems with logins on the front ends, and the Isilon home and project directories were very slow.  A drive was failing in the Isilon, which the system is designed to handle, but while data was migrated off to another drive, this seemed to slow things down considerably.  Once the migration finished, logins to the front ends improved.

Monday morning we began hearing that older files on GPFS were inaccessible again, though newly created files were fine.  The vendor suggested we bring down the GPFS system to run a filesystem check on it.  Many services at CCR rely on GPFS and it is mounted on all the compute nodes, so the process to take it offline takes several hours.  We felt we could keep jobs running on the clusters, though, because many people do not use /gpfs/scratch.  We worked to shut down the GPFS system and began the filesystem checks Tuesday; they completed overnight and into Wednesday morning, and all affected files were flagged and repaired.

We began running tests on our development cluster first thing Wednesday to make sure GPFS was production ready before putting it back on the other clusters.  By 11am Wednesday morning, Isilon system utilization was close to 100%, making the systems nearly unusable.  Users could not log in because the home directories were unresponsive, and admins could not continue work because our commands were hanging as well.  At this point, we made the decision to stop all jobs and completely quiesce the system so we could bring all systems back online and get all the compute nodes into a healthy state.  Once we stopped the running jobs, the systems became much more responsive.  We suspect that because GPFS was unavailable, users modified their jobs to write to Isilon (IFS) instead, and the Isilon system was having difficulty keeping up with the load it was under.  IFS is designed to be a robust place to store data rather than a performance-based filesystem.

Our regular maintenance tasks were completed Wednesday afternoon and all systems were tested overnight Wednesday.  Disk repair processes are still in progress on the Isilon, but performance does not appear to be affected; we are monitoring diligently and will act at the first sign of any adverse impact.  The budget storage system was also impacted in the last few days, but filesystem checks have been completed and any affected files were flagged and have been fixed.





4/26 12pm:  We continue to investigate performance issues on the Isilon storage and cannot open the clusters up until they are resolved.


4/26 10am:  We're currently testing all storage and cluster systems and are preparing to open up the clusters by noon.


4/25 4:00pm:  Once we stopped the running jobs, the systems became much more responsive. We suspect that because GPFS was unavailable, users modified their jobs to write to IFS instead of GPFS, and the Isilon system was having difficulty keeping up with the load it was under. IFS is designed to be a more robust place to keep data rather than a performance-based filesystem. We ask that any users who changed their job scripts to use the IFS (/projects, /users) directories delete those jobs from the queue and revert back to GPFS. We are currently testing the systems now and will open everything back up once we are sure things are good.
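
If you are not sure whether any of your queued jobs were switched over to IFS, something like the following can help you spot them (again assuming Slurm; the script location and search pattern are only illustrative):

    squeue -u $USER -t PENDING -o "%i %j %o"     # job id, name, and the script/command it will run
    grep -lE '/projects|/user' ~/jobs/*.slurm    # hypothetical script directory; adjust to where you keep yours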



4/25 1:45pm:  The good news is that the GPFS file system was recovered and files that were previously inaccessible are now in a usable state. However, due to other issues encountered with other storage systems, we have made the decision to move the maintenance downtime scheduled for May 1st to today. We will be halting the cluster queues now, quiescing the nodes, and then getting the systems back into a usable state. We apologize, again, for the short notice. We expect to have all systems back into full production by noon Thursday.



4/25 7:30am: File checking and repairs are complete.  We're investigating a potential hardware problem on one of the GPFS servers.  We're also testing the GPFS storage performance before putting it back into production.  


4/24 4pm:  File checking continues on the GPFS file system.


4/24 9:30am:  After troubleshooting with the vendor over several days, they have recommended we take the file system offline and run file system checks.  These checks will allow us to find all the affected files and will attempt to correct the problems.  There is no guarantee files will be able to be repaired, but if we continue running in this state, we risk affecting new files.  As of 10am today we will be unmounting /gpfs from all the compute nodes, front end login machines, and other group servers that access them.  Regrettably, this will cause jobs that access these file systems, both running and queued, to fail.  We do not take this decision lightly and would not be doing this without having tried all other options first.  We apologize in advance for the issues this may cause you.  We've always stated these file systems are volatile, but we know that many users utilize them and regret having to remove access to them with so little notice.


NOTE: Jobs will be queued initially to prevent any from starting as we unmount GPFS.  Once that is complete, we will allow jobs to begin running on the nodes, although without GPFS.
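
For those curious how this kind of pause works in practice, one common approach on a Slurm-managed cluster looks roughly like the following (the partition name is hypothetical, and these are not necessarily the exact commands CCR used):

    # Stop new jobs from starting while /gpfs is unmounted; already-running jobs are not affected
    scontrol update PartitionName=compute State=DOWN
    # ... unmount /gpfs on the compute nodes, front ends, and group servers ...
    # Re-enable scheduling once the unmount is complete; jobs then run without GPFS
    scontrol update PartitionName=compute State=UP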



4/23 3pm:  We are troubleshooting reports of file access issues on the GPFS storage system with the vendor.