4/23-26: RESOLVED - Problems with CCR storage systems
Dori Sajdak started a topic over 6 years ago
4/26 2:30pm: We continue to investigate unusual performance problems on the Isilon storage. We are leaving jobs in the queue until the issue is resolved.
4/26 1:15pm: Thank you very much for your patience this week! All storage systems and clusters are back in production, and we have opened up logins to the academic & industry cluster front-ends. The slow logins and poor performance users experienced on the front ends yesterday were caused by users switching their jobs from /gpfs/scratch to the Isilon home and project directories.
We are requesting that all users who changed their job scripts to use the home (/user) or project (/project) directories DELETE any pending jobs, change their scripts to use /gpfs/scratch or local /scratch on the compute nodes, and then resubmit. This will keep the load on the Isilon storage (home & project directories) lower and allow the system to stay responsive for logins, as it was designed to do.
We will open up the clusters to begin running jobs again at 2pm today, once we see that job scripts have been modified. The May 1st downtime is cancelled. Please report any problems to ccr-help@buffalo.edu or use our help portal: https://ubccr.freshdesk.com
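For anyone updating their job scripts, here is a minimal sketch of a Slurm batch script that does its heavy I/O in /gpfs/scratch and only touches project space to copy data in and out. The group path, input file, and application command are illustrative placeholders, not CCR-specific values; adapt them to your own workflow.

    #!/bin/bash
    #SBATCH --job-name=example
    #SBATCH --nodes=1
    #SBATCH --ntasks-per-node=1
    #SBATCH --time=01:00:00

    # Do the heavy I/O in GPFS scratch (or node-local /scratch),
    # not in the home (/user) or project directories on the Isilon.
    WORKDIR=/gpfs/scratch/$USER/$SLURM_JOB_ID
    mkdir -p "$WORKDIR"
    cd "$WORKDIR"

    # Copy inputs in from project space once, at the start of the job.
    cp /projects/yourgroup/input.dat .      # illustrative path

    # Run the application against scratch.
    ./my_app input.dat > output.dat         # illustrative command

    # Copy results back to project space once, at the end of the job.
    cp output.dat /projects/yourgroup/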
For those who would like more details, this is what happened and what we did:
Last week we began noticing problems with the GPFS storage. Corrective actions were taken Thursday and seemed to have resolved the problems. On Friday, users with files on GPFS created before Thursday reported they could not access them. Additional actions were taken to bring the metadata servers back online, which made those files accessible again. Friday night we began seeing problems with jobs not completing properly. They were leaving the compute nodes in a 'CG' state and not clearing out, so new jobs could not start. The only way to clear them was to reboot the nodes, but the problem persisted. We then started seeing problems with GPFS again and engaged the vendor's support team to work on them.
Saturday night we began seeing problems with logins on the front ends, and the Isilon home and project directories were very slow. A drive was failing in the Isilon, which the system is designed to handle, but while data was migrated off to another drive, things slowed down considerably. Once the migration finished, logins to the front ends improved.
Monday morning we began hearing that older files on GPFS were inaccessible again, though newly created files were fine. The vendor suggested we bring down the GPFS system to run a filesystem check on it. Many services at CCR rely on GPFS and it is mounted on all the compute nodes, so the process to take it offline takes several hours. We felt we could keep jobs running on the clusters, though, because many people do not use /gpfs/scratch. We shut down the GPFS system and began the filesystem checks Tuesday; they completed overnight and into Wednesday morning. All affected files were flagged and repaired.
We began running tests on our development cluster first thing Wednesday to make sure GPFS was production ready before putting it back on the other clusters. By 11am Wednesday morning, Isilon system utilization was close to 100%, making the systems nearly unusable. Users could not log in because the home directories were unresponsive, and admins could not continue work because our commands were hanging as well. At this point, we made the decision to stop all jobs to completely quiesce the system, so that we could bring all systems back online and get all the compute nodes into a healthy state. Once we stopped the running jobs, the systems became much more responsive. We suspect that because GPFS was unavailable, users modified their jobs to write to the Isilon (IFS) instead, and the Isilon system had difficulty keeping up with the load it was under. IFS is designed to be a robust place to store data rather than a performance-based filesystem.
Our regular maintenance tasks were completed Wednesday afternoon, and all systems were tested overnight Wednesday. Disk repair processes are still in progress on the Isilon, but performance does not appear to be affected. We are monitoring diligently and will act at the first sign of any adverse impact. The budget storage system was also impacted in the last few days, but file system checks have been completed and any affected files were flagged and fixed.
4/26 12pm: We continue to investigate performance issues on the Isilon storage and cannot open the clusters until they are resolved.
4/26 10am: We're currently testing all storage and cluster systems and are preparing to open up the clusters by noon.
4/25 4:00pm: Once we stopped the running jobs, the systems became much more responsive. We suspect that because GPFS was unavailable, users modified their jobs to write to IFS instead, and the Isilon system was having difficulty keeping up with the load it was under. IFS is designed to be a robust place to keep data rather than a performance-based filesystem. We will ask all users who changed their job scripts to use the IFS directories (/projects, /users) to delete those jobs from the queue and revert to GPFS. We are currently testing the systems and will open everything back up once we are sure things are good.
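As a reminder of how to clear those queued jobs, assuming the standard Slurm commands are available on the front ends (the job ID shown is illustrative):

    # List your pending (queued) jobs.
    squeue -u $USER --states=PENDING

    # Cancel one pending job by ID, or all of your pending jobs at once.
    scancel 1234567
    scancel -u $USER --state=PENDING

    # After editing the script to use /gpfs/scratch again, resubmit it.
    sbatch myjob.sh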
4/25 1:45pm: The good news is that the GPFS file system was recovered and files that were previously inaccessible are now in a usable state. However, due to issues encountered with other storage systems, we have made the decision to move the maintenance downtime scheduled for May 1st to today. We will be halting the cluster queues now, quiescing the nodes, and getting the systems back into a usable state. We apologize, again, for the short notice. We expect to have all systems back into full production by noon Thursday.
4/25 7:30am: File checking and repairs are complete. We're investigating a potential hardware problem on one of the GPFS servers. We're also testing the GPFS storage performance before putting it back into production.
4/24 4pm: File checking continues on the GPFS file system.
4/24 9:30am: After troubleshooting with the vendor over several days, they have recommended we take the file system offline and run file system checks. These checks will allow us to find all the affected files and will attempt to correct the problems. There is no guarantee files will be able to be repaired, but if we continue running in this state, we risk affecting new files. As of 10am today we will be unmounting /gpfs from all the compute nodes, front-end login machines, and other group servers that access it. Regrettably, this will cause jobs that access this file system, both running and queued, to fail. We do not take this decision lightly and would not be doing this without having tried all other options first. We apologize in advance for the issues this may cause you. We've always stated these file systems are volatile, but we know that many users rely on them and we regret having to remove access with so little notice.
NOTE: Jobs will be queued initially to prevent any from starting as we unmount GPFS. Once that is complete, we will allow jobs to begin running on the nodes, although without GPFS.
4/23 3pm: We are troubleshooting reports of file access issues on the GPFS storage system with the vendor.