Dori Sajdak
See progress updates in the 'General Time Line' section below
TL;DR - Systems are back open except for a very small number of users!
Date of downtime: Monday, February 1, 2021 at 7am through Tuesday, February 2 at 5pm (at least)
NOTE: This may run more than 2 days depending on how long the final sync of data takes
Approximate time of outage: 7am-5pm
Resources affected by downtime:
EVERYTHING THAT MOUNTS HOME (/user) AND PROJECT (/projects) DIRECTORIES, including:
UB-HPC cluster (all partitions)
Industry cluster (all partitions)
Faculty cluster (all partitions)
Portals: WebMO, OnDemand, IDM, ColdFront
Data Transfer Services: Globus, data transfer node (transfer.ccr.buffalo.edu)
NOTE: this does not affect the Lake Effect research cloud unless your instance is managed by CCR and mounts these file systems.
What will be done:
Jobs will be held in queue during the maintenance downtime and will run after the updates are complete.
What changes will users experience after the data migration?
We have detailed the changes in this article
What happens if the final sync takes more than 2 days?
We have almost 1PB of data to move from one storage system to another. An even bigger challenge than the amount of data is the number of files. For project directories with hundreds of millions of files, the final sync may take longer than 2 days. We will need to disable the accounts for those groups and put holds on any queued jobs. As the sync completes for your project directory, we will re-enable your account and unblock your jobs. If your group falls into this category, we will notify you by 3pm on February 2nd.
Is there a way I can determine if my group may fall into this category?
If your project directory has more than 2 million files in it and you've been actively changing, adding, or removing files in the last few weeks, there is a strong possibility your data will take longer than 2 days to copy. You can check the number of files under your user and group directories using the iquota command. Instructions can be found here.
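If you would rather do a quick sanity check yourself, a short script can tally files directly. The sketch below is generic Python, not a CCR tool; /projects/your_group is a placeholder path, and a full walk like this can take a very long time on directories holding millions of files, which is exactly why iquota is the recommended route.

```python
#!/usr/bin/env python3
"""Rough file count for a directory tree.

Generic sketch only -- not a CCR utility, and much slower than
the iquota command on trees with millions of files.
"""
import os
import sys

def count_files(root: str) -> int:
    """Walk the tree under root and count regular file entries."""
    total = 0
    for _dirpath, _dirnames, filenames in os.walk(root):
        total += len(filenames)
    return total

if __name__ == "__main__":
    # Placeholder path -- substitute your actual project directory.
    root = sys.argv[1] if len(sys.argv) > 1 else "/projects/your_group"
    print(f"{count_files(root):,} files under {root}")
```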
General Time Line:
Monday:
7am: Logins for ALL users cut off to all resources listed above
Reboots of cluster nodes and update of filesystem mounts on compute nodes and front-end servers for the clusters - COMPLETED
Final sync of /user begins - COMPLETED
Fix of home directory permissions for users with non-standard permissions (if this affects you, you would have received an email from CCR staff on 1/26/21) - COMPLETED
12pm: Final sync of /projects begins - IN PROGRESS (33% complete as of 4:30pm)
Updates to mount points on all other servers throughout the center - COMPLETED
Tuesday:
Sync of /projects continues to run. We have broken some of the larger directories into separate sync jobs to speed things up (a generic sketch of this approach appears at the end of this Tuesday section). Status as of 3pm:
Sync 1 - the majority of project directories - COMPLETED
Sync 2 - 1 very large project subdirectory with hundreds of millions of files - 86% complete
Sync 3 - 1 very large project subdirectory - 50% complete
7am-3pm: Preventative maintenance/testing of machine room cooling system - COMPLETED
3pm: Notifications sent to any groups whose project directory is still syncing and who will not be able to log in at the end of the main downtime - COMPLETED
4pm: Release cluster jobs and allow logins for all users whose directories have completed syncing - COMPLETED
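For those curious how a single huge directory can be broken down this way, here is a minimal generic sketch: partition the top-level subdirectories across a few concurrent rsync streams. It is an illustration only, not the actual migration tooling; the source/destination paths, the worker count, and even the use of rsync are assumptions.

```python
#!/usr/bin/env python3
"""Illustrative parallel sync of one large tree, split per subdirectory.

Generic sketch only -- not the actual migration tooling. SRC, DEST,
and the worker count are placeholder assumptions.
"""
import os
import subprocess
from concurrent.futures import ThreadPoolExecutor

SRC = "/old_storage/projects"   # placeholder: source mount
DEST = "/new_storage/projects"  # placeholder: destination mount

def sync_subdir(name: str) -> int:
    """Run one rsync stream for a single top-level subdirectory."""
    # -a preserves permissions and timestamps; the trailing slashes
    # copy directory contents rather than nesting a new directory.
    cmd = ["rsync", "-a", f"{SRC}/{name}/", f"{DEST}/{name}/"]
    return subprocess.run(cmd).returncode

if __name__ == "__main__":
    subdirs = [e.name for e in os.scandir(SRC) if e.is_dir()]
    # A few concurrent streams; the right number depends on the storage.
    with ThreadPoolExecutor(max_workers=4) as pool:
        codes = list(pool.map(sync_subdir, subdirs))
    failed = [d for d, c in zip(subdirs, codes) if c != 0]
    print("all syncs completed" if not failed else f"failed: {failed}")
```

Splitting helps because a sync of a tree with hundreds of millions of files spends most of its time on per-file metadata operations rather than raw data transfer; running several streams concurrently overlaps that latency, which is roughly why breaking out the largest directories speeds things up.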
Ongoing:
CCR staff will continue monitoring the progress of project directory syncing and unblock groups as their directories finish syncing. Notifications will go out to these groups individually.
If you have any questions or concerns, please e-mail ccr-help_at_buffalo.edu.