MOVED January 2021: CENTERWIDE Maintenance Downtime (2/1-2)

See progress updates in the 'General Time Line' section below

TL;DR - Systems are back open except for a very small number of users!

Date of downtime: Monday February 1, 2021 at 7am through Tuesday, February 2 at 5pm (at least)

NOTE: This may run more than 2 days depending on how long the final sync of data takes

Approximate time of outage: 7am-5pm

Resources affected by downtime:


UB-HPC cluster (all partitions) 

Industry cluster (all partitions)

Faculty cluster (all partitions)

Portals: WebMO, OnDemand, IDM, ColdFront

Data Transfer Services: Globus, data transfer node

NOTE: this does not affect the Lake Effect research cloud unless your instance is managed by CCR and mounts these file systems.

What will be done:  

Jobs will be held in the queue during the maintenance downtime and will run after the updates are complete.

What changes will users experience after the data migration?

We have detailed the changes in this article

What happens if the final sync takes more than 2 days?

We have almost 1PB of data to move from one storage system to another.  The bigger challenge is not the volume of data but the number of files.  For project directories with hundreds of millions of files, the final sync may take longer than 2 days.  We will need to disable the accounts for those groups and put holds on any queued jobs.  As the sync completes for your project directory, we will re-enable your account and unblock your jobs.  If your group falls into this category, we will notify you by 3pm on February 2nd.

Is there a way I can determine if my group may fall into this category?

If your project directory contains more than 2 million files and you have been actively changing, adding, or removing files in the last few weeks, there is a strong possibility that your data will take longer than 2 days to copy.  You can check the number of files in your user and group directories using the iquota command.  Instructions can be found here
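As a rough cross-check (a minimal sketch using standard POSIX tools rather than iquota; the directory path is a placeholder you would replace with your own /projects path):

```shell
# Count regular files under a project directory with standard POSIX tools.
# DIR is a placeholder; substitute your own project directory path.
DIR="${DIR:-.}"

count=$(find "$DIR" -type f | wc -l)
echo "files under $DIR: $count"

# Flag directories that may exceed the 2-million-file threshold noted above.
if [ "$count" -gt 2000000 ]; then
    echo "warning: more than 2 million files; the final sync may take longer"
fi
```

Note that find walks the entire tree, which can itself take a very long time on directories with hundreds of millions of files; the iquota instructions linked above are the recommended route.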

General Time Line:


7am: Logins for ALL users cut off to all resources listed above

Reboots of cluster nodes and updates of filesystem mounts on compute nodes and front-end servers for the clusters.   COMPLETED

Final sync of /user begins  - COMPLETED

Fix of home directory permissions for users with non-standard permissions (if this affects you, you would have received an email from CCR staff on 1/26/21) - COMPLETED

12pm:  Final sync of /projects begins   IN PROGRESS - 33% complete as of 4:30pm

Updates to mount points on all other servers throughout the center.   COMPLETED  


Sync of /projects continues to run.  We have broken down some of the larger directories to speed things up.  Status as of 3pm:

Sync 1 - the majority of project directories - COMPLETED

Sync 2 - 1 very large project subdirectory with hundreds of millions of files - 86% complete

Sync 3 - 1 very large project subdirectory - 50% complete

7am-3pm:  Preventative maintenance/testing of machine room cooling system  COMPLETED  

3pm: Notifications sent to any groups whose project directory is still syncing and who will not be able to log in at the end of the main downtime.   COMPLETED

4pm: Release cluster jobs and allow logins for all users whose directories have completed syncing.  COMPLETED


CCR staff will continue monitoring the progress of project directory syncing and unblock groups as their directories finish syncing.  Notifications will go out to these groups individually.

If you have any questions or concerns, please e-mail