By Matt Britt
Winter maintenance is coming up! See the details below. Reach out to [email protected] with questions or if you need help.
HPC
Like last year, we will perform a rolling update that, outside of a few brief interruptions, should keep the clusters in production. Here is the schedule:
DECEMBER 6, 11 P.M.:
- Update Slurm Controllers: Expect a brief 1-minute interruption when querying Slurm. All jobs will continue to run.
- Update Open OnDemand Servers: Expect a few seconds of interruption if you are using Open OnDemand.
- Login Servers Update: We will begin updating our login servers. This update is not expected to impact any users.
DECEMBER 7:
- Compute Rolling Updates: We will start rolling updates across all clusters a few nodes at a time, so there should be minimal impact on access to resources.
DECEMBER 19, 10:30 A.M.:
- Update Globus Transfer (xfer) Nodes: These nodes are deployed in pairs for each cluster, and we will take down one node of each pair at a time, so all Globus services will remain available. If you are using scp/sftp, your transfers may be interrupted, so please schedule those transfers accordingly or use Globus instead (a Globus CLI sketch follows this schedule). Total maintenance time should be approximately one hour.
JANUARY 3, 8 A.M.:
- Reboot Slurm Controller Nodes: This will cause an approximately 10-minute Slurm outage. All running jobs will continue to run.
- Armis2 Open OnDemand Node: We will reload and reboot the Armis2 Open OnDemand node. This will take approximately 1 hour.
- Great Lakes and Lighthouse Open OnDemand Nodes: These nodes will be down approximately 10 to 15 minutes.
- Globus Transfer (xfer) Nodes: These nodes will be rebooted. This will take approximately 15 minutes.
- Sigbio Login Reload/Reboot: This will take approximately 1 hour.
- Some Armis2 Faculty Owned Equipment (FOE) nodes will require physical and configuration updates. Expected downtime is 4 hours.
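If you normally rely on scp/sftp and cannot schedule around the transfer-node work, the Globus CLI is one alternative. The sketch below is a minimal example and assumes the globus CLI is installed and logged in; the endpoint UUIDs and paths are placeholders, not actual ARC collection IDs.

    # Look up the UUIDs of your source and destination collections (search term is an example)
    globus endpoint search "example search term"

    # Start an asynchronous, recursive transfer; the Globus service handles retries on its side
    globus transfer SOURCE_ENDPOINT_UUID:/path/to/data DEST_ENDPOINT_UUID:/destination/path \
        --recursive --label "pre-maintenance copy"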
HPC MAINTENANCE NOTES:
- Open OnDemand (OOD) users will need to re-login. Any existing jobs will continue to run and can be reconnected in the OOD portal.
- Login servers will be updated; this should not affect most users. Those who are affected will be contacted directly by ARC.
- New viz partition: There will be a new partition called viz with 16 new GPUs; viz jobs can use exactly one GPU each. A sample batch script is sketched below.
- The --cpus-per-gpu Slurm bug has been fixed.
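For reference, here is a minimal sketch of a batch script that targets the new viz partition. The job name, CPU count, time limit, and account are illustrative assumptions; only the partition name, the one-GPU limit, and the --cpus-per-gpu flag come from the notes above.

    #!/bin/bash
    #SBATCH --job-name=viz-example
    #SBATCH --partition=viz           # new visualization partition
    #SBATCH --gpus=1                  # viz jobs can use exactly one GPU
    #SBATCH --cpus-per-gpu=4          # flag affected by the fixed Slurm bug; count is an example
    #SBATCH --time=01:00:00
    #SBATCH --account=your_account    # placeholder; replace with your own Slurm account

    # Show which GPU the job was assigned
    nvidia-smi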
HPC MAINTENANCE DETAILS:
NEW version | OLD version
Red Hat 8.6 EUS | Red Hat 8.6 EUS
Mlnx-ofa_kernel-modules | Mlnx-ofa_kernel-modules
Slurm 23.02.6 compiled with: | Slurm 23.02.5 compiled with:
PMIx LD config /opt/pmix/3.2.5/lib | PMIx LD config /opt/pmix/3.2.5/lib
PMIx versions available in /opt | PMIx versions available in /opt
Singularity CE (Sylabs.io) | Singularity CE (Sylabs.io)
NVIDIA driver 545.23.06 | NVIDIA driver 530.30.02
Open OnDemand 3.0.3 | Open OnDemand 3.0.3
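After your cluster has been updated, one quick way to confirm that a node is on the new stack is to check the reported versions. A minimal sketch (exact output formats vary) using standard commands:

    sinfo --version        # expect slurm 23.02.6
    nvidia-smi --query-gpu=driver_version --format=csv,noheader    # expect 545.23.06 on GPU nodes
    singularity --version  # Singularity CE build
    ls /opt                # lists the PMIx versions installed under /opt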
STORAGE
There is no scheduled downtime for Turbo, Locker, or Data Den.
SECURE ENCLAVE SERVICE (SES)
- SES team to add details here
MAINTENANCE NOTES:
- No downtime for ARC storage systems maintenance (Turbo, Locker, and Data Den).
- Copy any data and files that may be needed during maintenance to your local drive using Globus File Transfer before maintenance begins.
STATUS UPDATES AND ADDITIONAL INFORMATION
- Status updates will be available on the ARC Twitter feed and the ITS service status page throughout the maintenance.
- ARC will send an email to all HPC users when the maintenance has been completed.
HOW CAN WE HELP YOU?
For assistance or questions, please contact ARC at [email protected].