By Matt Britt
Winter maintenance is coming up! See the details below. Reach out to [email protected] with questions or if you need help.
HPC
Like last year, we will perform a rolling update that, outside of a few brief interruptions, should keep the clusters in production. Here is the schedule:
DECEMBER 6, 11 P.M.:
- Update Slurm Controllers: Expect a brief 1-minute interruption when querying Slurm. All jobs will continue to run.
- Update Open OnDemand Servers: Expect a few seconds of interruption if you are using Open OnDemand.
- Login Servers Update: We will begin updating our login servers. This update is not expected to impact any users.
DECEMBER 7:
- Compute Rolling Updates: We will start rolling updates across all clusters a few nodes at a time, so there should be minimal impact on access to resources.
DECEMBER 19, 10:30 A.M.:
- Update Globus Transfer (xfer) Nodes: These nodes are deployed in pairs for each cluster, and we will take down one node of each pair at a time, so all Globus services will remain available. If you are using scp/sftp, your transfers may be interrupted, so please schedule those transfers accordingly or use Globus instead (a Globus CLI sketch follows this schedule). Total maintenance time should be approximately one hour.
JANUARY 3, 8 A.M.:
- Reboot Slurm Controller Nodes: This will cause an approximately 10-minute Slurm outage. All running jobs will continue to run.
- Armis2 Open OnDemand Node: We will reload and reboot the Armis2 Open OnDemand node. This will take approximately 1 hour.
- Great Lakes and Lighthouse Open OnDemand Nodes: These nodes will be down approximately 10 to 15 minutes.
- Globus Transfer (xfer) Nodes: These nodes will be rebooted. This will take approximately 15 minutes.
- Sigbio Login Reload/Reboot: This will take approximately 1 hour.
- Some Armis2 Faculty Owned Equipment (FOE) nodes will require physical and configuration updates. Expected downtime is 4 hours.
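If you normally rely on scp/sftp and cannot schedule around the transfer-node work, the Globus CLI is one alternative. The sketch below is a minimal example and assumes the globus CLI is installed and logged in; the endpoint UUIDs and paths are placeholders, not actual ARC collection IDs.

    # Look up the UUIDs of your source and destination collections (search term is an example)
    globus endpoint search "example search term"

    # Start an asynchronous, recursive transfer; the Globus service handles retries on its side
    globus transfer SOURCE_ENDPOINT_UUID:/path/to/data DEST_ENDPOINT_UUID:/destination/path \
        --recursive --label "pre-maintenance copy"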
HPC MAINTENANCE NOTES:
- Open OnDemand (OOD) users will need to re-login. Any existing jobs will continue to run and can be reconnected in the OOD portal.
- Login servers will be updated; this should not affect most users. Those who are affected will be contacted directly by ARC.
- New viz partition: There will be a new partition called viz with 16 new GPUs; viz jobs can use exactly one GPU each. A sample batch script is sketched below.
- The --cpus-per-gpu Slurm bug has been fixed.
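For reference, here is a minimal sketch of a batch script that targets the new viz partition. The job name, CPU count, time limit, and account are illustrative assumptions; only the partition name, the one-GPU limit, and the --cpus-per-gpu flag come from the notes above.

    #!/bin/bash
    #SBATCH --job-name=viz-example
    #SBATCH --partition=viz           # new visualization partition
    #SBATCH --gpus=1                  # viz jobs can use exactly one GPU
    #SBATCH --cpus-per-gpu=4          # flag affected by the fixed Slurm bug; count is an example
    #SBATCH --time=01:00:00
    #SBATCH --account=your_account    # placeholder; replace with your own Slurm account

    # Show which GPU the job was assigned
    nvidia-smi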
HPC MAINTENANCE DETAILS:
NEW version | OLD version
Red Hat 8.6 EUS | Red Hat 8.6 EUS
Mlnx-ofa_kernel-modules | Mlnx-ofa_kernel-modules
Slurm 23.02.6 compiled with: | Slurm 23.02.5 compiled with:
PMIx LD config /opt/pmix/3.2.5/lib | PMIx LD config /opt/pmix/3.2.5/lib
PMIx versions available in /opt | PMIx versions available in /opt
Singularity CE (Sylabs.io) | Singularity CE (Sylabs.io)
NVIDIA driver 545.23.06 | NVIDIA driver 530.30.02
Open OnDemand 3.0.3 | Open OnDemand 3.0.3
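After your cluster has been updated, one quick way to confirm that a node is on the new stack is to check the reported versions. A minimal sketch (exact output formats vary) using standard commands:

    sinfo --version        # expect slurm 23.02.6
    nvidia-smi --query-gpu=driver_version --format=csv,noheader    # expect 545.23.06 on GPU nodes
    singularity --version  # Singularity CE build
    ls /opt                # lists the PMIx versions installed under /opt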
STORAGE
There is no scheduled downtime for Turbo, Locker, or Data Den.
SECURE ENCLAVE SERVICE (SES)
- SES team to add details here
MAINTENANCE NOTES:
- No downtime for ARC storage systems maintenance (Turbo, Locker, and Data Den).
- Copy any data and files that may be needed during maintenance to your local drive using Globus File Transfer before maintenance begins.
STATUS UPDATES AND ADDITIONAL INFORMATION
- Status updates will be available on the ARC Twitter feed and the ITS service status page throughout the maintenance.
- ARC will send an email to all HPC users when the maintenance has been completed.
HOW CAN WE HELP YOU?
For assistance or questions, please contact ARC at [email protected].