2024 ARC Winter Maintenance

11/28/2023

By Matt Britt


Winter maintenance is coming up! See the details below. Reach out to [email protected] with questions or if you need help. 

HPC

Like last year, we will run a rolling update that, outside of a few brief interruptions, should keep the clusters in production. Here is the schedule:

DECEMBER 6, 11 P.M.:

  • Update Slurm Controllers: Expect a brief 1-minute interruption when querying Slurm. All jobs will continue to run.
  • Update Open OnDemand Servers: Expect a few seconds of interruption if you are using Open OnDemand.
  • Login Servers Update: We will begin updating our login servers. This update is not expected to impact any users.

DECEMBER 7:

  • Compute Rolling Updates: We will start rolling updates across all clusters a few nodes at a time, so there should be minimal impact on access to resources.

DECEMBER 19, 10:30 A.M.:

  • Update Globus transfer (xfer) nodes: These nodes are deployed in pairs for each cluster, so we will take down one node of each pair at a time and all Globus services will remain available. If you are using scp/sftp, your transfers may be interrupted, so please schedule them accordingly or use Globus (see the sketch below). Total maintenance time should be approximately one hour.
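
If you would rather not risk an interrupted scp/sftp session, the Globus CLI can submit your copy as a managed task that survives a node failover. A minimal sketch, assuming you have the Globus CLI installed; the endpoint UUIDs and paths are placeholders, not real ARC values:

    # Log in once, then submit a recursive transfer as a managed background task.
    globus login
    globus transfer "$SRC_ENDPOINT:/home/uniqname/data" \
                    "$DST_ENDPOINT:/backup/data" \
                    --recursive --label "pre-maintenance copy"

Globus retries and resumes transfers on its own, which is why it is the safer option while the xfer nodes are being cycled.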

JANUARY 3, 8 A.M.:

  • Reboot Slurm Controller Nodes: This will cause an approximately 10-minute Slurm outage. All running jobs will continue to run; see the quick status checks after this list to confirm things are back to normal.
  • Armis2 Open OnDemand Node: We will reload and reboot the Armis2 Open OnDemand node. This will take approximately 1 hour.
  • Great Lakes and Lighthouse Open OnDemand Nodes: These nodes will be down approximately 10 to 15 minutes.
  • Globus Transfer (xfer) Nodes: These nodes will be rebooted. This will take approximately 15 minutes.
  • Sigbio Login Reload/Reboot: This will take approximately 1 hour.
  • Armis2 Faculty Owned Equipment (FOE) Nodes: Some FOE nodes will require physical and configuration updates. Expected downtime is 4 hours.
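
After each of these windows closes, a few standard Slurm commands make it easy to confirm that the controllers are responding and that your jobs came through cleanly, for example:

    squeue --me        # your queued and running jobs
    sinfo -s           # summary of partition and node availability
    scontrol ping      # confirms the Slurm controllers are responding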

HPC MAINTENANCE NOTES:

  • Open OnDemand (OOD) users will need to re-login. Any existing jobs will continue to run and can be reconnected in the OOD portal.
  • Login servers will be updated, and the maintenance should not have any effect on most users. Those who are affected will be contacted directly by ARC. 
  • New viz partition: There will be a new partition called viz with 16 new GPUs; jobs in this partition may use exactly one GPU each (see the example job script after this list).
  • The --cpus-per-gpu Slurm bug has been fixed.
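
For reference, here is a minimal sketch of a batch script that targets the new viz partition and exercises the repaired --cpus-per-gpu option. The account name and resource sizes are placeholders; adjust them to your own allocation:

    #!/bin/bash
    #SBATCH --job-name=viz-example
    #SBATCH --partition=viz             # new partition added during this maintenance
    #SBATCH --gpus=1                    # viz jobs may use exactly one GPU
    #SBATCH --cpus-per-gpu=4            # works again now that the Slurm bug is fixed
    #SBATCH --time=00:30:00
    #SBATCH --account=example_account   # placeholder; use your own account

    # Report which GPU the job received.
    nvidia-smi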

HPC MAINTENANCE DETAILS:

(New versions are listed first, with the previous version in parentheses; unchanged items are marked as such.)

Red Hat 8.6 EUS
  • Kernel 4.18.0-372.75.1.el8_6.x86_64 (previously 4.18.0-372.51.1.el8_6.x86_64)
  • glibc-2.28-189.6 (unchanged)
  • ucx-1.15.0-1.59056 (OFED provided, unchanged)
  • gcc-8.5.0-10.1.el8 (unchanged)

Mlnx-ofa_kernel-modules (unchanged)
  • OFED 5.9.0.5.5.1
    • kver.4.18.0_372.51.1.el8_6

Slurm 23.02.6 (previously 23.02.5), compiled with:
  • PMIx
    • /opt/pmix/3.2.5
    • /opt/pmix/4.2.6
  • hwloc 2.2.0-3 (OS provided)
  • ucx-1.15.0-1.59056 (OFED provided)
  • slurm-libpmi
  • slurm-contribs

PMIx LD config: /opt/pmix/3.2.5/lib (unchanged)

PMIx versions available in /opt (unchanged):
  • 3.2.5
  • 4.2.6

Singularity CE (Sylabs.io) (unchanged)
  • 3.10.4
  • 3.11.1

NVIDIA driver 545.23.06 (previously 530.30.02)

Open OnDemand 3.0.3 (unchanged)
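
Once the updates reach the nodes you use, the new versions can be spot-checked from a shell. The commands below are standard, though the exact output will vary by cluster and by which modules you have loaded:

    sinfo --version        # expect slurm 23.02.6
    srun --mpi=list        # lists the PMIx plugins this Slurm build supports
    singularity --version  # 3.10.4 or 3.11.1, depending on the loaded module
    nvidia-smi --query-gpu=driver_version --format=csv,noheader   # 545.23.06 on GPU nodes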

 

STORAGE

There is no scheduled downtime for Turbo, Locker, or Data Den.

SECURE ENCLAVE SERVICE (SES)

  • Maintenance details for SES will be announced separately.

MAINTENANCE NOTES:

  • Copy any data and files that may be needed during maintenance to your local drive using Globus File Transfer before maintenance begins.

STATUS UPDATES AND ADDITIONAL INFORMATION

  • Status updates will be available on the ARC Twitter feed and the ITS service status page throughout the maintenance.
  • ARC will send an email to all HPC users when the maintenance has been completed. 

HOW CAN WE HELP YOU?

For assistance or questions, please contact ARC at [email protected].