HPC Emergency 2023 Maintenance: September 15

9/15/2023

By Stephanie Dascola


Due to a critical issue that requires an immediate update, we will be updating Slurm and the underlying libraries that allow parallel jobs to communicate. We will update the login nodes and the rest of the cluster on the fly, and you should experience only minimal impact when interacting with the clusters.

  • Jobs that are currently running will be allowed to finish. 
  • All new jobs will only be allowed to run on nodes which have been updated. 
  • The login and Open OnDemand nodes will also be updated, which will require a brief interruption in service.

QUEUED JOBS AND MAINTENANCE REMINDERS

Jobs will remain queued and will automatically begin after the maintenance is completed. Any parallel job using MPI will fail; those jobs may need to be recompiled, as described below. Jobs not using MPI will not be affected by this update.

Jobs will be initially slow to start, as compute nodes are drained of running jobs so they can be updated. We apologize for this inconvenience, and want to assure you that we would not be performing this maintenance during a semester unless it was absolutely necessary.

SOFTWARE UPDATES

Only one version of OpenMPI (version 4.1.6) will be available; all other versions will be removed. If you try to load a module for a removed version of OpenMPI, it will warn you that the version is no longer available and prompt you to load openmpi/4.1.6.

When you use the following command, it will default to openmpi/4.1.6:
module load openmpi 

Any software packages you use (provided by ARC/LSA/COE/UMMS or yourself) will need to be updated to use openmpi/4.1.6. The software package updates will be completed by ARC. The code you compile yourself will need to be updated by you.
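
As a starting point for finding what needs updating, the following is a minimal sketch of a shell helper that lists lines in your job scripts that load a specific OpenMPI version other than 4.1.6. The function name find_old_openmpi is hypothetical (it is not an ARC-provided tool), and the pattern assumes your scripts load modules with lines of the form "module load ... openmpi/<version>".

```shell
# Hypothetical helper (not an ARC tool): print file, line number, and text of
# every "module load" line that pins an OpenMPI version other than 4.1.6.
# An unversioned "module load openmpi" is not reported, since it defaults
# to openmpi/4.1.6.
find_old_openmpi() {
    grep -HnE 'module[[:space:]]+load[[:space:]].*openmpi/[0-9]' "$@" \
        | grep -v 'openmpi/4\.1\.6' || true
}
```

You could run it as find_old_openmpi followed by the paths of your own job scripts. For each line it reports, change the module to openmpi/4.1.6, then rebuild any code you compiled yourself against the newly loaded module, for example with Open MPI's compiler wrappers such as mpicc.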

Note that openmpi/3.1.6 is being discontinued at this time; if you use it, you will be warned to update to openmpi/4.1.6.

STATUS UPDATES