Long-Running Jobs

Resource Utilization and Memory Limit

Because there is a limited number of SCS machines, each with a limited amount of memory, it is important to be conscious of your resource use on these machines. Please do not run many instances of the same program on one machine. If you need to run many instances of a program, run about half on each of the two machines in the pool; running all of them on one machine can use up all of the CPU and memory available on that machine. Also, when you are finished with a job, make sure that you don't leave it running. An unattended process can run for weeks or even months, tying up a license and consuming resources.

Remember, these machines are a shared service. As a rule of thumb, you should try not to consume more than about 25% of the processing power on any individual host. This is not enforced at a system level; we rely on you to be a good citizen and will let you know if problems arise with your work.

If the machines become heavily loaded, we reserve the right to lower that percentage as needed.
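If you launch several copies of a program, it can help to work out in advance how many fit within the 25% guideline on a given host. The following is a minimal sketch, assuming each job keeps roughly one CPU core busy (multithreaded programs use more, so count their threads instead). The function name max_jobs is hypothetical.

```shell
# Sketch: how many single-threaded jobs fit in a given share of one host.
# ASSUMPTION: each job keeps about one core busy.
max_jobs() {
    cores=$1     # CPU cores on the host, e.g. the output of nproc
    percent=$2   # share of the host you intend to use, e.g. 25
    n=$(( cores * percent / 100 ))
    # Integer division rounds down; always allow at least one job.
    if [ "$n" -lt 1 ]; then
        n=1
    fi
    echo "$n"
}
```

For example, max_jobs "$(nproc)" 25 prints the budget for the machine you are on (nproc, from GNU coreutils, reports the number of available processing units). If you need more instances than that, spread them across the machines in the pool rather than piling them onto one.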

The servers in the pool have a memory limit of 24GB per login per user. (By contrast, the old pool had a memory limit of 8GB.) The limits prevent a runaway process from accidentally consuming all of the memory on one of the machines.
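To see how close you are to the memory limit, you can add up the resident memory of your processes. This is a rough sketch under two assumptions: that "24GB" means 24 GiB, and that summing across all of your processes is an acceptable stand-in for the per-login limit (it overcounts if you have several logins). The names sum_rss_kb, under_limit, and LIMIT_KB are hypothetical.

```shell
# Sketch: compare your processes' resident memory with the 24GB limit.
# ASSUMPTION: 24GB is treated as 24 GiB, i.e. 24 * 1024 * 1024 KiB.
LIMIT_KB=$((24 * 1024 * 1024))

# Reads one RSS value in KiB per line (the format of `ps -o rss=`)
# and prints the total; empty input prints 0.
sum_rss_kb() {
    awk '{ total += $1 } END { print total + 0 }'
}

# under_limit KB: succeeds (exit status 0) if KB is below LIMIT_KB.
under_limit() {
    [ "$1" -lt "$LIMIT_KB" ]
}
```

Usage: total=$(ps -u "$(id -un)" -o rss= | sum_rss_kb), then under_limit "$total" || echo "close to or over the limit". The ps rss column reports resident memory in kilobytes.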

We get a weekly report from each machine in the SCS pool about jobs that appear to have been idle for more than four weeks or that may have run out of control. We examine those jobs and contact their owners to determine whether they are running correctly.
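You can check your own jobs before the report does. The sketch below only looks at how long each process has been running, not at whether it is idle, and the actual report's criteria may differ. It assumes the Linux (procps-ng) ps, whose etimes column gives elapsed time in seconds; older_than_four_weeks is a hypothetical name.

```shell
# Sketch: list processes that have been running longer than four weeks.
FOUR_WEEKS=$((28 * 24 * 60 * 60))   # 2419200 seconds

# Reads "PID ELAPSED_SECONDS COMMAND" lines (the format of
# `ps -o pid=,etimes=,comm=`) and prints those older than four weeks.
older_than_four_weeks() {
    awk -v limit="$FOUR_WEEKS" '$2 > limit { print }'
}
```

Usage: ps -u "$(id -un)" -o pid=,etimes=,comm= | older_than_four_weeks. Anything it prints is worth a second look: either it is still doing useful work, or it should be stopped.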

The Problem

AFS Token Expiration

Your home directory is located in AFS, which is a secure, networked file system. AFS implements its security using the Kerberos protocol. When you log in, you are automatically issued Kerberos tickets and AFS tokens, which by default expire after 24 hours. Your tokens are also invalidated when you log out.

Any job that is still running after you log out or after your tokens expire can no longer write to AFS. Sometimes, however, you need a job to keep running after you log out, or for longer than the 24-hour default token lifetime.
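Before starting a job that writes to AFS, it can be useful to confirm that your credentials are still valid. This is a sketch assuming MIT Kerberos, whose klist -s prints nothing and exits non-zero when the credential cache is missing or holds only expired tickets; tokens is the OpenAFS command that lists your current AFS tokens. The function name check_credentials is hypothetical.

```shell
# Sketch: confirm credentials are still usable before a job writes to AFS.
# ASSUMPTION: MIT Kerberos klist (-s = silent validity check).
check_credentials() {
    if ! command -v klist >/dev/null 2>&1; then
        echo "klist is not installed on this host" >&2
        return 1
    fi
    if ! klist -s; then
        echo "No valid Kerberos tickets found; run kinit and aklog" >&2
        return 1
    fi
    # Also list AFS tokens, if the OpenAFS client tools are available.
    if command -v tokens >/dev/null 2>&1; then
        tokens
    fi
    return 0
}
```

For example, check_credentials && ./my_job starts a (hypothetical) program called my_job only if valid tickets are present. Note that this checks credentials at start-up only; it does not keep them alive for the job's whole run.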

The Solution

Creating an Extended Screen Session

The following set of commands will initiate an extended screen session with a four-day Kerberos and AFS token lifetime.

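  1. Log into scs.dsc.umich.edu.
  2. Type "hostname". The system will display the name of the machine you are currently using. Note this name, as you will need it later.
  3. Run:
    # Start a new shell with its own process authentication group (PAG),
    # so the credentials below belong only to this shell and its children.
    pagsh
    # Keep this session's Kerberos tickets in their own cache file under /ticket.
    export KRB5CCNAME=FILE:$(mktemp -p /ticket krb5cc_screen_XXXXXX)
    # Request four-day Kerberos tickets, then get AFS tokens from them.
    kinit -l 4d
    aklog
    # Start a named screen session that inherits these credentials.
    screen -S sessionname

    where "sessionname" is a one-word name for what you are doing.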

You now have an extended screen session. You can run any commands as normal.
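If you create extended sessions often, the credential-cache line and the one-word naming rule from step 3 can be wrapped in small helpers. This is a sketch only: new_ccache and valid_session_name are hypothetical names, it assumes GNU mktemp (for the -p option used above), and the set of characters allowed in a session name is our own choice.

```shell
# new_ccache DIR: create a uniquely named cache file in DIR (as in step 3)
# and print it in the FILE:path form that KRB5CCNAME expects.
new_ccache() {
    path=$(mktemp -p "$1" krb5cc_screen_XXXXXX) || return 1
    echo "FILE:$path"
}

# valid_session_name NAME: succeed only for a non-empty, one-word name made
# of letters, digits, '-' and '_'. ASSUMPTION: this character set is
# illustrative; screen itself may accept more.
valid_session_name() {
    case $1 in
        ''|*[!A-Za-z0-9_-]*) return 1 ;;
        *) return 0 ;;
    esac
}
```

Inside the pagsh shell you would then run export KRB5CCNAME=$(new_ccache /ticket), and check a name with valid_session_name before passing it to screen -S.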

Logging Out Without Stopping Your Job

When you are ready to log out:

  1. First type "ctrl-a d" (press the Control key and "a" key simultaneously and then type "d"). This detaches your screen session.
  2. You can now log out; your job will continue running within the screen session.
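Before logging out, you can confirm that your session really is detached with screen -ls. The sketch below condenses that listing to one "PID.NAME STATE" line per session. It assumes GNU screen's listing format, where each session appears on a tab-indented line starting with PID.NAME and ending with (Attached) or (Detached); other versions may format this differently. The name screen_sessions is hypothetical.

```shell
# Sketch: summarize `screen -ls` output as "PID.NAME STATE" lines.
# ASSUMPTION: GNU screen listing format (tab-indented session lines,
# state in parentheses as the last field).
screen_sessions() {
    awk '/^\t/ { state = $NF; gsub(/[()]/, "", state); print $1, state }'
}
```

Usage: screen -ls | screen_sessions. If your session shows Attached, detach it with ctrl-a d before you log out.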

Logging Back into an Extended Screen Session

To reconnect:

  1. Use ssh to connect to the host you noted when you created the session. You cannot use scs.dsc.umich.edu here, because that name may connect you to any of the machines in the SCS pool.
  2. Type:
    screen -d -r sessionname

    where "sessionname" is the name you chose above.

You will be reattached to the screen session, where your command will still be running. You can detach and reattach as often as you like.
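Because you must reconnect to the same machine, it helps to record the host name at the time you create the session instead of relying on memory. This is a sketch: the directory ~/.screen-hosts and the helpers remember_host and reattach_cmd are hypothetical, not a system convention.

```shell
# Sketch: remember which host a session lives on, then build the
# command that reconnects to it.
HOST_DIR=${HOST_DIR:-$HOME/.screen-hosts}   # hypothetical location

# remember_host NAME: run on the machine where you created session NAME.
# Falls back to `uname -n` if the hostname command is unavailable.
remember_host() {
    mkdir -p "$HOST_DIR" || return 1
    { hostname 2>/dev/null || uname -n; } > "$HOST_DIR/$1"
}

# reattach_cmd NAME: print the ssh command that reattaches to NAME.
# ssh -t allocates a terminal, which screen needs.
reattach_cmd() {
    host=$(cat "$HOST_DIR/$1") || return 1
    echo "ssh -t $host screen -d -r $1"
}
```

For example, run remember_host sessionname right after step 2 of creating the session; later, from any pool machine, reattach_cmd sessionname prints the exact command to type.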

Creating Additional Windows Within the Session

You can create additional "windows" within the screen session by typing "ctrl-a c", as many times as you like. Use "ctrl-a p" to go to the previous window and "ctrl-a n" to go to the next one.

Extending Your Extended Session

Additional windows within your extended session (see above) are handy when you need to renew your Kerberos and AFS credentials:

  1. Switch to or open a window containing a shell prompt.
  2. Type:
    kinit -l 4d
    aklog

This gives you new four-day Kerberos tickets and new AFS tokens. Because you ran these commands inside the screen session, which has its own set of credentials, the renewal applies to everything running in that session.

To extend the session indefinitely, log in, reattach to the screen session, and renew your Kerberos and AFS credentials before they expire. A good rule of thumb is to renew them about once a day.
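To keep track of when a renewal is due, you could wrap the two renewal commands so they also record the time, and then compute how many hours remain. This is a sketch: renew, hours_left, and the stamp file ~/.screen-renewed are hypothetical, and it assumes the KDC actually grants the full four days you request with kinit -l 4d (if it grants less, the estimate is too high).

```shell
# Sketch: renew credentials inside the screen session and track time left.
LIFETIME=$((4 * 24 * 60 * 60))            # kinit -l 4d = 345600 seconds
STAMP=${STAMP:-$HOME/.screen-renewed}     # hypothetical stamp file

# renew: the two commands above, then record the renewal time (epoch seconds).
# Defining this function does not run kinit; you call renew yourself.
renew() {
    kinit -l 4d && aklog && date +%s > "$STAMP"
}

# hours_left RENEWED_AT NOW: whole hours until tickets obtained at
# RENEWED_AT (epoch seconds) expire, or 0 if they already have.
hours_left() {
    left=$(( $1 + LIFETIME - $2 ))
    if [ "$left" -lt 0 ]; then
        left=0
    fi
    echo $(( left / 3600 ))
}
```

Usage: after running renew in a screen window, hours_left "$(cat ~/.screen-renewed)" "$(date +%s)" shows the remaining time. Renewing once a day keeps it comfortably above zero.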