Department of Engineering

IT Services

Running Long Programs

To run long jobs, just log into ts-access and run them. When you use ts-access you're logged into one of our faster machines (our Linux servers), which have more memory and 8x more CPU power than the DPO/EIETL machines, and the same user files as the usual DPO/EIETL Linux machines. Fair use of these machines depends on community spirit and peer pressure. Mail helpdesk if you think someone's hogging resources.

Some familiarity with running things from the Unix/MacOS command line is essential (see A short Unix crib). Familiarity with shell scripts is an advantage. You can run the programs as you normally would from the Linux/MacOS command line, but when running long programs note that

  • You probably won't be able to give your program input from the command line, especially if it's asked for long after the program's started.
  • If you put & at the end of your command, you can run your program "in the background", freeing up the command line to do other things.
  • If you put nohup at the start of your command, the program won't be killed when you log out.

So once you're logged into a Linux/Unix/MacOS machine at CUED, typing

  ssh ts-access
  ...
  nohup my/program &

will start my/program and keep it running after you log out of ts-access. Output that would normally appear onscreen goes into a file called nohup.out in the directory where you ran the command, though note that output is usually buffered, so the file won't be updated immediately.

But before you run your program in the background, first check that it starts OK when run normally.
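
The points above about input and output can be sketched as follows. This is only an illustration: fake_job.sh is a made-up stand-in for your long program, and job_input.txt and job.log are hypothetical file names. The idea is that the program reads its answers from a file instead of the keyboard, and its output goes to a log file you've named yourself rather than nohup.out.

```shell
#!/bin/sh
# Stand-in for a long program (hypothetical): reads one line of input,
# then prints some output.
cat > fake_job.sh <<'EOF'
#!/bin/sh
read -r run_name
echo "starting run: $run_name"
echo "finished run: $run_name"
EOF

# Answers the program would normally ask for at the keyboard.
echo "trial1" > job_input.txt

# Input comes from a file, output and errors go to a named log file,
# and nohup plus & let the job carry on after you log out.
nohup sh fake_job.sh < job_input.txt > job.log 2>&1 &
job_pid=$!
echo "started job with PID $job_pid"

# Only needed in this demo: wait so the log is complete before reading it.
wait "$job_pid"
cat job.log
```

With a real job you wouldn't wait for it. Typing tail -f job.log shows new output as it arrives.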

Matlab

If you run a Matlab job, remember to exit from Matlab at the end of the script or function, because Matlab won't exit automatically. If, for example, you have a file in your home directory called testme.m containing

disp("hello world!")
exit

you could log into ts-access, type

  nohup matlab -r testme &

then log out of ts-access. Soon in your home directory on CUED's central system you'll have a file called nohup.out containing the output of your program. Matlab will no longer be running on the ts-access machine.

If your program doesn't use the parts of Matlab written in Java (which for number-crunching programs that don't use the Parallel Computing Toolbox is going to be the case) then you can speed things up by using matlab -nodesktop -nojvm instead of matlab.
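
One thing to watch: if the script hits an error before it reaches exit, Matlab will sit at its prompt rather than finishing. A possible way round this, sketched below, is to wrap the call in try/catch so that Matlab exits either way. The log file name and the wrapper itself are our assumptions, not a required CUED setup, and the sketch only starts Matlab if it's actually installed on the machine.

```shell
#!/bin/sh
# Name of the example script from above, without the .m extension.
script_name="testme"

# Matlab commands: run the script, print any error, and always exit.
matlab_cmd="try, ${script_name}, catch err, disp(getReport(err)), exit(1), end, exit(0)"

if command -v matlab >/dev/null 2>&1; then
    # Log file name is an illustrative choice.
    nohup matlab -nodesktop -nojvm -r "$matlab_cmd" > "${script_name}.log" 2>&1 &
    echo "started Matlab, logging to ${script_name}.log"
else
    echo "matlab not found; would have run: matlab -nodesktop -nojvm -r \"$matlab_cmd\""
fi
```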

Preparing your code

Try to write your code so that it saves results periodically, and the program can re-start by loading in those results, carrying on from that stage. In this way you can still make progress even if your programs are interrupted by power-cuts, reboots (which happen some Tuesday evenings at CUED, so that we can do updates) etc.
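
Here's a minimal sketch of that save-and-resume idea as a shell script. The checkpoint file name, the number of steps and the "work" (adding up numbers) are all stand-ins. The same pattern applies in whatever language your program is written in.

```shell
#!/bin/sh
# Hypothetical checkpoint file: holds "last_completed_step running_total".
checkpoint="${CHECKPOINT_FILE:-checkpoint.txt}"
total_steps=10

run_steps() {
    # $1 = step to stop after (lets the demo simulate an interruption)
    stop_after=$1
    if [ -f "$checkpoint" ]; then
        # Resume from the saved state.
        read -r step total < "$checkpoint"
    else
        step=0
        total=0
    fi
    while [ "$step" -lt "$stop_after" ]; do
        step=$((step + 1))
        total=$((total + step))        # stand-in for real work
        # Write to a temporary file, then rename it, so an interruption
        # mid-write never leaves a half-written checkpoint.
        echo "$step $total" > "$checkpoint.tmp"
        mv "$checkpoint.tmp" "$checkpoint"
    done
}

rm -f "$checkpoint"
run_steps 4                # first run, "interrupted" after step 4
run_steps "$total_steps"   # second run picks up at step 5
read -r final_step final_total < "$checkpoint"
echo "finished step $final_step with total $final_total"
```

Saving after every step is only sensible when steps are slow. For fast steps, saving every few minutes is usually enough.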

Many programs will run much faster if a little thought is given to optimising the code. Once programs run for days, even an improvement of a few percent becomes significant. See

for ways of speeding your programs up.

If your program requires interaction you may need to rewrite it so that interaction isn't required. See the Command line options section for help.
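
As a rough illustration, a shell script can take its settings from command-line options using the standard getopts builtin instead of asking questions while it runs. The option letters (-n, -o) and variable names below are invented for the example.

```shell
#!/bin/sh
# Read settings from options rather than prompting for them.
parse_options() {
    iterations=100              # defaults used when an option is omitted
    output_file="results.txt"
    OPTIND=1
    while getopts "n:o:" opt; do
        case "$opt" in
            n) iterations=$OPTARG ;;
            o) output_file=$OPTARG ;;
            *) echo "usage: $0 [-n iterations] [-o output_file]" >&2
               return 1 ;;
        esac
    done
}

# In a real script this line would be: parse_options "$@"
parse_options -n 5000 -o run1.txt
echo "iterations=$iterations output_file=$output_file"
```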

Diagnostic and monitoring commands

You may find these commands useful when monitoring the progress of your program or when reporting problems.

  • more /etc/centos-release - tells you the version of the CentOS operating system installed (if any)
  • hostname - displays the name of the machine you're on
  • top - textually displays load average, size and cpu-load of processes, etc. It updates every few seconds. Type 'q' to quit.
  • uname -a - the kernel name and version, the name of the machine you're on, etc
  • gnome-system-monitor - graphical output showing how busy each CPU core is, etc
  • nproc - shows how many cores are available
  • getconf -a - shows how much memory, cache, etc the machine has
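
When reporting a problem it can help to gather several of these outputs in one file. The sketch below does that. The report file name and the run_if_available helper are our own inventions, and commands that aren't installed are simply noted as missing.

```shell
#!/bin/sh
# Hypothetical report file name.
report="${REPORT_FILE:-machine_report.txt}"

# Print "label: output" for a command, or note that it isn't installed.
run_if_available() {
    label=$1
    shift
    if command -v "$1" >/dev/null 2>&1; then
        echo "$label: $("$@" 2>&1)"
    else
        echo "$label: ($1 not available)"
    fi
}

{
    run_if_available "Host" hostname
    run_if_available "Kernel" uname -a
    run_if_available "Cores" nproc
    run_if_available "Uptime" uptime
    if [ -r /etc/centos-release ]; then
        echo "OS: $(cat /etc/centos-release)"
    fi
} > "$report"

cat "$report"
```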

Troubleshooting

Your program may fail for several reasons:

  • Using too much CPU - the system should be set up so that there's no limit to your CPU usage. Confirm that by typing
    ulimit -t
    You should get the reply "unlimited". (In bash, ulimit with no options reports the file-size limit rather than the CPU-time limit.)
  • More limits - if you type
    ulimit -a
    you'll get a list of some other limits. Some of these are rather esoteric, but the "open files" limit (the number of simultaneously open files a process can have) sometimes comes into play. Try to close files when you've finished using them.
  • Using too much memory - maybe you have a "memory leak". Each time your program goes round a loop it may ask for more memory until finally there's no more memory left. Try to free memory that you no longer need. You can use the "top" program to monitor memory usage. See the Big Processes - Memory issues page for details.
  • The machine was rebooted - For details about when the machine was last rebooted, type
    uptime
    
  • There's a bug in your code that's only triggered after a certain number of iterations or when arrays reach a certain size (because of an unexpected divide-by-zero, or a variable value that becomes bigger than can fit in a variable of that type, etc)
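
Some of these checks can be scripted. The sketch below prints the CPU-time and open-files limits, then samples the memory use of a running process a couple of times. It assumes a Linux machine (it reads /proc), and the background sleep is just a stand-in for your own job.

```shell
#!/bin/sh
# Limits that most often affect long jobs.
cpu_limit=$(ulimit -t)          # CPU time in seconds, or "unlimited"
open_files_limit=$(ulimit -n)   # simultaneously open files per process
echo "CPU time limit: $cpu_limit"
echo "Open files limit: $open_files_limit"

# Print the resident memory of a process, as reported by Linux in /proc.
memory_of() {
    if [ -r "/proc/$1/status" ]; then
        grep '^VmRSS:' "/proc/$1/status"
    else
        echo "no readable /proc entry for PID $1"
    fi
}

# Demo: sample a short-lived background job (stand-in for a real one).
sleep 2 &
job_pid=$!
for sample in 1 2; do
    echo "sample $sample: $(memory_of "$job_pid")"
    sleep 1
done
wait "$job_pid"
```

If the memory figure keeps climbing from one sample to the next over a long run, that's a hint that the program may be leaking memory.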

Signals are messages that are sent to processes. Typing

    man 7 signal

will show you a list of them. If your process receives a "SIGSEGV" signal for example, then that generally means a pointer has gone wrong (it's tried to access a piece of memory it's not allowed to) and typically indicates a code bug (most frequently trying to dereference a null pointer). Some signals (e.g. "SIGINT") can be caught or ignored if you choose to do so, but "SIGKILL" and "SIGSTOP" can't: "SIGKILL" always terminates your program and "SIGSTOP" always suspends it. It's possible to add a signal handler to your code to deal with signals. Even if you can't protect your program from being stopped, you might be able to record why it stopped. The Unix Signals and Forking page has some information for C/C++ users.
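
For shell scripts, the trap builtin plays the role of a signal handler. The sketch below (file names and the endless loop are made up for the demo) starts a small job, sends it SIGTERM, and shows that the job recorded why it stopped before exiting.

```shell
#!/bin/sh
# A small job that logs which signal stopped it (hypothetical file names).
cat > job_with_trap.sh <<'EOF'
#!/bin/sh
# Log the signal, then exit with the conventional 128+N status.
on_signal() {
    echo "$(date): stopped by SIG$1" >> job_status.log
    exit "$2"
}
trap 'on_signal TERM 143' TERM
trap 'on_signal INT 130' INT

# Stand-in for real work. Short sleeps keep the script responsive,
# because the shell runs a trap only after the current command finishes.
while true; do
    sleep 1
done
EOF

rm -f job_status.log
sh job_with_trap.sh &
job_pid=$!
sleep 1                  # give the job time to install its handlers
kill -TERM "$job_pid"    # SIGTERM is what kill sends by default
exit_code=0
wait "$job_pid" || exit_code=$?
echo "job exited with status $exit_code"
cat job_status.log
```

Nothing like this can help against SIGKILL (kill -9), because that signal never reaches the handler.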

Using many machines

Outside term (especially over the summer and overnight) the DPO terminals aren't heavily used. It's possible (though tedious) to log into many of them and run programs simultaneously. Of course, you shouldn't stop others being able to work, and at any time someone may reboot a machine, but you might get some useful work done.
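
A loop can take some of the tedium out of this. The sketch below is a dry run by default: it only prints the ssh commands it would use. The machine names, the program path and the idea of passing a part number to the program are all placeholders that you'd replace with your own.

```shell
#!/bin/sh
# Hypothetical machine names and job command; substitute real ones.
hosts="machine1 machine2 machine3"
job="my/program"
dry_run="${DRY_RUN:-yes}"    # set DRY_RUN=no to really start the jobs

launch_all() {
    part=0
    for host in $hosts; do
        part=$((part + 1))
        # Redirecting input and output lets ssh return straight away
        # instead of waiting for the remote job to finish.
        remote_cmd="nohup $job $part > job_$part.log 2>&1 < /dev/null &"
        if [ "$dry_run" = "yes" ]; then
            echo "ssh $host \"$remote_cmd\""
        else
            ssh "$host" "$remote_cmd"
        fi
    done
}

launch_all
```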

By tl136 and js138