Department of Engineering


Grid Engine


Preparing your code
Setting up
Running jobs (Resource limits, Matlab, ABAQUS)
Graphical interface: qmon
Command line interface
Troubleshooting
Documentation on other sites

If you want to run programs while you aren't logged in, you can use our Grid Engine facility. This lets you use the CPU-power of CUED Teaching System Linux-based machines when they're not used for other tasks. Outside term, several machines are available all day. During term, machines are likely to run your programs only at weekends and on weekdays between 6pm and 9am, but you can submit your programs at any time. At its simplest you

create a text file (called forthegrid, say) in your home folder containing the following text (where myprogramname is the name of your program which is in the same folder)
```
  #!/bin/sh
  #$ -S /bin/bash
  ./myprogramname
```
log into one of the Teaching System's Linux Servers (by typing slogin ts-access for example)

type

  chmod u+x forthegrid
  source /usr/local/apps/gridengine/blades/common/settings.sh
  qsub -m be ./forthegrid

You'll be mailed when your program starts and finishes. Your program will produce the files it usually produces. Text output will go into a file called forthegrid.o*. Most of the rest of this page tells you what to do if things go wrong.

Preparing your code

Some programs will work under Grid Engine with no extra work. Others may require recompiling or rewriting. Many will run much faster if a little thought is given to optimising the code. Once programs run for days, even an improvement of a few percent becomes significant. See

for ways of speeding your programs up.

Under Grid Engine your program will be run by default in your home directory, and your PATH (the list of places where programs are looked for) will be /usr/local/bin:/bin:/usr/bin. To start your main program in a different directory, use cd in the script file - see below - or use qsub -cwd ... instead of qsub .... The "-cwd" option makes your program run in the same directory that qsub was run in. You can add directories to the PATH too.

If your program requires interaction you'll have to rewrite it so that interaction isn't required. See the Command line options section for help.
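One common way to remove interaction, assuming your program reads its answers from standard input, is to supply those answers from a file or a here-document. In the sketch below ask_settings is a hypothetical stand-in for an interactive program, and the two answers are purely illustrative.

```shell
#!/bin/sh
# ask_settings is a hypothetical stand-in for a program that normally
# prompts the user and reads the answers from the keyboard.
ask_settings() {
    read -r iterations
    read -r tolerance
    echo "running with iterations=$iterations tolerance=$tolerance"
}

# Under Grid Engine there is no keyboard, so supply the answers on
# standard input instead. A file works too: ask_settings < answers.txt
result=$(ask_settings <<EOF
1000
0.001
EOF
)
echo "$result"
```

If your program takes its settings some other way (a dialog box, say), this approach won't help and the program itself will need changing.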

Note that your program won't run much faster than it would on your own machine, but Grid Engine lets you run many programs at once, so if you structure your work appropriately you can increase your work-rate by an order of magnitude or so.
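A minimal sketch of structuring work as many independent jobs, assuming your program accepts a parameter on its command line. The parameter values and the sweep_ file names are hypothetical, and the script only prints the qsub commands it would run, so you can check them before submitting anything.

```shell
#!/bin/sh
# Write one small job script per parameter value, then show the qsub
# command that would submit it. All names here are illustrative.
workdir=$(mktemp -d)
count=0
for param in 0.1 0.2 0.5; do
    job="$workdir/sweep_$param"
    cat > "$job" <<EOF
#!/bin/sh
#\$ -S /bin/bash
./myprogramname $param
EOF
    chmod u+x "$job"
    # Dry run: remove the "echo" to submit the job for real.
    echo qsub "$job"
    count=$((count + 1))
done
echo "prepared $count job scripts in $workdir"
```

Each generated script has the same shape as forthegrid above, so the jobs can run on different machines at the same time.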

Setting up

You can use Grid Engine only from the Teaching System's Linux Servers, so you'll need to log into one of those before doing anything else. You'll then need to set your environment up by running

        source /usr/local/apps/gridengine/blades/common/settings.sh

if your login shell is bash (the default), or by running

        source /usr/local/apps/gridengine/blades/common/settings.csh

if your login shell is csh or tcsh.

You can cause this initialisation to always happen when you log in by editing the appropriate shell initialisation file (as documented in various places including CUED's shell script page).
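For bash users, the addition might look like the sketch below. It assumes ~/.bashrc is the file your login reads (check the shell script page if unsure); the variable name sge_settings is made up for this example.

```shell
# Possible addition to ~/.bashrc (bash users only).
# The existence check keeps logins working on machines where
# Grid Engine is not installed.
sge_settings=/usr/local/apps/gridengine/blades/common/settings.sh
if [ -r "$sge_settings" ]; then
    . "$sge_settings"
fi
```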

Running jobs

To run jobs you need to write a shell script. Even if the program you actually want to run is a compiled program, Grid Engine insists on a script being submitted, though the script need do no more than call your program directly.

A typical minimal script (with comments) would be

        #!/bin/sh
        #
        # the next line is a "magic" comment that tells codine to use bash
        #$ -S /bin/bash
        #
        # if you want to start in a directory other than your home
        # directory, uncomment and change the next line
        # cd ~/folder/to/start/in
        # if you want to add directories to the PATH, uncomment and
        # change the next line
        # export PATH=$PATH:/directory/to/add:/other/directory/to/add
        # Now for some real work
        myprogramname argument1 argument2

You need to make this script executable - see the Unix groups and file permissions page, or just type

        chmod u+x scriptname

Now you can submit a job on the queue:

        qsub ./scriptname

You should get a response like

        Your job 24 ("scriptname") has been submitted.

The number (in this case 24) is the "job ID", which you'll need if you're going to cancel the job or report a bug. If there's a machine free to run your job, execution will begin straight away. Otherwise your job joins the queue of all the other jobs that are pending.
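If you submit jobs from a script of your own, you may want to keep the job ID for later. The sketch below picks it out of the response text; it assumes the response has exactly the form shown above, and it uses that example text rather than calling qsub so that it runs anywhere.

```shell
#!/bin/sh
# In real use you would capture the response with:
#   response=$(qsub ./scriptname)
# Here the example response from the text is used instead.
response='Your job 24 ("scriptname") has been submitted.'

# Pick out the run of digits that follows "Your job".
job_id=$(printf '%s\n' "$response" | sed -n 's/^Your job \([0-9][0-9]*\) .*/\1/p')
echo "job ID: $job_id"
```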

Resource limits

By default, several resources that your program can use are limited. Currently only one queue exists, which is configured to kill jobs which run for more than 168 hours.

Two of the resources that can be measured and restricted are the "real time" a job runs for and the CPU time it uses. On the (default) queue a job is killed if it runs for more than 168 hours of real time, or uses more than 168 hours of CPU time. Limiting both is an attempt to balance allocation fairly, so as not to allow

- jobs that use hardly any CPU but just lock out a job slot (which would be possible if only the CPU time were limited)
- jobs that fork many child processes that run in parallel, and thus grab more than a fair share of a machine with 4 jobs running on it, the others being single processes (which would be possible if only the real time were limited)

Jobs will receive a SIGXCPU signal when they hit 168 hours of real time or 168 hours of CPU time, and a SIGKILL signal when they hit 168 hours 10 minutes of real time or 168 hours 10 minutes of CPU time.
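For very long runs it can help if the job script keeps track of elapsed real time, so that the work can stop cleanly before the limit is reached. The sketch below uses the 168-hour and 10-minute figures from the text; the 30-minute safety margin and the time_left function are assumptions made for illustration, and only real time is tracked, not CPU time.

```shell
#!/bin/sh
# Limits taken from the text: warning at 168 hours, kill 10 minutes later.
limit_seconds=$((168 * 60 * 60))
kill_seconds=$((limit_seconds + 10 * 60))
safety_margin=$((30 * 60))   # hypothetical: aim to stop 30 minutes early

start=$(date +%s)

# Succeeds while the elapsed real time is still inside the margin.
time_left() {
    now=$(date +%s)
    [ $((now - start)) -lt $((limit_seconds - safety_margin)) ]
}

if time_left; then
    echo "still within the limit ($limit_seconds s; kill at $kill_seconds s)"
fi
```

A job script could call time_left before starting each new stage of work and save its results instead once the check fails.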

Matlab

If you want to run a matlab command my_routine, create a script like the following, and run it as before

        #!/bin/sh
        #
        # the next line is a "magic" comment that tells codine to use bash
        #$ -S /bin/bash
        #
        export DISPLAY=""
        matlab -nojvm -r my_routine

The DISPLAY="" line stops matlab trying (and failing) to display graphics (though you can still use graphics commands and print the results into a file). The -nojvm option turns off unneeded facilities. Note that after -r you don't put a filename - for example, something like matlab -nojvm -r project/test1.m won't work, though cd project; matlab -nojvm -r test1 should. Note also that by default matlab will be run from your home directory, so any files produced will be there too, by default.

Make sure you understand matlab's notion of "path" otherwise the routines you call might not be found. If you type "path" inside matlab you'll get a list of directories where matlab will look for routines. It will also look in the current directory. The output that normally goes to the command window can be saved in a file. See the Grid Engine from the command line page for details.

ABAQUS

See submitting ABAQUS jobs to gridengine

Graphical interface: qmon

Grid Engine provides a graphical interface for monitoring queues, submitting jobs, and so on. This can be started by running:

        qmon

The iconic representations of the different functions aren't easily interpretable, but hovering the mouse over them gives useful pop-up descriptions. You'll find options to list available machines, submit jobs in other ways, etc. For example, the "Submit Jobs" panel offers advanced options so that you can choose when to be mailed about the job's progress (and control many other things too).

Command line interface

For some purposes it is far easier to use the command line to control and monitor Grid Engine tasks. For example

```
        qsub -m be ./scriptname 
```
means that you'll be e-mailed when your job begins and ends.
```
        qhost
```
tells you about the participating machines.
```
        qstat -f -u "*" 
```
shows you what state jobs are in.

See the Grid Engine from the command line page for more extensive information.

Troubleshooting

Several things can go wrong when using Grid Engine. It's worth checking a simple case to confirm that the mechanism is working at all. Note that except for having more time, your jobs have no extra privileges (no extra RAM or disc space, for example), so if the script you're going to run with Grid Engine doesn't start when you run it normally from the command line, it's not going to work with Grid Engine.

The output and error messages that you'd normally see in the terminal window are by default put into files with the same name as the program you're running, but with added suffixes - .e for errors and .o for output - followed by a number. Look in these files for clues.
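A small sketch of how the file names are built, assuming the trailing number is the job ID that qsub reported when you submitted the script. The script name and job ID used here are the example values from earlier on this page.

```shell
#!/bin/sh
# Assumption: the trailing number is the job ID reported by qsub.
script=forthegrid
job_id=24
out_file="$script.o$job_id"
err_file="$script.e$job_id"
echo "output: $out_file"
echo "errors: $err_file"

# Show the last lines of each file if it exists in the current directory.
for f in "$out_file" "$err_file"; do
    if [ -f "$f" ]; then
        tail -n 20 "$f"
    fi
done
```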

If your program isn't producing output, check that
- the program's ever starting - it's easy to make typing errors when naming your program.
- you're looking in the right directory for output files - by default they'll be in your home folder
- the program isn't running out of time.
- the program doesn't require interaction or a screen to produce graphics. If you want to simulate the graphic-less environment of Grid Engine, start a new xterm window, type "unset DISPLAY", and then run your program.
Don't expect output files to be updated as promptly as they would be if you were running the program live - updating might be done in bursts. In most languages you can force the file to be updated by flushing the output buffers (using fflush in C, for example).
If you want to monitor your program's progress, use qstat -f to find out where your program is running, log in to that machine, then type top to get an idea of what's happening. The output of top is updated every few seconds. It shows some machine status information, then a league-table of process activity - e.g.
        last pid: 21455;  load averages:  0.65,  0.76,  0.74    11:37:37
        239 processes: 229 sleeping, 10 running
        CPU: 0.0% usr, 6.6% nice, 0.4% sys, 93.0% idle, 0.0% block, 0.0% intr
        Memory: Real: 187M/307M act/tot  Virtual: 231M/397M act/tot  Free: 444M

        TTY     PID USERNAME PRI NICE  SIZE   RES STATE  TIME   CPU COMMAND
        ttyq8 15128 tpl      152    4 5871K 7661K run    4:07 0.06% mozilla-bin
        ...

Don't worry if the state is sleep - processes are likely to sleep from time to time. A stopped process has been suspended and will continue running when it has a chance. The SIZE, RES (short for "resident size"), TIME and CPU columns are useful. If the top process is using a lot of CPU, has been running for a while, and isn't running under Grid Engine, then it may be a "runaway" process (a process that's gone out of control) or a user may be running the program anti-socially. In either case the operators can take action if you mail them the details.
If your program isn't working as expected, you can wrap a script file (a shell script) around it so that you can collect more diagnostic information and any error messages from your program. An example of such a script is
        #!/bin/sh
        #
        # the next line is a "magic" comment that tells codine to use bash
        #$ -S /bin/bash
        #
        date
        hostname
        df
        pwd
        printenv
        time mylongprogram

If you put this into a file called (for sake of argument) testing, then make it executable by typing chmod u+x testing you can run it by doing
```
  
     qsub -m be  -o outputfile -e errorfile ./testing
```
When the program's finished you'll have a file called outputfile containing useful diagnostic information and another file called errorfile which will display the normal error messages produced by your program (called mylongprogram in this example). Calling "time mylongprogram" rather than "mylongprogram" directly gives you extra information (CPU time spent, etc) in the output file. The mail message you'll get at the end of execution will also contain useful information. If a mail message says something like
```
    QThread::start: thread creation error: Cannot allocate memory
```
or
        ... Set in error state ... Use "qmod -c jobid" to clear job error state once the problem is fixed.

then forward the mail to sysman.
If after running qsub you get a message saying
        Unable to run job: denied: project "defaultproject" does not exist. Exiting.
then forward the mail to sysman.
If your job hasn't started for days because someone has dozens of jobs ahead of yours, it's worth mailing the operators.
If your program stops and you're mailed a message saying something like
```
   Job 308 (scriptname.sh) Aborted
   ...
   Signal = USR1
   ...
   failed assumedly after job because:
   job 308.1 died through signal USR1 (10)
```
it's likely that Grid Engine has warned your program (by sending a USR1 signal) that some soft limit has been exceeded, or that a KILL signal will be sent in 60 seconds (presumably for exceeding a hard limit). You might wish to include a signal handler in your code so that you can try to save data on receipt of a USR1 signal.
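If your work is driven by a shell script, a handler can be set up with the shell's trap command. The sketch below is only an illustration: the checkpoint file, the save_state function and the counting loop are all hypothetical, and the script sends USR1 to itself to stand in for Grid Engine's warning. Note that a shell runs a trap handler only once the current foreground command has finished, so a long-running compiled program would need a handler of its own.

```shell
#!/bin/sh
#$ -S /bin/bash
# Hypothetical checkpoint file; use a location that suits your job.
checkpoint=$(mktemp)
progress=0
saved=no

# Handler: record how far the work has got.
save_state() {
    echo "progress=$progress" > "$checkpoint"
    saved=yes
}
trap save_state USR1

# Stand-in for real work. Part-way through, simulate the warning
# from Grid Engine by sending USR1 to this script.
for i in 1 2 3 4 5; do
    progress=$i
    if [ "$i" -eq 3 ]; then
        kill -s USR1 $$
    fi
done

echo "saved=$saved"
cat "$checkpoint"
```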

Documentation on other sites

The main web site about Grid Engine is the one provided by Sun.

© Cambridge University Engineering Dept
Tim Love (tpl)
(from information provided by js138, pjb1008, and jpmg)
Last updated: February 2012