Department of Engineering
If you want to run programs while you aren't logged in, you can use our Grid Engine facility. This lets you use the CPU power of CUED Teaching System Linux-based machines when they're not being used for other tasks. Outside term, several machines are available all day. During term, machines are likely to run your programs only at weekends and on weekdays between 6pm and 9am, but you can submit your programs at any time. At its simplest you put lines like these into a file (called forthegrid here)
#!/bin/sh
#$ -S /bin/bash
./myprogramname
and then type
chmod u+x forthegrid
source /usr/local/apps/gridengine/blades/common/settings.sh
qsub -m be ./forthegrid
You'll be mailed when your program starts and finishes. Your program will produce the files it usually produces. Text output will go into a file called forthegrid.o*. Most of the rest of this page tells you what to do if things go wrong.
Some programs will work under Grid Engine with no extra work. Others may require recompiling or rewriting. Many will run much faster if a little thought is given to optimising the code. Once programs run for days, even an improvement of a few percent becomes significant. See
for ways of speeding your programs up.
Under Grid Engine your program will be run by default in your home directory, and your PATH (the list of places where programs are looked for) will be /usr/local/bin:/bin:/usr/bin. To start your main program in a different directory, use cd in the script file - see below - or use qsub -cwd ... instead of qsub .... The "-cwd" option makes your program run in the same directory that qsub was run in. You can add directories to the PATH too.
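As a rough sketch of how a script file might change directory and extend the PATH before starting the program (the variable JOB_DIR and the directory $HOME/bin are illustrative assumptions, not part of the CUED setup):

```shell
#!/bin/sh
#$ -S /bin/bash
# JOB_DIR is a hypothetical variable: replace $HOME with the directory
# that the program should be started in.
JOB_DIR="$HOME"
cd "$JOB_DIR" || exit 1

# Append a directory of your own to the default search path.
# $HOME/bin is only an example.
PATH="$PATH:$HOME/bin"
export PATH

echo "Working in $(pwd) with PATH=$PATH"
```

The line that starts your program would follow the echo line. If the only change you need is the starting directory, submitting with qsub -cwd may be simpler than editing the script.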
If your program requires interaction you'll have to rewrite it so that interaction isn't required. See the Command line options section for help.
Note that your program won't run much faster than it would on your own machine, but Grid Engine lets you run many programs at once, so if you structure your work appropriately you can increase your work-rate by an order of magnitude or so.
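One way to structure work as many independent jobs is to submit the same script several times with a different argument each time. The sketch below assumes a script called scriptname that takes a data set label as its first argument; the function name submit_all and the QSUB variable are hypothetical helpers used here so that the loop can be tried as a dry run.

```shell
#!/bin/bash
# submit_all ARG...: submit ./scriptname once for each argument.
# Arguments placed after the script name on the qsub command line
# are passed on to the script.
# QSUB is a hypothetical override so the commands can be printed
# instead of run; when it is unset the real qsub is used.
submit_all() {
    for arg in "$@"; do
        ${QSUB:-qsub} ./scriptname "$arg"
    done
}

# Dry run: print the three qsub commands without submitting anything.
QSUB="echo qsub" submit_all 1 2 3
```

Once the printed commands look right, calling submit_all 1 2 3 without the QSUB setting would submit the jobs for real.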
You can use Grid Engine only from the Teaching System's Linux Servers, so you'll need to log into one of those before doing anything else. You'll then need to set your environment up by running
source /usr/local/apps/gridengine/blades/common/settings.sh
if your login shell is bash (the default), or by running
source /usr/local/apps/gridengine/blades/common/settings.csh
if your login shell is csh or tcsh.
You can cause this initialisation to always happen when you log in by editing the appropriate shell initialisation file (as documented in various places including CUED's shell script page).
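For bash users, lines like the following could be added to the initialisation file. This is a sketch that assumes the file in question is ~/.bashrc; GRID_SETTINGS is a hypothetical variable name used only to keep the path in one place.

```shell
# Set up Grid Engine at login, but only if the settings file is readable,
# so that logging in still works on machines without Grid Engine.
GRID_SETTINGS=/usr/local/apps/gridengine/blades/common/settings.sh
if [ -r "$GRID_SETTINGS" ]; then
    . "$GRID_SETTINGS"
fi
```

The guard means the same initialisation file can be shared between machines that have Grid Engine installed and machines that don't.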
In order to run jobs, you will need to write a shell script. Even if the program you actually want to run is a compiled program, Grid Engine insists on a script being submitted, though the script need do no more than call your program directly.
A typical minimal script (with comments) would be
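As a sketch of such a script, the commands below write one into a file called scriptname from the shell prompt. The program name myprogramname is a placeholder for your own program, and the comments are illustrative.

```shell
# Create the file "scriptname" containing a minimal job script.
cat > scriptname <<'EOF'
#!/bin/sh
# Lines that start with "#$" are read by Grid Engine as options.
# This one asks for the script to be run by bash.
#$ -S /bin/bash
# Start the program. Jobs begin in the home directory by default,
# so the path is relative to that.
./myprogramname
EOF
```

You could equally type the lines between the two EOF markers into the file with a text editor.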
You need to make this script executable - see the Unix groups and file permissions page, or just type
chmod u+x scriptname
Now you can submit a job on the queue:
qsub ./scriptname
You should get a response like
Your job 24 ("scriptname") has been submitted.
The number (in this case 24) is the "job ID", which you'll need if you're going to cancel the job or report a bug. If there's a machine free to run your job, execution will begin straight away. Otherwise your job joins the queue of all the other jobs that are pending.
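If you submit jobs from a script of your own, you may want to record the job ID automatically. The sketch below assumes the response has exactly the form shown above; job_id_from_response is a hypothetical helper, not a Grid Engine command.

```shell
#!/bin/bash
# job_id_from_response TEXT: print the number that follows "Your job ".
job_id_from_response() {
    printf '%s\n' "$1" | sed -n 's/^Your job \([0-9][0-9]*\) .*/\1/p'
}

# The sample response from the text; in practice this might come from
# response=$(qsub ./scriptname)
response='Your job 24 ("scriptname") has been submitted.'
job_id=$(job_id_from_response "$response")
echo "Job ID: $job_id"
```

With the ID in a variable, a job can later be cancelled with qdel "$job_id".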
By default, several resources that your program can use are limited. Currently only one queue exists, which is configured to kill jobs which run for more than 168 hours.
Among the resources that a job uses that can be measured and restricted are the "real time" it runs for and the CPU time it uses. The current restriction on the (default) queue is that a job will be killed if it runs for more than 168 hours of real time, or uses more than 168 hours of CPU time. This is an attempt to correctly balance allocation so as not to allow
Jobs will receive a SIGXCPU signal when they reach 168 hours of real or CPU time, and a SIGKILL signal when they reach 168 hours 10 minutes of real or CPU time.
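A long job that works in repeated steps can watch the clock and stop tidily before the limit is reached. The following is only a sketch: the 30 minute safety margin and the function name time_left are assumptions, and it measures real time from the start of the script using bash's built-in SECONDS counter, so it says nothing about CPU time.

```shell
#!/bin/bash
# The limits quoted above, converted to seconds.
LIMIT_SECONDS=$((168 * 60 * 60))            # 604800: SIGXCPU is sent
KILL_SECONDS=$((LIMIT_SECONDS + 10 * 60))   # 605400: SIGKILL is sent
SAFETY_MARGIN=$((30 * 60))                  # assumption: stop 30 minutes early

# time_left: succeeds while there is still room for another step.
# SECONDS is maintained by bash and counts seconds since the script began.
time_left() {
    [ $((SECONDS + SAFETY_MARGIN)) -lt "$LIMIT_SECONDS" ]
}

if time_left; then
    echo "Still within the limit after $SECONDS seconds"
fi
```

A script could call time_left before each step of its work and save its results once the check fails.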
If you want to run a matlab command my_routine, create a script like the following, and run it as before
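A script along these lines can be created with the commands below. This is a sketch rather than the exact listing: the file name matlabjob is hypothetical, and the script contains only the pieces that the surrounding text describes (the DISPLAY="" line, the -nojvm option, and -r followed by the command name).

```shell
# Create the file "matlabjob" containing a job script for matlab.
cat > matlabjob <<'EOF'
#!/bin/sh
#$ -S /bin/bash
# Stop matlab trying to open a graphics display.
DISPLAY=""
export DISPLAY
# my_routine is a command name, not a file name: no ".m" and no directory.
matlab -nojvm -r my_routine
EOF
```

If matlab does not exit by itself once my_routine has finished, ending my_routine with matlab's exit command may help.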
The DISPLAY="" line stops matlab trying (and failing) to display graphics (though you can still use graphics commands and print the results into a file). The -nojvm option turns off unneeded facilities. Note that after -r you don't put a filename - for example, something like matlab -nojvm -r project/test1.m won't work, though cd project; matlab -nojvm -r test1 should. Note also that by default matlab will be run from your home directory, so any files produced will be there too, by default.
Make sure you understand matlab's notion of "path", otherwise the routines you call might not be found. If you type "path" inside matlab you'll get a list of directories where matlab will look for routines. It will also look in the current directory. The output that normally goes to the command window can be saved in a file. See the Grid Engine from the command line page for details.
See submitting ABAQUS jobs to gridengine
Grid Engine provides a graphical interface for monitoring queues, submitting jobs, and so on. This can be started by running:
qmon
The iconic representations of the different functions aren't easily interpretable, but hovering the mouse over them gives useful pop-up descriptions. You'll find options to list available machines, submit jobs in other ways, etc. For example, the "Submit Jobs" panel offers these advanced options so that you can choose when to be mailed the job's progress (and control many other things too)
For some purposes it is far easier to use the command line to control and monitor Grid Engine tasks. For example
qsub -m be ./scriptname
means that you'll be e-mailed when your job begins and ends.
qhost
tells you about the participating machines.
qstat -f -u "*"
shows you what state jobs are in.
See the Grid Engine from the command line page for more extensive information.
Several things can go wrong when using Grid Engine. It's worth checking a simple case to confirm that the mechanism is working at all. Note that except for having more time, your jobs have no extra privileges (no extra RAM or disc space, for example), so if the script you're going to run with Grid Engine doesn't start when you run it normally from the command line, it's not going to work with Grid Engine.
The output and error messages that you'd normally see in the terminal window are by default put into files with the same name as the program you're running, but with added suffixes - .e for errors and .o for output - followed by a number. Look in these files for clues.
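To make the naming concrete, the sketch below prints the two file names for a given script name and number. It assumes that the number is the job ID reported by qsub; job_output_files is a hypothetical helper.

```shell
#!/bin/bash
# job_output_files NAME ID: print the default output and error file names
# for a job submitted as NAME that was given the number ID.
job_output_files() {
    printf '%s.o%s\n%s.e%s\n' "$1" "$2" "$1" "$2"
}

# For a script called scriptname that ran as job 24:
job_output_files scriptname 24
```

The names printed can then be passed to a command such as cat or tail to read the files.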
Don't worry if the state is sleep - processes are likely to sleep from time to time. A stopped process has been suspended and will continue running when it has a chance. The SIZE, RES (short for "resident size"), TIME and CPU columns are useful. If the top process is using a lot of CPU, has been run for a while, and isn't running under Grid Engine, then it may be a "runaway" process (a process that's gone out of control) or a user may be running the program anti-socially. In either case the operators can take action if you mail them the details.
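A test script that prints diagnostic information around a run of your program can help to narrow problems down. The commands below write one possible version into the file testing. It is a sketch: the particular echo lines are illustrative choices, and mylongprogram stands for your own program.

```shell
# Create the file "testing" containing a diagnostic job script.
cat > testing <<'EOF'
#!/bin/sh
#$ -S /bin/bash
# Record where and when the job ran, and what it could see.
echo "Host:      $(hostname)"
echo "Started:   $(date)"
echo "Directory: $(pwd)"
echo "PATH:      $PATH"
# "time" reports how long the program ran and how much CPU time it used.
time ./mylongprogram
echo "Finished:  $(date)"
EOF
```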
If you put this into a file called (for sake of argument) testing, then make it executable by typing chmod u+x testing you can run it by doing
qsub -m be -o outputfile -e errorfile ./testing
When the program's finished you'll have a file called outputfile containing useful diagnostic information and another file called errorfile which will display the normal error messages produced by your program (called mylongprogram in this example). Calling "time mylongprogram" rather than "mylongprogram" directly gives you extra information (CPU time spent, etc) in the output file. The mail message you'll get at the end of execution will also contain useful information. If a mail message says something like
QThread::start: thread creation error: Cannot allocate memory
then forward the mail to sysman. If the message instead says something like
Job 308 (scriptname.sh) Aborted ... Signal = USR1 ... failed assumedly after job because: job 308.1 died through signal USR1 (10)
it's likely that Grid Engine has warned your program (by sending a USR1 signal) that some soft limit has been exceeded, or that a KILL signal will be sent in 60 seconds (presumably for exceeding a hard limit). You might wish to include a signal handler in your code so that you can try to save data on receipt of a USR1 signal.
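At the level of the job script, a handler can be installed with the shell's trap command. The following is a minimal sketch that covers only the script itself: the handler name on_usr1 and what it does are assumptions, "sleep 1" stands in for the real program, and a compiled program would still need a handler of its own to save its data.

```shell
#!/bin/bash
usr1_seen=no

# Hypothetical handler: note that the warning arrived. A real handler
# might copy partial results somewhere safe instead.
on_usr1() {
    usr1_seen=yes
    echo "USR1 received at $(date): saving what we can" >&2
}
trap on_usr1 USR1

# Start the program in the background and wait for it. While bash is
# in "wait", a trapped signal makes wait return at once and the handler
# then runs; a foreground command would delay the handler until it ended.
sleep 1 &     # stand-in for something like ./mylongprogram
wait $!
```

The same trap line can name other signals too, for example trap on_usr1 USR1 XCPU.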
The main web site about Grid Engine is the one provided by Sun.