 |
Department of Engineering |
 |
 |
Faster, longer programs, and optimisation
I'll deal with general issues first, then continue towards the particular. Some of the easier things to do
also result in the biggest improvements
How computers work
Memory
There's a memory hierarchy - the smallest, fastest memory is in the CPU.
Much more memory is on the disc, but it's slower
- CPU, registers
- L2 cache
- RAM, swap
- Disk, disk cache
Very roughly speaking, the CPU is 8* faster than the L2 cache which is 256* faster than RAM which is 80* faster than Disc.
Best case? (Doing the same operations in a different order may triple the speed of calculation).
Worst case?
CPU
Nowadays there are "dual-core" CPUs, and computers can have more than one
CPU (our Linux Servers each have 4). Unless you're a low-level programmer
though, you won't be able to take explicit advantage of this potential
for parallel programming (unless you're a matlab user - see Matlab Parallelisation). See the Parallel Programming page if you're interested.
If you're writing programs that use huge amounts
of memory, it's worth knowing whether the CPU's are "32 bit" or "64 bit".
The biggest memory address that a "32 bit" CPU can access is about 4G at
the very best, so there's no chance of having arrays that big. The memory
range of a "64 bit" CPU is far greater - the teaching system machines have 64 bit CPUs.
On the teaching system you can get a rough idea of a programs' CPU and memory
usage by running the Computer panel's system monitor
Especially for specialised number-crunching areas, power can be
obtained fairly cheaply.
Some groups in the department use big clusters of cheap machines
to run programs, but unless you belong to those groups you won't be
able to use these machines. Alternatively, in the Guardian (Feb 2008)
an academic said he could run a simulation 130 times faster on a
PlayStation 3 (which has several CPUs/accelerators) than on his desktop PC.
Some CUED people use CUDA (a library that uses an nVidia board) to get
10x speed-up. There's a MATLAB plug-in for CUDA
Big Issues
Do estimates of time/space requirements as early as possible. Ask yourself
- Can I find a faster machine? - On the
Teaching System there's little choice?
Use uname -a
on a Unix machine to find out machine type.
- Do I need to run so many big programs? (phoenix)
- Have I made the most of the compiler's optimisation options? - (why not all the time? Slow, risky with bad code). Type "man g++"
on linux to see what optimisation is done.
- Am I using the fastest Algorithms and most suitable data structures? - sorting, lists, etc;
num of operations O(nlogn);
cache-friendly?
- Is it too late to change language? - (Java? Matlab compilation?). Mixing languages? Not all of a program is going to be number-crunching - you need to get the
data in, call other programs, and display the results. The computing
service suggest python as the "glue" language. At CUED we also have
matlab, which although not free has good graphics capabilities and
thousands of functions. Both matlab and python may be fast enough to
do all but the most intensive parts of your work.
You may well use Fortran or C++ for the number crunching work - see
Identifying the bottleneck
Slow graphics? Slow disc access? Is a particular routine slow? On Unix,
typing top will give some statistics about programs that are running
(how big they are, how much CPU they're using). To get fine-grain information you
need to use a "profiler". On matlab this is easy. If your main program is
called snail, then typing the following inside matlab -
profile on
snail
profile plot
profile off
will display a histogram showing a function-by-function breakdown of performance.
With fortran or C++ on Unix, you'll need to compile using the "+g" option,
run your program (called snail for example), then run prof snail. You'll get output looking a little like this
%Time Seconds Cumsecs #Calls msec/call Name
32.4 0.35 0.35 __strlen20
23.1 0.25 0.60 5770000 0.00 _strlen
20.4 0.22 0.82 10000 0.02 slow
11.1 0.12 0.94 10000 0.01 fast
10.2 0.11 1.05 _mcount
1.4 0.01 1.06 _monstartup
Coding
Trade-offs - Speed/Size/Readability
- Loops - Deeply nested loops merit particular attention.
- Do nothing unnecessary in the loop.
- Keep loops small (some machines have an `instruction cache' (the DPO machines have 16k of data cache and 12k of execution cache (L1) per core)
so that recent instructions can be quickly recalled. If a
loop fits inside the cache, you'll benefit.
- Access arrays sequentially in memory (there's a data cache too,
and paging to consider).
- as a condition compare with 0.
- `unroll' them (i.e. use a list instead of a loop)?
- 2-D arrays - row/column major (C++ is row-major)
- Passing parameters - If you're passing very large items by value,
memory requirements (and hence time requirements) increase. It might be
helpful to make very big data-structures global.
- Avoid recursion.
- Arithmetic - See
If you're doing lots of calculations you'll benefit from reading about Numerical Analysis. It shows you how to solve problems as quickly and accurately as possible - e.g. adding the same numbers in a different order may give a more accurate answer - (x+y) + z doesn't
necessarily equal x + (y+z). For speed
- Use look-up tables. sin(x)
- Use bit-shift rather than + or - rather than * or /.
- Malloc
- If you're using malloc a lot, look up the options to tune malloc.
- Reduce `paging' by keeping program small or at least stop it
nonsequentially accessing memory.
- Matlab? Sparse Arrays; Vectorisation; selective rewriting as mex files
- It's
important to tell the compiler as much as you can about your code
so that it can optimise well. For C++, use `const'
keywords whenever you can. See man pages for compiler directives.
There are many - e.g. -fbranch-probabilities, -fexpensive-optimizations -ffast-math, -fno-trapping-math, -funsafe-math-optimizations, -freorder-functions, -funroll-all-loops. Using +O4 does aggressive, time-consuming
optimisation.
Space
The smaller you make your data structures (and the more localised the access to them)
the better. If your process (i.e. the running program) grows too big, if will run
slowly and may become too big to run. Asking for more disk quota won't help.
See Memory Optimisation for system-specific and language-specific details.
Running long jobs
- GridEngine is on our linux servers
- reaper kills off unattended programs after an hour or so.
- When writing programs for extended use
- Use checkpoints
- Make them non-interactive
See Also