Parallel Programming
Now that computers often have several CPUs (and people may have several computers), people are more interested in parallel programming. This document will introduce you to two free ways to exploit parallelism - OpenMP and MPI. It draws heavily on the following pages:
- OpenMP (a tutorial from Lawrence Livermore National Laboratory)
- OpenMP (Wikipedia)
- MPI (Wikipedia)
- MPI in thirty minutes (Linux Magazine)
Issues
The problem with parallel programming is how to split the task up efficiently. What you want to avoid is
- a lot of communication between the sub-processes (which slows the program down)
- competition for resources (for example, if several processes write to the same file or variable, you won't know which sub-process will have the last word)
- deadlock, where two sub-processes each wait for the other to do something, so neither does anything
At one extreme ("distributed computing") the sub-processes are on different machines and don't share memory. They communicate by "message passing". At the other extreme the sub-processes may be threads in the same process, able to share almost anything. MPI is more oriented towards distributed computing and OpenMP more oriented towards situations where sub-processes can access the same memory, but both are flexible to some extent.
You might hope that using 10 CPUs will be 10 times faster than using one, but there are several reasons why such efficiency is unobtainable:
- Parallelisation has overheads, especially if the sub-processes need to interact
- Many programs are hard to parallelise effectively
- Parallelising a program (a chess playing program for example) may prevent the use of other optimisations
CUED
Some research groups have shelves of computers that they use for number-crunching. The Engineering Department's central system has many dual-core machines in the labs, but it also has many Linux servers with 4 dual-core CPUs. On all of these machines you have access to your files. If you log into one of these and type more /proc/cpuinfo you'll see more details about the CPUs.
It's useful to monitor the usage of CPUs. If you type top you'll see a sort of league table of active processes. By default the %CPU column shows usage as a percentage of a single CPU, so a multi-threaded process can show more than 100. If you type I in the top window, the figure is instead divided by the number of CPUs, so that 100 represents full use of the whole machine. Type q in the window to quit.
The operating system assigns programs to the CPUs. If you want to run your program on several CPUs, you'll have to do the work yourself, though some programs (e.g. Matlab and make) can do much of the work for you.
Note that we run Grid Engine that lets you run many programs simultaneously outside office hours without you needing to change your program. This may be the best option for you.
OpenMP (Open multi-processing)
This can be used from Fortran and C/C++. Newer versions of g++ support it (via the -fopenmp flag). All OpenMP programs begin as a single thread of execution - the master thread - which executes sequentially as usual until asked to create a team of parallel threads. When these finish, they synchronize and terminate, leaving only the master thread. This approach lets you add passages of parallelisation into existing programs fairly painlessly. Here's an example
#include <omp.h>
#include <cstdio>
using namespace std;

int main(int argc, char *argv[]) {
  int th_id, nthreads;
#pragma omp parallel private(th_id)
  {
    th_id = omp_get_thread_num();
    printf("Hello World from thread %d\n", th_id);
#pragma omp barrier
    if (th_id == 0) {
      nthreads = omp_get_num_threads();
      printf("There are %d threads\n", nthreads);
    }
  }
  return 0;
}
Save this into a file (called prog1.cc say) and type

g++ -fopenmp prog1.cc -o prog1
./prog1

to compile and run the program. You should get something like

Hello World from thread 0
Hello World from thread 3
Hello World from thread 1
Hello World from thread 2
There are 4 threads

though the order of the threads may vary.
The #pragma omp parallel private(th_id) line is where the program splits into threads. Each thread has its own private copy of th_id - the id of the thread. The #pragma omp barrier line makes each thread wait until every thread has reached that point (so the thread count is printed after all the greetings); the program then continues in parallel. Why 4 threads? That's the default on this machine - typically the number of CPU cores, or the value of the OMP_NUM_THREADS environment variable if set. Changing the parallel line to #pragma omp parallel num_threads(2) private(th_id) would produce 2 threads.
The following program does something more useful. It computes the dot product of two arrays of 1000 terms by getting each sub-process to sum a part of the series, then adding those subtotals together. (Both arrays hold 1/(i+1), so the dot product is a partial sum of the series 1/1² + 1/2² + ..., which converges to π²/6.) The programmer has to choose the size of the parts (the chunk size).
#include <omp.h>
#include <iostream>
#include <cmath>
using namespace std;

int main() {
  int i, n, chunk;
  float a[1000], b[1000], result;

  // Some initializations, done sequentially
  n = 1000;
  chunk = 100;
  result = 0.0;
  for (i = 0; i < n; i++) {
    a[i] = 1.0/(i+1);
    b[i] = 1.0/(i+1);
  }

#pragma omp parallel for default(shared) private(i) schedule(static,chunk) reduction(+:result)
  for (i = 0; i < n; i++)
    result = result + (a[i] * b[i]);

  std::cout << "Final result=" << result << " so pi=" << sqrt(6*result) << std::endl;
}
Call it prog2.cc and type

g++ -fopenmp prog2.cc -o prog2
./prog2

to compile and run the program. You should get something like

Final result=1.64393 so pi=3.14064

(pi isn't reached exactly because the series is truncated after 1000 terms).
The magic is all done in the #pragma line. The schedule(static,chunk) part means that loop iterations are divided into chunks of size 100. Each chunk is assigned to a thread in a round-robin fashion. The reduction(+:result) part means that all the result variables will be reduced to a single one by using +.
There are many variants of these pragmas, but I hope this is enough to give you a flavour of OpenMP programming.
MPI (Message Passing Interface)
MPI programs are parallel from the start. The sub-processes are assigned at runtime to machines/CPUs by the agent that starts the MPI program, normally called mpirun or mpiexec. At CUED you'll need to install MPI yourself.
Installing MPI
The following, a slight adaptation of the instructions on MPI in thirty minutes, works at CUED, though other sources of MPI exist. These instructions install into a folder called MPI in your home directory.
cd ~
mkdir MPI
cd MPI
wget http://www.open-mpi.org/software/ompi/v1.4/downloads/openmpi-1.4.5.tar.bz2
tar -xjf openmpi-1.4.5.tar.bz2
cd openmpi-1.4.5
./configure --prefix=$HOME/MPI --enable-mpirun-prefix-by-default --enable-static
make
make install
The tar command will take a minute or two. The configure step takes a little longer and is more chatty. The make command takes 10 minutes or so. Once you've done make install you'll have include files in $HOME/MPI/include and libraries in $HOME/MPI/lib. In $HOME/MPI/bin you'll have an mpic++ program to use for compiling, and an mpirun program to run your MPI-based programs.
MPI programming
Put the following into mpi1.cc
#include <cstdio>
#include <mpi.h>
using namespace std;

int main(int argc, char *argv[]) {
  int numprocs, rank, namelen;
  char processor_name[MPI_MAX_PROCESSOR_NAME];

  MPI_Init(&argc, &argv);
  MPI_Comm_size(MPI_COMM_WORLD, &numprocs);
  MPI_Comm_rank(MPI_COMM_WORLD, &rank);
  MPI_Get_processor_name(processor_name, &namelen);
  printf("Process %d on %s out of %d\n", rank, processor_name, numprocs);
  MPI_Finalize();
}
Compile using
$HOME/MPI/bin/mpic++ -o mpi1 mpi1.cc
and run by doing
$HOME/MPI/bin/mpirun -np 4 ./mpi1
You should get something like

Process 0 on khopesh out of 4
Process 2 on khopesh out of 4
Process 3 on khopesh out of 4
Process 1 on khopesh out of 4

(and maybe some warnings too, with instructions on how to suppress them). Again, the order of the processes can vary (and your machine might not be called khopesh). There are 4 processes because -np 4 specified 4 of them. If you have a text file (called names, say) containing names of computers that you can slogin to without needing to provide a password, then
$HOME/MPI/bin/mpirun -hostfile names -np 4 ./mpi1
will run some of the processes on these machines.
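The hostfile is just a plain text file with one machine name per line; Open MPI also lets you say how many processes a machine should take with slots=. The machine names below are made up for illustration:

```text
# hypothetical contents of the file "names"
tw001
tw002 slots=2
```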
We'll now try to calculate pi in the same way as before. Put the following into mpi2.cc
#include <cstdio>
#include <cmath>
#include <cstdlib>
#include <mpi.h>
#include <unistd.h>
#include <string.h>
using namespace std;

int main(int argc, char **argv) {
  int i, n = 1000, nthreads;
  double p_sum, sum;
  char *cpu_name;
  int loop_min, loop_max, tid;

  MPI_Init(&argc, &argv);
  // get this process's rank
  MPI_Comm_rank(MPI_COMM_WORLD, &tid);
  // get the number of processes
  MPI_Comm_size(MPI_COMM_WORLD, &nthreads);

  sum = 0.0;
  p_sum = 0.0;
  cpu_name = (char *)calloc(80, sizeof(char));
  gethostname(cpu_name, 80);
  printf("thread %d running on machine = %s\n", tid, cpu_name);

  // each process sums its own share of the terms 1..n
  loop_min = 1 + (int)((long)(tid + 0) * (long)(n)/(long)nthreads);
  loop_max =     (int)((long)(tid + 1) * (long)(n)/(long)nthreads);
  printf("thread %d loop_min=%i loop_max=%i\n", tid, loop_min, loop_max);
  for (i = loop_min; i <= loop_max; i++)
    p_sum += 1.0/(i*i);
  printf("thread %d partial sum=%f\n", tid, p_sum);

  MPI_Reduce(&p_sum, &sum, 1, MPI_DOUBLE, MPI_SUM, 0, MPI_COMM_WORLD);
  if (tid == 0)
    printf("sum = %f so pi = %f\n", sum, sqrt(6*sum));
  MPI_Finalize();
}
Compile using
$HOME/MPI/bin/mpic++ -o mpi2 mpi2.cc
and run by doing
$HOME/MPI/bin/mpirun -np 4 ./mpi2
You should get something like

thread 2 running on machine = khopesh
thread 2 loop_min=501 loop_max=750
...
sum = 1.643935 so pi = 3.140638

with the lines from the 4 processes interleaved in some order.
Again the programmer has to decide how to split the task up, and again the sub-totals are reduced to a single value. The
MPI_Reduce(&p_sum,&sum,1,MPI_DOUBLE,MPI_SUM,0,MPI_COMM_WORLD);
command sums (hence MPI_SUM) the doubles (hence MPI_DOUBLE) pointed to by &p_sum, one from each process, and puts the answer where &sum points. The 0 is the rank of the root process - the one that receives the result (here rank 0, the master, which then prints it).