|
|
|||
![]() |
Department of Engineering |
| University of Cambridge > Engineering Department > computing help > Languages > C++ |
Contents
Now that computers often have several CPUs, (and people may have several computers) people are more interested in parallel programming. This document will introduce you to 2 free ways to exploit parallelisation - OpenMP and MPI. It draws heavily on the following pagesAt one extreme ("distributed computing") the sub-processes are on different machines and don't share memory. They communicate by "message passing". At the other extreme the sub-processes may be threads in the same process, able to share almost anything. MPI is more oriented towards distributed computing and OpenML more oriented towards situations where sub-processes can access the same memory, but both are flexible to some extent.
You might hope that using 10 CPUs will be 10 times faster than using 1, but there are several reasons why such efficiency is unobtainable
Some research groups have shelves of computers that they use for number-crunching. The Engineering Department's central system has many dual-core machines in the labs, but it also has many Linux Servers with 4 dual-core CPUs. On all of these machines you have access to your files. If you log into one of these and type more /proc/cpuinfo you'll see more details about the CPUs.
It's useful to monitor the usage of CPUs. If you type top you'll see a sort of league table of active processes. The %CPU column by default tells you how many CPUs a process is using (though they might not be using each CPU fully). If you type I in the top window, the units for that column change so that 1 represents full use of a CPU. Type q in the window to quit.
The operating system assigns programs to the CPUs. If you want to run your program on several CPUs, you'll have to do the work yourself, though some programs (e.g. Matlab and make) can do much of the work for you.
Note that we run Grid Engine that lets you run many programs simultaneously outside office hours without you needing to change your program. This may be the best option for you.
This can be used from Fortran and C/C++. Newer versions of g++ support it.
All OpenMP programs begin as a single process - the master thread - which executes sequentially as usual until asked to create a team of parallel threads. When these finish, they synchronize and terminate, leaving only the master thread. This approach lets you add passages of parallelisation into existing programs fairly painlessly. Here's an example
#include <omp.h>
#include <cstdio>
using namespace std;
int main (int argc, char *argv[]) {
int th_id, nthreads;
#pragma omp parallel private(th_id)
{
th_id = omp_get_thread_num();
printf("Hello World from thread %d\n",th_id);
#pragma omp barrier
if ( th_id == 0 ) {
nthreads = omp_get_num_threads();
printf("There are %d threads\n", nthreads);
}
}
return 0;
}
Save this into a file (called prog1.cc say) and type
The #pragma omp parallel private(th_id) line is where the program splits. Each sub-process will have its own private copy of th_id - the id of the thread. The #pragma omp barrier line makes all the threads wait until all other threads reach the same point, then the program continues in parallel. Why 4 threads? That's the default on this machine. Changing the parallel line to #pragma omp parallel num_threads(2) private(th_id) would produce 2 threads.
The following program does something more useful. It sums the dot product of 2 arrays of 1000 terms by getting each sub-process to sum a part of the series, then adding those subtotals together. The programmer has to choose the size of the parts (the chunk size).
#include <omp.h>
#include <iostream>
#include <cmath>
using namespace std;
int main() {
int i, n, chunk;
float a[1000], b[1000],result;
// Some initializations, done sequentially
n = 1000;
chunk = 100;
result = 0.0;
for (i=0; i < n; i++)
{
a[i] = 1.0/(i+1);
b[i] = 1.0/(i+1);
}
#pragma omp parallel for default(shared) private(i) schedule(static,chunk) reduction(+:result)
for (i=0; i < n; i++)
result = result + (a[i] * b[i]);
std::cout << "Final result=" << result << "so pi=" << sqrt(6*result) <<std::endl;
}
Call it prog2.cc and type
There are many variants of these pragmas, but I hope this is enough to give you a flavour of OpenMP programming.
MPI programs are parallel from the start. The sub-process are assigned at runtime to machines/CPUs by the agent that starts the MPI program, normally called mpirun or mpiexec.
At CUED you'll need to install MPI yourself.
The following, a slight adaption of the instructions on MPI in thirty minutes, works at CUED, though other sources of MPI exist. These instructions install into a folder called MPI in your home directory.
cd ~ mkdir MPI cd MPI export http_proxy="www-proxy.eng.cam.ac.uk:3128" wget http://www.open-mpi.org/software/ompi/v1.4/downloads/openmpi-1.4.3.tar.bz2 tar -xjf openmpi-1.4.3.tar.bz2 cd openmpi-1.4.3 ./configure --prefix=$HOME/MPI --enable-mpirun-prefix-by-default --enable-static make make installThe tar command will take a minute or 2. The configure will take a little longer and is more chatty. The make command takes 10 minutes or so. Once you've done make install you'll have include files in $HOME/MPI/include and libraries in $HOME/MPI/lib. In $HOME/MPI/bin you'll have an mpic++ program to use for compiling, and an mpirun program to run your MPI-based programs.
#include <cstdio>
#include <mpi.h>
using namespace std;
int main(int argc, char *argv[]) {
int numprocs, rank, namelen;
char processor_name[MPI_MAX_PROCESSOR_NAME];
MPI_Init(&argc, &argv);
MPI_Comm_size(MPI_COMM_WORLD, &numprocs);
MPI_Comm_rank(MPI_COMM_WORLD, &rank);
MPI_Get_processor_name(processor_name, &namelen);
printf("Process %d on %s out of %d\n", rank, processor_name, numprocs);
MPI_Finalize();
}
Compile using
$HOME/MPI/bin/mpic++ -o mpi1 mpi1.cc(on our linux servers you may need to type
export CPLUS_INCLUDE_PATH=$HOME/MPI/include export LIBRARY_PATH=$HOME/MPI/libfirst) and run by doing
$HOME/MPI/bin/mpirun -np 4 ./mpi1You should get something like
$HOME/MPI/bin/mpirun -hostfile names -np 4 ./mpi1
will run some of the processes on these machines.
We'll now try to calculate pi in the same way as before. Put the following into mpi2.cc
#include <cstdio>
#include <cmath>
#include <cstdlib>
#include <mpi.h>
#include <unistd.h>
#include <string.h>
using namespace std;
int main(int argc, char **argv)
{
int i,n=1000, nthreads;
double p_sum,sum;
char *cpu_name;
int loop_min, loop_max, tid;
MPI_Init(&argc,&argv);
// get the thread ID
MPI_Comm_rank(MPI_COMM_WORLD, &tid);
// get the number of threads
MPI_Comm_size(MPI_COMM_WORLD, &nthreads);
n=1000;
sum = 0.0;
p_sum = 0.0;
cpu_name = (char *)calloc(80,sizeof(char));
gethostname(cpu_name,80);
printf("thread %d running on machine = %s\n",tid,cpu_name);
loop_min = 1 + (int)((long)(tid + 0) * (long)(n)/(long)nthreads);
loop_max = (int)((long)(tid + 1) * (long)(n)/(long)nthreads);
printf("thread %d loop_min=%i loop_max=%i\n",tid,loop_min, loop_max);
for(i=loop_min;i<loop_max;i++)
p_sum += 1.0/(i*i);
printf("thread %d partial sum=%f\n", tid,p_sum);
MPI_Reduce(&p_sum,&sum,1,MPI_DOUBLE,MPI_SUM,0,MPI_COMM_WORLD);
if (tid == 0)
printf("sum = %f so pi = %f\n",sum,sqrt(6*sum));
MPI_Finalize();
}
Compile using
$HOME/MPI/bin/mpic++ -o mpi2 mpi2.ccand run by doing
$HOME/MPI/bin/mpirun -np 4 ./mpi2You should get something like
MPI_Reduce(&p_sum,&sum,1,MPI_DOUBLE,MPI_SUM,0,MPI_COMM_WORLD);
command sums (hence MPI_SUM) the doubles (hence MPI_DOUBLE) pointed to by &p_sum and puts the answer where &sum points. The 0 indicates that thread 0 (the master thread) wants this to be done.
| | Languages | Grid Engine | Matlab and parallel computing | computing help | |