Department of Engineering

IT Services

Parallel Programming

Now that computers often have several CPUs (and people may have several computers), people are more interested in parallel programming. This document introduces two free ways to exploit parallelisation: OpenMP and MPI. It draws heavily on the following pages

Issues

The problem with parallel programming is how to split the task up efficiently. What you want to avoid is

  • a lot of communication between the sub-processes (which slows the program down)
  • competition for resources: for example, if several sub-processes write to the same file or variable, you won't know which one will have the last word (a minimal sketch of such a race appears after this list). You also want to avoid deadlock, where two sub-processes each wait for the other to do something and so end up doing nothing.
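
As an illustration of the resource-competition point, here is a minimal sketch (plain C++ threads rather than OpenMP or MPI, and purely for illustration) of two sub-processes updating the same variable. The unsynchronised counter can lose updates; the std::atomic one cannot.

#include <atomic>
#include <iostream>
#include <thread>

int plain_counter = 0;                // shared and unprotected
std::atomic<int> atomic_counter{0};   // shared, but each increment is indivisible

void work() {
  for (int i = 0; i < 100000; i++) {
    plain_counter++;                  // race: both threads may read/modify/write at once
    atomic_counter++;                 // safe
  }
}

int main() {
  std::thread t1(work), t2(work);
  t1.join();
  t2.join();
  // plain_counter is usually less than 200000; atomic_counter is always exactly 200000
  std::cout << plain_counter << " " << atomic_counter << std::endl;
  return 0;
}

Compile with g++ -std=c++11 -pthread if you want to try it.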

At one extreme ("distributed computing") the sub-processes are on different machines and don't share memory; they communicate by "message passing". At the other extreme the sub-processes may be threads in the same process, able to share almost anything. MPI is more oriented towards distributed computing and OpenMP towards situations where sub-processes can access the same memory, but both are flexible to some extent.

You might hope that using 10 CPUs will be 10 times faster than using one, but there are several reasons why such efficiency is unobtainable:

  • Parallelisation has overheads, especially if the sub-processes need to interact
  • Many programs are hard to parallelise effectively
  • Parallelising a program (a chess playing program for example) may prevent the use of other optimisations

CUED

Some research groups have shelves of computers that they use for number-crunching. The Engineering Department's central system has many dual-core machines in the labs, and it also has many Linux servers with 4 dual-core CPUs. On all of these machines you have access to your files. If you log into one of them and type more /proc/cpuinfo you'll see details about its CPUs.

It's useful to monitor how busy the CPUs are. If you type top you'll see a sort of league table of active processes. By default the %CPU column shows a process's usage as a percentage of a single CPU, so a process keeping two CPUs busy shows roughly 200%. Typing I in the top window toggles this so that the column is scaled by the number of CPUs, with 100% representing full use of the whole machine. Type q in the window to quit.

The operating system assigns programs to the CPUs. If you want to run your program on several CPUs, you'll have to do the work yourself, though some programs (e.g. Matlab and make) can do much of the work for you.

Note that we run Grid Engine, which lets you run many programs simultaneously outside office hours without needing to change your program. This may be the best option for you.

OpenMP (Open multi-processing)

OpenMP can be used from Fortran and C/C++; newer versions of g++ support it. All OpenMP programs begin as a single process - the master thread - which executes sequentially as usual until asked to create a team of parallel threads. When these finish they synchronise and terminate, leaving only the master thread. This approach lets you add passages of parallelisation to existing programs fairly painlessly. Here's an example

#include <omp.h>
#include <cstdio>
using namespace std;

int main (int argc, char *argv[]) {
  int th_id, nthreads;
#pragma omp parallel private(th_id)
  {
    th_id = omp_get_thread_num();
    printf("Hello World from thread %d\n",th_id);
    #pragma omp barrier
    if ( th_id == 0 ) {
      nthreads = omp_get_num_threads();
      printf("There are %d threads\n", nthreads);
    }
  }
  return 0;
}

Save this into a file (called prog1.cc say) and type

g++ -fopenmp -o prog1 prog1.cc
./prog1

to compile and run the program. You should get

Hello World from thread 1
Hello World from thread 0
Hello World from thread 2
Hello World from thread 3
There are 4 threads

though the order of the threads may vary.

The #pragma omp parallel private(th_id) line is where the program splits into threads. Each thread gets its own private copy of th_id - the id of that thread. The #pragma omp barrier line makes each thread wait until all the others have reached the same point, after which the program continues in parallel. Why 4 threads? That's the default on this machine (typically one per CPU core). Changing the parallel line to #pragma omp parallel num_threads(2) private(th_id) would produce 2 threads.
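
The thread count can also be chosen at run time rather than in the pragma. As a sketch (not part of the original example), omp_set_num_threads() requests a count from inside the program, and the OMP_NUM_THREADS environment variable does the same from outside:

#include <omp.h>
#include <cstdio>

int main() {
  omp_set_num_threads(2);   // ask for 2 threads in subsequent parallel regions
#pragma omp parallel
  {
    printf("Hello from thread %d of %d\n",
           omp_get_thread_num(), omp_get_num_threads());
  }
  return 0;
}

Running the original program as OMP_NUM_THREADS=2 ./prog1 has the same effect without recompiling.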

The following program does something more useful. It computes the dot product of two arrays of 1000 terms by getting each thread to sum part of the series, then adding the subtotals together. The programmer has to choose the size of the parts (the chunk size).

#include <omp.h>
#include <iostream>
#include <cmath>

using namespace std;

int main() {

  int   i, n, chunk;
  float a[1000], b[1000],result;

  // Some initializations, done sequentially
  n = 1000;
  chunk = 100;
  result = 0.0;
  for (i=0; i < n; i++)
    {
      a[i] = 1.0/(i+1);
      b[i] = 1.0/(i+1);
    }

#pragma omp parallel for default(shared) private(i) schedule(static,chunk) reduction(+:result)
  for (i=0; i < n; i++)
    result = result + (a[i] * b[i]);

  std::cout << "Final result=" << result << " so pi=" << sqrt(6*result) << std::endl;

  return 0;
}

Call it prog2.cc and type

g++ -fopenmp -o prog2 prog2.cc
./prog2

to compile and run the program. You should get

Final result=1.64393 so pi=3.14064

The magic is all done in the #pragma line. The schedule(static,chunk) part means that the loop iterations are divided into chunks of size 100, and the chunks are assigned to the threads in round-robin fashion. The reduction(+:result) part gives each thread its own private copy of result and adds those copies into the shared result at the end of the loop.
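
To see what reduction saves you from writing, here is a rough sketch (an illustration, not the recommended style) of the same sum done by hand: each thread accumulates a private partial sum, then adds it to the shared total inside a critical section so that only one thread updates it at a time.

#include <omp.h>
#include <iostream>

int main() {
  const int n = 1000;
  float a[n], b[n];
  for (int i = 0; i < n; i++)
    a[i] = b[i] = 1.0/(i+1);

  double result = 0.0;                 // shared between the threads
#pragma omp parallel
  {
    double partial = 0.0;              // private: one per thread
#pragma omp for
    for (int i = 0; i < n; i++)
      partial += a[i] * b[i];
#pragma omp critical
    result += partial;                 // combined one thread at a time
  }
  std::cout << "Final result=" << result << std::endl;
  return 0;
}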

There are many variants of these pragmas, but I hope this is enough to give you a flavour of OpenMP programming.

MPI (Message Passing Interface)

MPI programs are parallel from the start. The sub-processes are assigned at runtime to machines/CPUs by the agent that starts the MPI program, normally called mpirun or mpiexec. At CUED you'll need to install MPI yourself.

Installing MPI

The following, a slight adaptation of the instructions on MPI in thirty minutes, works at CUED, though other sources of MPI exist. These instructions install into a folder called MPI in your home directory.

  cd ~
  mkdir MPI
  cd MPI
  wget http://www.open-mpi.org/software/ompi/v1.4/downloads/openmpi-1.4.5.tar.bz2
  tar -xjf openmpi-1.4.5.tar.bz2
  cd openmpi-1.4.5
  ./configure --prefix=$HOME/MPI --enable-mpirun-prefix-by-default --enable-static
  make
  make install

The tar command will take a minute or two. The configure step takes a little longer and is more chatty. The make command takes 10 minutes or so. Once you've done make install you'll have include files in $HOME/MPI/include and libraries in $HOME/MPI/lib. In $HOME/MPI/bin you'll find an mpic++ program to use for compiling, and an mpirun program for running your MPI-based programs.

MPI programming

Put the following into mpi1.cc

#include <cstdio>
#include <mpi.h>
using namespace std;

int main(int argc, char *argv[]) {
  int numprocs, rank, namelen;
  char processor_name[MPI_MAX_PROCESSOR_NAME];

  MPI_Init(&argc, &argv);
  MPI_Comm_size(MPI_COMM_WORLD, &numprocs);
  MPI_Comm_rank(MPI_COMM_WORLD, &rank);
  MPI_Get_processor_name(processor_name, &namelen);

  printf("Process %d on %s out of %d\n", rank, processor_name, numprocs);

  MPI_Finalize();
}

Compile using

   $HOME/MPI/bin/mpic++ -o mpi1 mpi1.cc 

and run by doing

   $HOME/MPI/bin/mpirun -np 4 ./mpi1

You should get something like

Process 3 on khopesh out of 4
Process 1 on khopesh out of 4
Process 0 on khopesh out of 4
Process 2 on khopesh out of 4

(and maybe some warnings too, with instructions on how to suppress them). Again, the order of the processes can vary (and your machine might not be called khopesh). There are 4 processes because -np 4 asked for 4 of them. If you have a text file (called names, say) containing the names of computers that you can slogin to without needing to provide a password, then

   $HOME/MPI/bin/mpirun -hostfile names -np 4 ./mpi1

will spread the processes across those machines.
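
The hostfile is just a plain text file with one machine name per line. A hypothetical example (the machine names and slot counts here are placeholders, not real CUED hosts):

  # hypothetical hostfile called names
  tw900 slots=2
  tw901 slots=2

The optional slots=N tells mpirun how many processes it may place on that machine.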

We'll now try to calculate pi in the same way as before. Put the following into mpi2.cc

#include <cstdio>
#include <cmath>
#include <cstdlib>
#include <mpi.h>
#include <unistd.h>
#include <string.h>
using namespace std;

int main(int argc, char **argv)
  {
    int i,n=1000, nthreads;
    double p_sum,sum;
    char *cpu_name; 

    int loop_min, loop_max, tid;

    MPI_Init(&argc,&argv);	
    // get the thread ID
    MPI_Comm_rank(MPI_COMM_WORLD, &tid);  
    // get the number of threads 
    MPI_Comm_size(MPI_COMM_WORLD, &nthreads);

    n=1000;
    sum	= 0.0;
    p_sum = 0.0;
    cpu_name = (char *)calloc(80,sizeof(char));
    gethostname(cpu_name,80);
    printf("thread %d running on machine = %s\n",tid,cpu_name);
   
    // divide the range 1..n into one contiguous block per process
    loop_min = 1 + (int)((long)(tid + 0) * (long)(n)/(long)nthreads);
    loop_max =     (int)((long)(tid + 1) * (long)(n)/(long)nthreads);
    printf("thread %d loop_min=%i loop_max=%i\n",tid,loop_min, loop_max);

    // sum this process's share of the series 1/i^2 (<= so that loop_max is included)
    for(i=loop_min;i<=loop_max;i++)
      p_sum += 1.0/(i*i);
    printf("thread %d partial sum=%f\n", tid,p_sum);
    
    MPI_Reduce(&p_sum,&sum,1,MPI_DOUBLE,MPI_SUM,0,MPI_COMM_WORLD);   

    if (tid == 0)
       printf("sum = %f so pi = %f\n",sum,sqrt(6*sum));
 
    MPI_Finalize();
  }

Compile using

   $HOME/MPI/bin/mpic++ -o mpi2 mpi2.cc 

and run by doing

   $HOME/MPI/bin/mpirun -np 4 ./mpi2

You should get something like

thread 2 running on machine = khopesh
thread 2 loop_min=501 loop_max=750
thread 2 partial sum=0.000664
thread 3 running on machine = khopesh
thread 3 loop_min=751 loop_max=1000
thread 3 partial sum=0.000332
thread 0 running on machine = khopesh
thread 0 loop_min=1 loop_max=250
thread 0 partial sum=1.640926
sum = 1.643912 so pi = 3.140616
thread 1 running on machine = khopesh
thread 1 loop_min=251 loop_max=500
thread 1 partial sum=0.001990

Again the programmer has to decide how to split the task up, and again the sub-totals are reduced to a single value. The

    MPI_Reduce(&p_sum,&sum,1,MPI_DOUBLE,MPI_SUM,0,MPI_COMM_WORLD);   

call sums (hence MPI_SUM) the doubles (hence MPI_DOUBLE) pointed to by &p_sum, taking one value from each process, and puts the answer where &sum points. The 0 is the root argument: the combined result is delivered to process 0 (the master), which is why only thread 0 prints it.
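
For reference, here is the same call again with each argument annotated (the argument roles follow the MPI standard):

    MPI_Reduce(&p_sum,          // send buffer: this process's partial sum
               &sum,            // receive buffer: where the combined result lands (only meaningful on the root)
               1,               // count: one element per process
               MPI_DOUBLE,      // datatype of that element
               MPI_SUM,         // operation used to combine the values
               0,               // root: the rank that receives the result
               MPI_COMM_WORLD); // communicator: all the processes started by mpirun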

See Also