Department of Engineering

IT Services

Faster C++

This document starts by giving some background information before mentioning some methods that can deliver big (order of magnitude) improvements. Then small-scale efficiency tricks will be covered.

How computers work

Memory

Memory matters to the speed of a program. Improvements in memory speed haven't kept pace with CPU improvements - the fastest CPU operation can take the time light travels a foot. Nowadays it may be faster to do a calculation again than to store the answer for later use.

There's a memory hierarchy -

  • CPU, registers
  • caches (multilevel)
  • RAM, swap
  • Disk, mass storage

The smallest, fastest memory is in the CPU. Much more memory is on the disc (mass storage), but it's slower. One experiment determined that the CPU is 8* faster than the L2 cache which is 256* faster than RAM which is 80* faster than Disc, but every machine's different.

Best case? - keep the CPU busy by using fast memory, where recently used values, or memory adjacent to those values, might be stored.

Worst case? - keep jumping around in memory.

CPU

Nowadays there are "multi-core" CPUs, and computers can have more than one CPU (our Linux Servers each have 4 quad-core CPUs). Unless you're a low-level programmer though, you won't be able to take explicit advantage of this potential for parallel programming. See the Parallel Programming page if you're interested.

screendumpOn the teaching system you can get a rough idea of a programs' CPU and memory usage by running the Computer panel's system monitor. Here's the CPU part of the monitor's output for a DPO machine

Especially for specialised number-crunching areas, power can be obtained fairly cheaply. Some groups in the department use big clusters of cheap machines to run programs, but unless you belong to those groups you won't be able to use these machines. Alternatively, in the Guardian (Feb 2008) an academic said he could run a simulation 130 times faster on a PlayStation 3 (which has several CPUs/accelerators) than on his desktop PC. Some CUED people use CUDA (a library that uses an nVidia graphics board) to get 10x speed-up. C++ AMP (Accelerated Massive Parallelism) is a set of language extensions to standard C++ to take advantage of GPUs. They excel at highly parallel tasks (without much thread divergence) that emphasize single-precision floating point calculations over memory access and work best when threads all access the same memory locations or there's a regular small spacing between memory locations accessed by different threads.

Basic principles

  • A very common piece of advice is "Delay optimization as much as possible, and don't do it if you can avoid it.", Donald Knuth said, "premature optimization is the root of all evil."
  • A very common piece of advice is "Measure to find out where the bottleneck is, and optimise that."
  • Select appropriate data structures.
  • Pick sensible algorithms - if you're sorting or searching, don't just use the first routine you come across.
  • I/O is slow

Profiling

Don't assume that calculations with floats are faster than calculations with doubles - or even with ints. Don't even assume that it's faster to store a value for future use than to recalculate it. Profile your code in as realistic conditions as possible.

With g++ you'll need to compile using the "+g" option, run your program (called snail for example), then run prof snail. You'll get output looking a little like this

 %Time Seconds Cumsecs  #Calls   msec/call  Name

  32.4    0.35    0.35                      __strlen20
  23.1    0.25    0.60 5770000        0.00  _strlen
  20.4    0.22    0.82   10000        0.02  slow
  11.1    0.12    0.94   10000        0.01  fast
  10.2    0.11    1.05                      _mcount
   1.4    0.01    1.06                      _monstartup

Containers

They make life easy, offering more than one way of doing something, but if used thoughtlessy, the resulting code will be slow.

  • If a container provides a member function that duplicates a generic STL algorithm it is likely that the member function is more efficient. In the following, the 1st loop takes 50% longer when I run it.
    #include <iostream>
    #include <string>
    #include <sys/time.h>
    using namespace std;
    
    long microseconds() {
      struct timeval tv;
      gettimeofday(&tv,0);
      return  1000000*tv.tv_sec + tv.tv_usec;
    }
    
    int main() {
      string a="hello";
      string b="goodbye";
    
      long t1=microseconds();
      for (int i=0;i<10000;i++) {
        swap(a,b);
      }
    
      cout << microseconds() - t1 << endl;
    
      t1=microseconds();
      for (int i=0;i<10000;i++) {
        a.swap(b);
      }
      cout << microseconds() - t1 << endl;
    }
    

    However, the 2nd loop in the next example takes twice as long as the 1st loop.

    #include <iostream>
    #include <string>
    #include <set>
    #include <algorithm>
    #include <sys/time.h>
    using namespace std;
    
    long microseconds() {
      struct timeval tv;
      gettimeofday(&tv,0);
      return  1000000*tv.tv_sec + tv.tv_usec;
    }
    
    int main() {
    string A[10000];
     for (int  i=0;i<10000;i++) {
       A[i]="hello";
     }
     set<string> setA(A, A+10000); 
     set<string>::iterator it1;
    
      long t1=microseconds();
      for (int i=0;i<10000;i++) {
          it1=find(setA.begin(), setA.end(),"goodbye");
      }
      cout << microseconds() - t1 << endl;
    
      t1=microseconds();
      for (int i=0;i<10000;i++) {
           it1=setA.find("goodbye");
      }
      cout << microseconds() - t1 << endl;
    }
    

Note that container-based code responds well to optimisation.

Compiler options

It's important to tell the compiler as much as you can about your code so that it can optimise well. Use `const' keywords whenever you can. See man pages for compiler directives. In g++ there are over 200 of them - e.g. -fbranch-probabilities, -fexpensive-optimizations -ffast-math, -fno-trapping-math, -funsafe-math-optimizations, -freorder-functions, -funroll-all-loops. Using +O4 does aggressive, time-consuming optimisation.

New compiler releases can have greatly improved optimisers. The simpler the code, the better chance the compiler has of optimising it - which is why it doesn't alway pay off to try the clever tricks later in this document. The table in the Nasty tricks shows what speed-ups can be achieved. Order of magnitude speed-ups are possible.

Cache-friendliness

Try this code. The second loop is 50% slower than the first on a Mac and Dell that I tried.

#include <iostream>
#include <vector>
#include <sys/time.h>
using namespace std;

// This function returns microseconds elapsed since the dawn of computing
// (1st, Jan, 1970). Don't worry about how it works
long microseconds() {
  struct timeval tv;
  gettimeofday(&tv,0);
  return  1000000*tv.tv_sec + tv.tv_usec;
}

  int a[1000][1000];
  int b[1000][1000];
  int c[1000][1000];
  int d[1000][1000];

int main(){
  int tmp;
  long t1=microseconds();
  for (int i=0;i<1000;i++)
    for (int j=0;j<1000;j++)
      tmp=a[i][j]*b[i][j];
  cout << microseconds()-t1 << endl;

  t1=microseconds();
  for (int i=0;i<1000;i++)
    for (int j=0;j<1000;j++)
      tmp=c[j][i]*d[j][i];
  cout << microseconds()-t1 << endl;
}

The difference is that loop1 goes through the matrix row by row rather than column by column. The matrix is stored row by row, so loop1 accesses successive memory addresses. Since data transfers are done in blocks, this will involve fewer block transfers.

Function calls

Arguments

Suppose you have

int a=7;
a=triple(a);

where the triple function is

int triple(int x) {
   return 3*x;
}

When "triple" is called, the value of the "a" argument is copied to "x", then at the end of the function the return value is copied into "a". That's 2 copies that would be avoided were references used -

triple(a);
...
void fun(int& x) {
...
   x*=3;
}

In this example only an integer is copied, but if there are many big objects, the extra copying can be expensive. Using optimisation in the code below, the 2nd loop is 50 times faster than the 1st.

#include <iostream>
#include <string>
#include <sys/time.h>
using namespace std;

long microseconds() {
  struct timeval tv;
  gettimeofday(&tv,0);
  return  1000000*tv.tv_sec + tv.tv_usec;
}

string valuecrossout(string s) {
  s[0]='x';
  s[1]='x';
  return s;
}

void refcrossout(string& s) {
  s[0]='x';
  s[1]='x';
}

int main() {
  string a="hello";

  long t1=microseconds();
  for (int i=0;i<10000;i++) {
   a= valuecrossout(a);
  }
  cout << microseconds() - t1 << endl;

  t1=microseconds();
  for (int i=0;i<10000;i++) {
     refcrossout(a);
  }
  cout << microseconds() - t1 << endl;
}

Virtual functions

These are slower to call though may lead to neater code.

Coding

Once you get down to the level of code, there are trade-offs - Speed/Size/Readability

  • Loops - Deeply nested loops merit particular attention.
    • Do nothing unnecessary in the loop.
    • Keep loops small (some machines have an `instruction cache' (the DPO machines have 16k of data cache and 12k of execution cache (L1) per core) so that recent instructions can be quickly recalled. If a loop fits inside the cache, you'll benefit.
    • as a condition compare with 0.
    • 'unroll' them (i.e. use a list instead of a loop)?
  • exceptions and virtuals are slower. Adding "throw()" just before the body of a function tells the compiler that the function won't use exceptions, making the code perhaps 1% faster.
  • Create variables as late as possible and initialise them at creation time - this
    int max=0;
    if (x==7) {
      Person p(23);
      ...
      max=p.age;
    }
    
    is preferable to this
    Person p;
    int max;
    max=0;
    
    if (x==7) {
      p.age=23;
      ...
      max=p.age;
    }
    
  • Passing parameters - If you're passing very large items by value, memory requirements (and hence time requirements) increase. It might be helpful to make very big data-structures global.
  • Avoid recursion.
  • Arithmetic - See If you're doing lots of calculations you'll benefit from reading about Numerical Analysis. It shows you how to solve problems as quickly and accurately as possible - e.g. adding the same numbers in a different order may give a more accurate answer - (x+y) + z doesn't necessarily equal x + (y+z). For speed
    • Use look-up tables instead of function calls for sin(x), etc.
    • Use bit-shift rather than + or - rather than * or /.
    Don't assume that adding is faster than multiplying. With this code -
    #include <iostream>
    #include <sys/time.h>
    using namespace std;
    
    // This function returns microseconds elapsed since the dawn of computing
    // (1st, Jan, 1970). Don't worry about how it works
    long microseconds() {
      struct timeval tv;
      gettimeofday(&tv,0);
      return  1000000*tv.tv_sec + tv.tv_usec;
    }
    
    int a[10000][10000];
    int b[10000][10000];
    
    int main(){
      for (int i=0;i<10000;i++)
        for (int j=0;j<10000;j++)
          a[i][j]=b[i][j]=i+j*i;
    
      int tmp;
      long t1=microseconds();
      for (int i=0;i<10000;i++)
        for (int j=0;j<10000;j++)
          tmp=a[i][j]*b[i][j];
      cout << "Multiplication "<< microseconds()-t1 << endl;
    
      t1=microseconds();
      for (int i=0;i<10000;i++)
        for (int j=0;j<10000;j++)
          tmp=a[i][j]+b[i][j];
      cout <<"Adding "<< microseconds()-t1 << endl;
    
    
     t1=microseconds();
      for (int i=0;i<10000;i++)
        for (int j=0;j<10000;j++)
          tmp=a[i][j]*b[i][j];
      cout  << "Multiplication "<< microseconds()-t1 << endl;
      t1=microseconds();
      for (int i=0;i<10000;i++)
        for (int j=0;j<10000;j++)
          tmp=a[i][j]+b[i][j];
      cout <<"Adding "<< microseconds()-t1 << endl;
    }
    
    I got
    Multiplication 258647
    Adding 269356
    Multiplication 258875
    Adding 269829
    
    without optimising. With optimisation I got
    Multiplication 30606
    Adding 31062
    Multiplication 30756
    Adding 30799
    
    Using floats instead of ints made no difference.

    With this code

    #include <iostream>
    #include <sys/time.h>
    using namespace std;
    
    long microseconds();
    
    long microseconds() {
      struct timeval tv;
      gettimeofday(&tv,0);
      return  1000000*tv.tv_sec + tv.tv_usec;
    };
    
    long time1;
    
    // Results acculumated and printed to ensure that the compiler won't 
    // optimise away unnecessary calculations 
    
    int main() {
     int i1=3,i2=4,i3;
     float f1=3,f2=4,f3;
     time1=microseconds();
     for (int i=0;i<10000;i++)
       for(int j=0;j<10000;j++)
         i3+=i1+i2;
     cout << "Time to add ints:" << microseconds() - time1 << ". i3="<< i3 << endl;
    
     time1=microseconds();
     for (int i=0;i<10000;i++)
       for(int j=0;j<10000;j++)
         f3+=f1*f2;
     cout << "Time to multiply floats:" << microseconds() - time1 << ". f3="<< f3  << endl;
    }
    

    I got this with no optimisation

    Time to add ints:225330. i3=700000000
    Time to multiply floats:302071. f3=2.68435e+08
    

    , this using "-O"

    Time to add ints:48830. i3=700000000
    Time to multiply floats:312950. f3=2.68435e+08
    

    and this using "-O"

    Time to add ints:0. i3=700000000
    Time to multiply floats:37905. f3=1.07374e+09
    
  • Malloc
    • If you're using malloc a lot, look up the options to tune malloc.
  • Reduce `paging' by keeping program small or at least stop it nonsequentially accessing memory.

Nasty Tricks

See C++: Nasty Tricks if you're desperate.

See Also