Department of Engineering

IT Services

C++: Nasty Tricks

You may be prepared to speed up your code at the expense of readability. Below are some examples of such code. The speed or memory benefits vary, as does the legibility of the resulting code. In some cases the tricks are slower than the standard methods, so beware!

Bits and Bytes

Rather than use 32 ints to store 32 yes/no values, you can use one 32-bit int, using each bit as a flag. The following code (which has no error checking!) shows how to do this using a new C++ facility (bitset) or an older C method that also works in C++. These methods should produce faster and smaller programs without damaging the legibility, so they're not really nasty. The bit-manipulation methods will be useful later on.

#include <iostream>
#include <bitset>

using namespace std;

typedef unsigned int uint; // uint isn't standard C++, so define it here

void set_a_bit(uint whichbit, uint& variable, int value) {
  if (value!=0 and value!=1)
    return;
  uint mask=1u<<whichbit;
  if (value==1)
    variable|=mask;
  else
    variable&=~mask;
}

bool get_a_bit(uint whichbit, uint variable) {
  uint mask=1u<<whichbit;
  uint v= variable & mask;
  return (v!=0);
}

int main() {
  const unsigned int width = 32;
  bitset<width> b2 = 1023; // here's a C++ method. See the bitset documentation for details
  cout << b2.count() << " bits set in b2" << endl;
  uint foo=2; // here's a C way of accessing bits
  cout << "Bit 0 in foo is " << get_a_bit(0,foo) << endl;
  cout << "Bit 1 in foo is " << get_a_bit(1,foo) << endl;
  cout << "Bit 2 in foo is " << get_a_bit(2,foo) << endl;
  set_a_bit(3,foo,1);
  cout << "foo is now " << foo << endl;
  return 0;
}

Floating point representation

IEEE-754 conformance for floating point arithmetic is very common nowadays. Knowing how floats are stored can help when trying to speed up code, but you'll need to be aware of some special bit-patterns. Here's a program using some of them. The union is used so that ifval.i and ifval.f occupy the same 4 bytes; changing ifval.i will change ifval.f. "0x" is the prefix for hexadecimal numbers.

#include <iostream>

using namespace std;
typedef union {float f; int i; } IntOrFloat;
IntOrFloat ifval;
int main(){

   // Intels slower than AMD when NaN/Inf used. Maybe 900x slower
   // floats and doubles about the same
   cout << "*** Checking bit patterns " << endl;
   ifval.i=0xFFFFFFFF;
   cout << "0xFFFFFFFF=" <<  ifval.f << " (should be neg NaN)" << endl;
   ifval.i=0xFF800002;
   cout << "0xFF800002=" <<  ifval.f<< " (should be neg NaN)"  << endl;
   ifval.i=0xFF800000;
   cout << "0xFF800000=" <<  ifval.f << endl;
   ifval.i=0x80000000;
   cout << "0x80000000=" <<  ifval.f << endl;
   ifval.i=0x00000000;
   cout << "0x00000000=" <<  ifval.f << endl;
   ifval.i=0x7F800000;
   cout << "0x7F800000=" <<  ifval.f << endl;
   ifval.i=0x7F800001;
   cout << "0x7F800001=" <<  ifval.f << endl;
}

These special values are dealt with inefficiently by some chips - Intel processors can be 900x slower than AMD when they're used. Not all the tricks below cope with these special values. Beware!

Macros and inline functions

Function calls involve overheads. C and C++ let you avoid some of them with macros. The following program uses two macros: FI reinterprets a float's bit-representation as an integer; IsZero checks whether all the bits are 0, ignoring the sign bit.

int main(){
#define FI(f) (*((int*)&(f))) 
#define IsZero(f) ((FI(f)<<1)==0)

float f=5.7;
bool b=IsZero(f);
}

The bool b=IsZero(f) line looks like a function call, but the pre-processor uses the macro definitions to convert it to the following before the code reaches the real compiler:

    bool b=(((*((int*)&(f)))<<1)==0);

C++ has an inline keyword. If you use it before a function definition the function may be treated like a macro - i.e. its code will be replicated each time the "function" is called, making the code bigger even if it's no faster. Note that inline is only a hint: the compiler is free to ignore it.

Comparisons

Because of the floating point representation, comparisons with 0 or 1 can be dealt with as a special case. The following code shows 2 routines that, given a number, return the nearest number in the range [0, 1]. The fastwindow01 function manipulates the underlying bits of the number.

#include <iostream>

using namespace std;

#define SIGNMASK(i) (~((((unsigned int)(i))>>31)-1))
typedef union {float f; int i; } IntOrFloat;
IntOrFloat ifval;

float slowwindow01(float f) {
  if(f<0)
    return 0;
  if (f>1)
    return 1;
  return f;
}

float fastwindow01(float f) {
  int ival;
  ifval.f=f;
  //Min 0
  ival=SIGNMASK(ifval.i);
  ifval.i &= ~ival;

  //Max 1
  ifval.f=1.0f - ifval.f;
  ival=SIGNMASK(ifval.i);
  ifval.i &= ~ival;
  ifval.f = 1.0f - ifval.f;
  return ifval.f;
}

int main() {
  float fval=3.14159;
  cout << "Pi windowed to 0,1 range=" <<  fastwindow01(fval) << endl;

  fval=-0.78;
  cout << "-0.78 windowed to 0,1 range=" <<  fastwindow01(fval) << endl;  
}

Speed at the expense of accuracy

You may not require very precise answers. The following routine produces square-root results that are at most a factor of 3/(2*sqrt(2)) from the right answer.

#include <iostream>
#include <cmath>

using namespace std;

float fastsqrt(float fval) {
   // Fast square root
   int ival;
   ival=(*(int*)&fval);
   ival&=0x7FFFFFFF; // fixes sqrt(-0) bug
   ival-=0x3f800000; // subtract the exponent bias (127<<23)
   ival>>=1; // requires signed shift to preserve sign (exponent/2)
   ival += 0x3f800000; // rebias new exponent
   fval=(*(float*)&ival);
   return fval;
}

int main() {
   cout << "The square root of 3.14159 is " << fastsqrt(3.14159)
   << " or " << sqrt(3.14159) << endl;
}
 

A similar sacrifice is made in the well-known Fast Inverse Square Root trick and in the code below. The average error for the EXP macro below is about 1.5%.

#include <cmath>
#include <iostream>
using namespace std;

static union 
{
  double d;
  struct {
#ifdef LITTLE_ENDIAN
    int j,i;
#else 
    int i,j;
#endif
  } n;
} _eco;

#define EXP_A (1048576/0.69314718055994530942)
#define EXP_C 60801
#define EXP(y) (_eco.n.i = EXP_A*(y) + (1072693248 - EXP_C), _eco.d)

int main() {
   cout << "The exponential of 3.14159 is " << EXP(3.14159) << " or " << exp(3.14159) << endl;
}

The code in the next section also has an approximate way of testing to within a tolerance the equality of 2 floats.

Timing

The code below uses all the tricks above and times them against standard methods.

#include <iostream>
#include <cstdlib>
#include <sys/times.h>
#include <cmath>

using namespace std;

tms timestats;
clock_t starttime;

static union 
{
  double d;
  struct {
#ifdef LITTLE_ENDIAN
    int j,i;
#else 
    int i,j;
#endif
  } n;
} _eco;

#define EXP_A (1048576/0.69314718055994530942)
#define EXP_C 60801
#define EXP(y) (_eco.n.i = EXP_A*(y) + (1072693248 - EXP_C), _eco.d)

#define SIGNMASK(i) (~((((unsigned int)(i))>>31)-1))

#define FI(f) (*((int*)&(f)))
#define FU(f) (*((unsigned int*)&(f)))

#define LessThanZero(f) (FU(f) > 0x80000000U)
#define IsZero(f) ((FI(f)<<1)==0)

typedef union {float f; int i; } IntOrFloat;
IntOrFloat ifval;

float farray[1000000];

int ival;
float fval;

float slowwindow01(float f) {
if(f<0)
  return 0;
if (f>1)
  return 1;
return f;
}

float fastwindow01(float f) {
  int ival;
  ifval.f=f;
  //Min 0
  ival=SIGNMASK(ifval.i);
  ifval.i &= ~ival;

  //Max 1
  ifval.f=1.0f - ifval.f;
  ival=SIGNMASK(ifval.i);
  ifval.i &= ~ival;
  ifval.f = 1.0f - ifval.f;
  return ifval.f;
}

float fastsqrt(float fval) {
   // Fast square root
   int ival;
   ival=(*(int*)&fval);
   ival&=0x7FFFFFFF; // fixes sqrt(-0) bug
   ival-=0x3f800000; // subtract the exponent bias (127<<23)
   ival>>=1; // requires signed shift to preserve sign (exponent/2)
   ival += 0x3f800000; // rebias new exponent
   fval=(*(float*)&ival);
   return fval;
}

bool LomontCompare(float af, float bf, int maxDiff)
{
int ai;
int bi;
 ai=(*(int*)&af);
 bi=(*(int*)&bf);
 int test = SIGNMASK(ai^bi);
 int diff = (((0x80000000 -ai)&(test)) | (ai & (~test))) -bi;
 int v1= maxDiff + diff;
 int v2= maxDiff - diff;
 return (v1|v2) >=0;
}

bool Compare(float af, float bf, float maxDiff)
{
  float diff;
  diff=af-bf;

  // true if the difference is within maxDiff percent of af
  return fabs(diff) < fabs(maxDiff*af/100.0);
}

int main() {
  fval=3.14159;
  for (unsigned long i=0;i<1000000;i++)
    farray[i]=drand48();
  // The next line of code is undefined behaviour in C99
  // You could instead use memcpy(&ival,&fval,sizeof(int));
  ival=(*(int*)&fval);

  cout << boolalpha; 

  cout << "*** Checking bit patterns " << endl;
  ifval.i=0xFFFFFFFF;
  cout << "0xFFFFFFFF=" <<  ifval.f << " (should be neg NaN)" << endl;
  ifval.i=0xFF800002;
  cout << "0xFF800002=" <<  ifval.f<< " (should be neg NaN)"  << endl;
  ifval.i=0xFF800000;
  cout << "0xFF800000=" <<  ifval.f << endl;
  ifval.i=0x80000000;
  cout << "0x80000000=" <<  ifval.f << endl;
  ifval.i=0x00000000;
  cout << "0x00000000=" <<  ifval.f << endl;
  ifval.i=0x7F800000;
  cout << "0x7F800000=" <<  ifval.f << endl;
  ifval.i=0x7F800001;
  cout << "0x7F800001=" <<  ifval.f << endl;

  cout << endl << "*** Fast comparisons" <<  endl;
  cout << fval << "<0 is " <<  LessThanZero(fval) << endl;
  fval=-fval;
  cout << fval << "<0 is " <<  LessThanZero(fval) << endl;
  cout << fval << "==0 is " <<  IsZero(fval) << endl;
  fval=0;
  cout << fval << "==0 is "<<  IsZero(fval) << endl;

  fval=3.14159;
  cout << "*** Windowing " << endl;
  cout << "Pi windowed to 0,1 range=" <<  fastwindow01(fval) << endl;

  fval=-0.78;
  cout << "-0.78 windowed to 0,1 range=" <<  fastwindow01(fval) << endl;  

  cout << "\n*** Comparing with a tolerance (using the same tolerance each time)" << endl;
  cout << "Comparing 7.111 and 7.112 - ";
  cout << LomontCompare(7.111, 7.112, 1000) << endl;
  cout << "Comparing 7.1111 and 7.1112 - ";
  cout << LomontCompare(7.1111, 7.1112, 1000) << endl;
  cout << "Comparing 7111 and 7112 - ";
  cout << LomontCompare(7111, 7112, 1000) << endl;
  cout << "Comparing 71111 and 71112 - ";
  cout << LomontCompare(71111, 71112, 1000) << endl;

  cout << "\n*** Long experiments" << endl;
 
 starttime=times(&timestats);
 for (unsigned long i=0;i<1000000;i++)
     for (unsigned long j=0;j<100;j++)
        fval=farray[i]<0;
 cout << "Slow comparison time=" << times(&timestats) - starttime << endl;

 starttime=times(&timestats);
 for (unsigned long i=0;i<1000000;i++)
     for (unsigned long j=0;j<100;j++)
        fval=LessThanZero(farray[i]);
 cout << "Fast comparison time=" << times(&timestats) - starttime << endl;

 starttime=times(&timestats);
 for (unsigned long i=0;i<1000000;i++)
     for (unsigned long j=0;j<100;j++)
        fval=sqrt(farray[i]);
 cout << "\nSlow sqrt time=" << times(&timestats) - starttime << endl;

 starttime=times(&timestats);
 for (unsigned long i=0;i<1000000;i++)
     for (unsigned long j=0;j<100;j++)
        fval=fastsqrt(farray[i]);
 cout << "Fast sqrt time=" << times(&timestats) - starttime << endl;
 
 starttime=times(&timestats);
 for (unsigned long i=0;i<1000000;i++)
     for (unsigned long j=0;j<100;j++)
        fval=exp(farray[i]);
 cout << "\nSlow exp time=" << times(&timestats) - starttime << endl;

 starttime=times(&timestats);
 for (unsigned long i=0;i<1000000;i++)
     for (unsigned long j=0;j<100;j++)
        fval=EXP(farray[i]);
 cout << "Fast exp time=" << times(&timestats) - starttime << endl;
 
starttime=times(&timestats);
 for (unsigned long i=0;i<1000000;i++)
     for (unsigned long j=0;j<100;j++)
       fval=Compare(7.111, 7.112,0.01);
 cout << "\nSlow compare time=" << times(&timestats) - starttime << endl;

 starttime=times(&timestats);
 for (unsigned long i=0;i<1000000;i++)
     for (unsigned long j=0;j<100;j++)
       fval=LomontCompare(7.111, 7.112,1000);
 cout << "Fast compare time=" << times(&timestats) - starttime << endl;

 starttime=times(&timestats);
 for (unsigned long i=0;i<1000000;i++)
     for (unsigned long j=0;j<100;j++)
        fval=slowwindow01(farray[i]);
 cout << "\nSlow windowing time=" << times(&timestats) - starttime << endl;

 starttime=times(&timestats);
 for (unsigned long i=0;i<1000000;i++)
     for (unsigned long j=0;j<100;j++)
        fval=fastwindow01(farray[i]);
 cout << "Fast windowing time=" << times(&timestats) - starttime << endl;
}

On a Pentium D machine using g++ without optimising, these were the results for the long experiments

*** Long experiments
Slow comparison time=37
Fast comparison time=38

Slow sqrt time=153
Fast sqrt time=160

Slow exp time=427
Fast exp time=346

Slow compare time=301
Fast compare time=169

Slow windowing time=97
Fast windowing time=226

Using g++ -O these were the results for the long experiments

*** Long experiments
Slow comparison time=24
Fast comparison time=14

Slow sqrt time=118
Fast sqrt time=50

Slow exp time=306
Fast exp time=286

Slow compare time=7
Fast compare time=123

Slow windowing time=54
Fast windowing time=109

Note in particular the change in speed of the "slow compare" test. With g++ -O4 these were the results for the long experiments

*** Long experiments
Slow comparison time=7
Fast comparison time=6

Slow sqrt time=118
Fast sqrt time=13

Slow exp time=308
Fast exp time=244

Slow compare time=7
Fast compare time=15

Slow windowing time=7
Fast windowing time=9

Now the "fast compare" test has been accelerated and the windowing routines are much faster too.

Conclusions

The "fast" versions were written on the assumption that floating point arithmetic is much slower than integer arithmetic. This is less true than it used to be. Note that

  • Optimising has had very different effects on the slow and fast versions of the routines.
    Opt     Sqrt          Compare       Windowing
            Slow   Fast   Slow   Fast   Slow   Fast
    None    153    160    301    169    97     226
    -O      118    50     7      123    54     109
    -O4     118    13     7      15     7      9
  • Aggressive optimisation might eliminate code that seems to be doing nothing useful, so it's likely that some of the g++ -O4 results are bogus
  • Neither the functions' code nor the test code is ideal.

The moral is to time the code using as realistic a setting as possible.

References