C++: Nasty Tricks
You may be prepared to try to speed up your code at the expense of readability. Below are some examples of such code. The speed or memory benefits vary, as does the legibility of the resulting code. In some cases the tricks are slower than the standard methods, so beware!
Bits and Bytes
Rather than use 32 ints to store 32 yes/no values, you can use one 32-bit int, using each bit as a flag. The following code (which has no error checking!) shows how to do this using a new C++ facility (bitset) or an older C method that also works in C++. These methods should produce faster and smaller programs without damaging the legibility, so they're not really nasty. The bit-manipulation methods will be useful later on.
Floating point representation
IEEE-754 conformance for floating point arithmetic is very common nowadays. Knowing how floats are stored can help when trying to speed up code, but you'll need to be aware of some special bit-patterns. Here's a program using some of them. The union is used so that ifval.i and ifval.f occupy the same 4 bytes; changing ifval.i will change ifval.f. "0x" is the prefix for hexadecimal numbers.
#include <iostream> using namespace std; typedef union {float f; int i; } IntOrFloat; IntOrFloat ifval; int main(){ // Intels slower than AMD when NaN/Inf used. Maybe 900x slower // floats and doubles about the same cout << "*** Checking bit patterns " << endl; ifval.i=0xFFFFFFFF; cout << "0xFFFFFFFF=" << ifval.f << " (should be neg NaN)" << endl; ifval.i=0xFF800002; cout << "0xFF800002=" << ifval.f<< " (should be neg NaN)" << endl; ifval.i=0xFF800000; cout << "0xFF800000=" << ifval.f << endl; ifval.i=0x80000000; cout << "0x80000000=" << ifval.f << endl; ifval.i=0x00000000; cout << "0x00000000=" << ifval.f << endl; ifval.i=0x7F800000; cout << "0x7F800000=" << ifval.f << endl; ifval.i=0x7F800001; cout << "0x7F800001=" << ifval.f << endl; }
These special values are dealt with inefficiently by some chips - Intels can be 900x slower than AMD when they're used. These special values are not dealt with in all the tricks below. Beware!
Macros and inline functions
Function-calling involves overheads. C and C++ let you use macros. The following program uses 2 macros. FI converts a float's bit-representation to an integer; IsZero checks whether all the bits are 0, ignoring the sign bit.
int main(){ #define FI(f) (*((int*)&(f))) #define IsZero(f) ((FI(f)<<1)==0) float f=5.7; bool b=IsZero(f); }
The bool b=IsZero(f) line looks like a function call but the pre-processor converts it using the macro definition to the following before the code reaches the real compiler
bool b=(((*((int*)&(f)))<<1)==0);
C++ has an inline keyword. If you use it before a function definition the function is treated like a macro - i.e. its code will be replicated each time the "function" is called, making the code bigger even if it's not faster.
Comparisons
Because of the floating point representation, comparisons with 0 or 1 can be dealt with as a special case. The following code shows 2 routines that given a number return the nearest number that's in the range (0 ... 1). The fastwindow01 function manipulates the underlying bits of the number.
#include <iostream> using namespace std; #define SIGNMASK(i) (~((((unsigned int)(i))>>31)-1)) typedef union {float f; int i; } IntOrFloat; IntOrFloat ifval; float slowwindow01(float f) { if(f<0) return 0; if (f>1) return 1; return f; } float fastwindow01(float f) { int ival; ifval.f=f; //Min 0 ival=SIGNMASK(ifval.i); ifval.i &= ~ival; //Max 1 ifval.f=1.0f - ifval.f; ival=SIGNMASK(ifval.i); ifval.i &= ~ival; ifval.f = 1.0f - ifval.f; return ifval.f; } int main() { float fval=3.14159; cout << "Pi windowed to 0,1 range=" << fastwindow01(fval) << endl; fval=-0.78; cout << "-0.78 windowed to 0,1 range=" << fastwindow01(fval) << endl; }
Speed at the expense of accuracy
You may not require very precise answers. The following routine produces square-root results that are at most a factor of 3/(2*sqrt(2)) from the right answer.
#include <iostream> #include <cmath> using namespace std; float fastsqrt(float fval) { // Fast square root int ival; ival=(*(int*)&fval); ival&=0x7FFFFFFF; // fixes sqrt(-0) bug ival-=0x3f800000; // subtract 127 ival>>=1; // requires signed shift to preserve sign (exponent/2) ival += 0x3f800000; // rebias new exponent fval=(*(float*)&ival); return fval; } int main() { cout << "The square root of 3.14159 is " << fastsqrt(3.14159) << " or " << sqrt(3.14159) << endl; }
A similar sacrifice is used in Fast inverse square root and the following. The average error for the exp function below is about 1.5%.
#include <cmath> #include <iostream> using namespace std; static union { double d; struct { #ifdef LITTLE_ENDIAN int j,i; #else int i,j; #endif } n; } _eco; #define EXP_A (1048576/0.69314718055994530942) #define EXP_C 60801 #define EXP(y) (_eco.n.i = EXP_A*(y) + (1072693248 - EXP_C), _eco.d) int main() { cout << "The exponential of 3.14159 is " << EXP(3.14159) << " or " << exp(3.14159) << endl; }
The code in the next section also has an approximate way of testing to within a tolerance the equality of 2 floats.
Timing
The code below uses all the tricks above and times them against standard methods.
#include <iostream> #include <cstdlib> #include <sys/times.h> #include <cmath> using namespace std; tms timestats; clock_t starttime; static union { double d; struct { #ifdef LITTLE_ENDIAN int j,i; #else int i,j; #endif } n; } _eco; #define EXP_A (1048576/0.69314718055994530942) #define EXP_C 60801 #define EXP(y) (_eco.n.i = EXP_A*(y) + (1072693248 - EXP_C), _eco.d) #define SIGNMASK(i) (~((((unsigned int)(i))>>31)-1)) #define FI(f) (*((int*)&(f))) #define FU(f) (*((unsigned int*)&(f))) #define LessThanZero(f) (FU(f) > 0x80000000U) #define IsZero(f) ((FI(f)<<1)==0) typedef union {float f; int i; } IntOrFloat; IntOrFloat ifval; float farray[1000000]; int ival; float fval; float slowwindow01(float f) { if(f<0) return 0; if (f>1) return 1; return f; } float fastwindow01(float f) { int ival; ifval.f=f; //Min 0 ival=SIGNMASK(ifval.i); ifval.i &= ~ival; //Max 1 ifval.f=1.0f - ifval.f; ival=SIGNMASK(ifval.i); ifval.i &= ~ival; ifval.f = 1.0f - ifval.f; return ifval.f; } float fastsqrt(float fval) { // Fast square root int ival; ival=(*(int*)&fval); ival&=0x7FFFFFFF; // fixes sqrt(-0) bug ival-=0x3f800000; // subtract 127 ival>>=1; // requires signed shift to preserve sign (exponent/2) ival += 0x3f800000; // rebias new exponent fval=(*(float*)&ival); return fval; } bool LomontCompare(float af, float bf, int maxDiff) { int ai; int bi; ai=(*(int*)&af); bi=(*(int*)&bf); int test = SIGNMASK(ai^bi); int diff = (((0x80000000 -ai)&(test)) | (ai & (~test))) -bi; int v1= maxDiff + diff; int v2= maxDiff - diff; return (v1|v2) >=0; } bool Compare(float af, float bf, float maxDiff) { float diff; diff=af-bf; if (fabs(maxDiff*af/100.0) > fabs(af)) return true; else return false; } int main() { fval=3.14159; for (unsigned long i=0;i<1000000;i++) farray[i]=drand48(); // The next line of code is undefined behaviour in C99 // You could instead use memcpy(&ival,&fval,sizeof(int)); ival=(*(int*)&fval); cout << boolalpha; cout << "*** Checking bit patterns " << endl; ifval.i=0xFFFFFFFF; cout << "0xFFFFFFFF=" << ifval.f << " (should be neg NaN)" << endl; ifval.i=0xFF800002; cout << "0xFF800002=" << ifval.f<< " (should be neg NaN)" << endl; ifval.i=0xFF800000; cout << "0xFF800000=" << ifval.f << endl; ifval.i=0x80000000; cout << "0x80000000=" << ifval.f << endl; ifval.i=0x00000000; cout << "0x00000000=" << ifval.f << endl; ifval.i=0x7F800000; cout << "0x7F800000=" << ifval.f << endl; ifval.i=0x7F800001; cout << "0x7F800001=" << ifval.f << endl; cout << endl << "*** Fast comparisons" << endl; cout << fval << "<0 is " << LessThanZero(fval) << endl; fval=-fval; cout << fval << "<0 is " << LessThanZero(fval) << endl; cout << fval << "==0 is " << IsZero(fval) << endl; fval=0; cout << fval << "==0 is "<< IsZero(fval) << endl; fval=3.14159; cout << "*** Windowing " << endl; cout << "Pi windowed to 0,1 range=" << fastwindow01(fval) << endl; fval=-0.78; cout << "-0.78 windowed to 0,1 range=" << fastwindow01(fval) << endl; cout << "\n*** Comparing with a tolerance (using the same tolerance each time)" << endl; cout << "Comparing 7.111 and 7.112 - "; cout << LomontCompare(7.111, 7.112, 1000) << endl; cout << "Comparing 7.1111 and 7.1112 - "; cout << LomontCompare(7.1111, 7.1112, 1000) << endl; cout << "Comparing 7111 and 7112 - "; cout << LomontCompare(7111, 7112, 1000) << endl; cout << "Comparing 71111 and 71112 - "; cout << LomontCompare(71111, 71112, 1000) << endl; cout << "\n*** Long experiments" << endl; starttime=times(×tats); for (unsigned long i=0;i<1000000;i++) for (unsigned long j=0;j<100;j++) fval=farray[i]<0; cout << "Slow comparison time=" << times(×tats) - starttime << endl; starttime=times(×tats); for (unsigned long i=0;i<1000000;i++) for (unsigned long j=0;j<100;j++) fval=LessThanZero(farray[i]); cout << "Fast comparison time=" << times(×tats) - starttime << endl; starttime=times(×tats); for (unsigned long i=0;i<1000000;i++) for (unsigned long j=0;j<100;j++) fval=sqrt(farray[i]); cout << "\nSlow sqrt time=" << times(×tats) - starttime << endl; starttime=times(×tats); for (unsigned long i=0;i<1000000;i++) for (unsigned long j=0;j<100;j++) fval=fastsqrt(farray[i]); cout << "Fast sqrt time=" << times(×tats) - starttime << endl; starttime=times(×tats); for (unsigned long i=0;i<1000000;i++) for (unsigned long j=0;j<100;j++) fval=exp(farray[i]); cout << "\nSlow exp time=" << times(×tats) - starttime << endl; starttime=times(×tats); for (unsigned long i=0;i<1000000;i++) for (unsigned long j=0;j<100;j++) fval=EXP(farray[i]); cout << "Fast exp time=" << times(×tats) - starttime << endl; starttime=times(×tats); for (unsigned long i=0;i<1000000;i++) for (unsigned long j=0;j<100;j++) fval=Compare(7.111, 7.112,0.01); cout << "\nSlow compare time=" << times(×tats) - starttime << endl; starttime=times(×tats); for (unsigned long i=0;i<1000000;i++) for (unsigned long j=0;j<100;j++) fval=LomontCompare(7.111, 7.112,1000); cout << "Fast compare time=" << times(×tats) - starttime << endl; starttime=times(×tats); for (unsigned long i=0;i<1000000;i++) for (unsigned long j=0;j<100;j++) fval=slowwindow01(farray[i]); cout << "\nSlow windowing time=" << times(×tats) - starttime << endl; starttime=times(×tats); for (unsigned long i=0;i<1000000;i++) for (unsigned long j=0;j<100;j++) fval=fastwindow01(farray[i]); cout << "Fast windowing time=" << times(×tats) - starttime << endl; }
On a Pentium D machine using g++ without optimising, these were the results for the long experiments
Using g++ -O these were the results for the long experiments
Note in particular the change in speed of the "slow compare" test. With g++ -O4 these were the results for the long experiments
Now the "fast compare" test has been accelerated and the windowing routines are much faster too.
Conclusions
The "fast" versions were written on the assumption that floating point arithmetic is much slower than integer arithmetic. This is less true than it used to be. Note that
- Optimising has had very different effects on the slow and fast versions of the routines.
Opt Sqrt Compare Windowing Slow Fast Slow Fast Slow Fast None 153 160 301 169 97 226 -O 118 50 7 123 54 109 -O4 118 13 7 15 7 9 - Aggressive optimisation might eliminate code that seems to be doing nothing useful, so it's likely that some of the g++ -O4 results are bogus
- Neither the functions' code or the test code are ideal.
The moral is to time the code using as realistic a setting as possible.
References
- A Fast, Compact Approximation of the Exponential Function by Nicol N. Schraudolph in Neural Computation, 11,853-862 (1999)
- "Game Programming Gems 6", Michael Dickheiser (ed), Charles River Media, 2006
- Floating-Point Basics (Pete Becker)
- Bit Twiddling Hacks (Sean Eron Anderson)
- What Every Computer Scientist Should Know About Floating-Point Arithmetic (David Goldberg)
- Comparing floating point numbers (Bruce Dawson)