Department of Engineering

IT Services

Specialist Areas


If you're using any of the maths routines (sqrt, sin, etc) remember that you'll need to include cmath to declare the routines.

Before you start writing much maths-related code, check to see that it hasn't all been done before. Many maths routines, including routines that offer arbitrary precision are available from netlib. Other resources are listed on CUED's maths page. Before you do any heavy computation, especially with real numbers, I suggest that you browse through a Numerical Analysis book. Things to avoid are

  • Finding the difference between very similar numbers (if you're summating an alternate sign series, add all the positive terms together and all the negative terms together, then combine the two).
  • Dividing by a very small number (try to change the order of operations so that this doesn't happen).
  • Multiplying by a very big number.

Common problems that you might face are :-

Testing for equality :-
Real numbers are handled in ways that don't guarantee expressions to yield exact results. Just as in base 10 to 2 decimal places, 3 * (10/3) doesn't equal 10, so computer arithmetic has its limitations. It's especially risky to test for exact equality. Better is to use something like

  d = max(1.0, fabs(a), fabs(b))

and then test fabs(a - b) / d against a relative error margin. Useful constants in <climits> are FLT_EPSILON, DBL_EPSILON, and LDBL_EPSILON, defined to be the smallest numbers such that

 1.0f + FLT_EPSILON  != 1.0f
 1.0  + DBL_EPSILON  != 1.0 
 1.0L + LDBL_EPSILON != 1.0L


Avoiding over- and underflow :-
You can test the operands before performing an operation in order to check whether the operation would work. You should always avoid dividing by zero. For other checks, split up the numbers into fractional and exponent part using the frexp() and ldexp() library functions and compare the resulting values against HUGE (all in <cmath>).

Floats and Doubles :-
You can use the information in <limits> or use sizeof(double), etc to compare the size of float, double and long double.

  • Keep in mind that the double representation does not necessarily increase the precision over float. Actually, in most implementations the worst-case precision decreases but the range increases.

  • Do not use double or long double unnecessarily since there may a large performance penalty. Furthermore, there is no point in using higher precision if the additional bits which will be computed are garbage anyway. The precision one needs depends mostly on the precision of the input data and the numerical method used.
    #include <iostream>
    using namespace std;
    int main(int argc, char* argv[])
    float f1;
    double d1;
    long double ld1;
    cout << "f1=" << f1 << ", f1**2=" << f1*f1 << endl;
    cout << "d1=" << d1 << ", d1**2=" << d1*d1 << endl;
    cout << "ld1=" << ld1 << ", ld1**2=" << ld1*ld1 << endl;
    return 0;

Infinity :-

The IEEE standard for floating-point maths recommends a set of functions to be made available. Among these are functions to classify a value as NaN, Infinity, Zero, Denormalized, Normalized, and so on. Most implementations provide this functionality, although there are no standard names for the functions. Such implementations often provide predefined identifiers (such as _NaN, _Infinity, etc) to allow you to generate these values.

If x is a floating point variable, then (x != x) will be TRUE if and only if x has the value NaN. Some C++ implementations claim to be IEEE 748 conformant, but if you try the (x!=x) test above with x being a NaN, you'll find that they aren't.

The following sythesises a NaN on HP machines, then uses isnan().

#include <iostream>
#include <cmath>
using namespace std;

double make_a_NaN ()
double ret;
char* pointer = reinterpret_cast<char*>(&ret);
int i;
// IEEE spec for doubles on HPs
//   0 : sign bit
//   1-11: exponent
//   12-63: fraction
// For QNaNs on HPs
// set 1-11 to 1, 12=0, rest to 1

for (i=0; i<8; i++)
  *(pointer+i) = 255;
*(pointer+1)= 247;
return ret;

int main(int argc, char* argv[])
double d;

d=make_a_NaN ();
if (isnan(d))
  cout << d << " is a nan" << endl; 
  cout << d << " is not a nan" << endl;
return 0;


An increasing amount of maths work is being done with C++. One development is the Template Numerical Toolkit built around the Standard Library.

Hardware Interfacing: bit operations and explicit addresses

If you're interfacing with hardware and need to operate on bits you can use bitset but you may prefer to use C++'s low-level operations.

Setting a bit :-
Suppose you wanted to set bit 6 of i (a long, say) to 1. First you need to create a mask that has a 1 in the 6th bit and 0 elsewhere by doing `1L<<6' which shifts all the bits of the long 1 left 6 bits. Then you need to do a bit-wise OR using `i = i | (1L<<6)'.

Unsetting a bit :-
Suppose you wanted to set bit 6 of i (a long, say) to 0. First you need to create a mask that has a 0 in the 6th bit and 1 elsewhere by doing `1L<<6' then inverting the bits using the ~ operator. Then you need to do a bit-wise AND using the & operator. The whole operation is `i =i & ~(1<<6)' which can be contracted to `i &= ~(1<<6)'.

Creating a mask :-
In graphics, masks are often created each of whose bits represent a option that is to be selected in some way. Each bit can be referred to using an alias that has been set up in an include file. For example, a mask which could be used in a call to make a window sensitive to key presses and buttonpresses could be set up by doing

unsigned int mask = KeyPressMask | ButtonPressMask ;

Suppose because of hardware considerations the i variable needed to be the value of memory location 2000. The following code will do what you want.

long *memory_pointer = reinterpret_cast<long*>(2000);
long i = *memory_pointer;

Using routines from other languages


When your C++ code is presented with the prototypes of C routines, the prototypes need to be bracketed by the following

#ifdef __cplusplus
extern "C" {
#ifdef __cplusplus

so that the code compiles.


This is a rather machine-specific issue. Rudi Vankemmel ( mentions some general points that are worth noting :-

Fortran uses a column wise storage of matrices while C++ stores them row wise. This means that when you want to parse a matrix from your C++ program to the fortran routine you must transpose the matrix in your program before entering the routine. Of course, any output from such a routine must be transposed again.

If you omit this step, then probably your program will run (because it has data to compute on) but it will generate wrong answers.

If you have the Fortran source code (of any routine) then on some platforms you may use compiler directives specifying that the Fortran compiler must use row wise storage. Some platforms support these directives. However watch out with this if you call the same routine from another Fortran routine/program.

Your Fortran compiler may add an underscore to the routine name in the symbol table. Hence in the calling C++ program you must add a trailing underscore! Otherwise the loader will complain about an undefined symbol.

Fortran passes its variables by reference. This means that you MUST give adresses in your calling C++ program.

Watch out especially with floats and doubles. Make sure that the size of the variable in the calling program is identical to the size in the Fortran routine.