Trees, Graphs and C++

This document aims to provide enough background information to encourage you to write graph-related C++ code. First some Standard Containers are shown in action, and their use extended to deal with user-defined classes. Then some Tree and Graph concepts are introduced. Finally implementation ideas are given to help tackle some exercises, building on C++'s supplied containers.

Introduction to Standard Containers

Many books exist dealing with sorting and organising data. With the emergence of languages that support data abstraction, people tried to generalise data structures. The Smalltalk language introduced some useful ideas. These have been developed and brought into the mainstream with C++'s Standard Library. The Standard Library includes containers that hold data. Containers are grouped into 3 types

Sequential - the order of the elements matter (e.g. vector)
Associative - the order of the elements doesn't matter (e.g. set)
Adapters - containers that are modifications of other containers (e.g. stack)

These containers have many member functions in common. For example, s.size() gives the number of elements in the container s if s is a set, a list or even a string of characters. The Standard Library also includes algorithms that operate on the containers. To be of maximum use these algorithms need to be able to operate on as many types of containers as possible, so each container has member functions that give algorithms access to the data. For example, s.begin() will always return an Iterator (like a pointer) to the first element of a container s. By hiding the implementation details of containers in this way, high-level generic algorithms can be written which will not only work with the standard containers but also with any new containers that a programmer might write.

The Standard Library doesn't contain a tree container, but if you wrote one so that it had the standard hooks that algorithms require, then algorithms like copy etc. should work with it. Clearly a tree container would require some special routines that other containers wouldn't support, but C++ makes it easy to add functionality in, building on what you already have.

Developing Containers

Ways to extend the basic use of containers include

Using standard containers to store user-defined objects (see later)
Creating completely new containers (see the References section for examples)
Building new containers from existing ones (note however that containers weren't designed to be inherited from - they don't have a virtual destructor for example - so take care)

Queues

Before we look at trees, let's see how to use some simpler containers.

#include <deque>                                                       
#include <algorithm>                                            
#include <iostream>                                                        
using namespace std;                                                    
                                                                       
int main()                                                                 
{                                                             
deque<int> Q;       //  An empty, elastic, container ready to hold ints.
// A deque is a double-ended queue. Queues (unlike lists) are
// data structures where you wouldn't expect to insert new items
// anywhere else but at the ends

Q.push_back(0);     //  Many containers have a push_back
                    //  routine. It's a way to add elements
                    //  to a container in a safe way - the
                    //  container will grow if it's not big enough.
Q.push_back(2);                                                          
Q.push_back(1);     

for(int i=0;i<Q.size(); i++){
   cout << Q[i].value; cout << " ";
}
cout <<endl;

// Now let's sort!          
sort(Q.begin(), Q.end());
// the arguments we've given to sort mean that the whole queue
// will be sorted, and that the result will be copied back to
// the original queue. sort uses the default sorting criterium
// for integers, and can use the swap function of Q to switch
// elements around. 

for(int i=0;i<Q.size(); i++){
   cout << Q[i].value; cout << " ";
}
cout <<endl;
// Output: 0 1 2
                    
}

To create a queue of strings rather than a queue of integers, we don't need to change much. This ease of coping with different data types will be useful when we come to create trees.

#include <deque>                                                       
#include <string> // added so that we can use strings
#include <algorithm>                                            
#include <iostream>                                                        
using namespace std;                                                    
                                                                       
int main()                                                                 
{                                                             
deque<string> Q;       // changed

Q.push_back("standard");           
Q.push_back("template");                                                        
Q.push_back("library");     

for(int i=0;i<Q.size(); i++){
   cout << Q[i].value; cout << " ";
}
cout <<endl;

// Now let's sort!          
sort(Q.begin(), Q.end());
// sort uses the default sorting criterium for strings - alphabetical

for(int i=0;i<Q.size(); i++){
   cout << Q[i].value; cout << " ";
}
cout <<endl;
}

As well as being able to have queues of standard items like strings and integers, we can also invent our own types and have queues of those. The main problem is that when we sort we have to specify the sort criteria for our own types. It's sufficient to specify what the '<' operator should do for your own type. Here's an example

#include <deque>                                                       
#include <algorithm>                                            
#include <iostream>                                                        
using namespace std;                                                    

class simple {
public:
  int value;
  // redefine the '<' operator, so when 2 simple objects are
  // compared, the 'value' fields of the objects are compared.
  const bool operator<(const simple &rhs) const
        { return this->value < rhs.value; }
};
                                                                       
int main()                                                                 
{                                                             
 deque<simple> Q;       //  An empty, elastic, container.

 simple s;
 s.value=0;
 Q.push_back(s); 
 s.value=2;
 Q.push_back(s);   
 s.value=1;                                          
 Q.push_back(s);     

 for(int i=0;i<Q.size(); i++){
   cout << Q[i].value; cout << " ";
 }
 cout <<endl;
 // Output: 0 2 1

 // Now let's sort! sort uses the '<' operator that we created in the 'simple' class   
 sort(Q.begin(), Q.end());

 for(int i=0;i<Q.size(); i++){
   cout << Q[i].value; cout << " ";
 }
 cout <<endl; 
 // Output: 0 1 2
}

Trees

There are many types of tree structures (threaded trees, Splay trees, 2-3 trees, etc). The choice depends on many factors - how dynamic the data is, whether best-case or worst-case behaviour is the most important, whether you want to avoid recursion, etc. Trees are often a natural choice (e.g. when writing a chess program) but they are also effective in many situations where data that's received sequentially can be beneficially stored in a more organised fashion - e.g. parsing mathematical expressions, parsing sentences, reading words that later need to be searched, or retrieved in alphabetical order.

Binaries trees (where each node has at most 2 children) are commonly used - not because they most closely model reality, but because they're computationally and mathematically more easy to deal with.

Suppose that a program were to be given 10,000 words (non-unique) and had to list the unique words used. Suppose also that memory was at a premium, so that storing every non-unique word is to be avoided. Each time a word is read the program needs to check if the word's new and store it only if it is. If the words are stored in an array alphabetically, searching will be faster, but adding a word to the start of the array will be slow because subsequent words will have to be shifted.

Using a binary tree will speed things up. Each time a word is read, the program starts at the top of the tree. If the node is empty, the word is stored there, otherwise the left or right branch is taken according to whether the word comes before or after the word stored in the node. For example, suppose the first 5 words were "standard", "template", "library", "data" and "structures". "standard" would be stored in the top node. "template" would be stored in the right-child node of the top node. "library" would be stored in the left-child node of the top node.

              standard
               /   \
          library  template

"data" comes before "standard" so the left link will be taken. "data" comes before "library", so "data" will be stored in the left-child node of "library". The final tree looks like this

              standard
               /   \
          library  template
             /      /
           data   structures

If this tree remains balanced, then it could fit 10,000 words within 15 layers, so a word-search would require only 15 comparisons at most. Insertions (new nodes) are cheap too. Contrast this with the worst case array scenario, where 10,000 comparisons might need to be made before a search is complete!
However, if the words are supplied in alphabetical order, this tree-based method suffers - each new word will add a right-node, deepening the tree each time, making searching as slow as with arrays. Various techniques exist to combat this. The benefits of these techniques are so enormous that simple binary trees aren't used in big programs - see the appendix.

Tree Traversal

There are 2 basic approaches to brute-force tree searching - breadth-first and depth-first. With breadth-first you start at the top node then visit all its childen before visiting its children's childen - i.e. you explore the tree layer by layer. With depth-first searching you keep going down, backtracking when you can go no further. The purpose of the search and the shape of the tree affect the choice of approach. For example, if you're playing chess and you don't have much time to think, it's better to do a breadth-first search rather than looking at one move (which might be bad anyway) at great depth.

With either method there's a choice of ways to order operations on the node.

pre-order - node, left-branch, right-branch
in-order - left-branch, node, right-branch
post-order - left-branch, right-branch, node

As an example consider the following tree, which represents a mathematical expression.

             -
           /   \
          *     *
         / \   / \
        2   + 3   -
           / \   / \
          x   5 y   4

If you do a depth-first search, processing each left-branch and right-branch of a node before printing the node's contents you'll get

      2 x 5 + * 3 y 4 - * -

(check to see if you agree with this!) which is the expression in post-fix notation but if you deal with nodes differently (still depth-first search, but this time processing the left-branch, printing the contents of the node, then processing the right-branch) you'll get something much closer to the normal representation.

The depth-first algorithm can be described more formally as follows

Form one element queue Q consisting of the root node
Until the Q is empty or the goal has been reached, determine if the first element in Q is the goal.

If it is, do nothing
If is isn't, remove the first element from Q and add the first element's children, if any, to the front of the queue (so that children are searched before siblings)

If the goal is reached, success; else failure

Try this algorithm by hand on the following tree

             A
           /   \
          B     C
         / \   / \
        D   E F   G
           / \   / \
          H   I J   K

The Standard Library makes it relatively easy to translate this algorithm into code nearly line-by-line. If each vertex has a list of sub-vertices called child then

    
    list<Vertex > q;
    q.push_back( start );

creates and initialises a list, and

     while( !q.empty( ) ){
         Vertex v = q.front( ); // get the first item
         if (!v.goal_reached()) { 
            q.pop_front( );      // remove it from the list
            for( int i = 0; i < v.child.size( ); i++ )
               q.push_front(v.child[i]); // push each child to the front of the queue
         }
     }

modifies the list.

It's sometimes possible to sort this list of nodes to be visited so that the most promising nodes are visited first.

Hill climbing - This method tries the locally-best candidate. It's like trying to find the highest peak in the highlands by always taking the steepest slope from your present location. In terms of the algorithm above, it means sorting the children before adding them to the front of the queue.
Best-first - With this method, it's not just the children which are sorted; the whole queue is sorted.

Suppose you started at point A of this 2D landscape.

With the Hill climbing algorithm you'd first go to B (the highest available point) then C then D, before backtracking to A and going to E then F.
With the Best-first algorithm you'd first go to B, but then you'd go to E then F.

The code for these methods follows the stages of the algorithm closely. We've seen earlier how sort can be used to sort a list of data structures like Vertex

     deque<Vertex> q;
     deque<Vertex> new_wave;
     while( !q.empty( ) ){
         Vertex v = q.front( ); // get the first item
         if (!v.goal_reached()) { 
            q.pop_front( );      // remove it from the list
            for( int i = 0; i < v.child.size( ); i++ )
               new_wave.push_front(v.child[i]); // push each child onto a queue
            // Now decide whether to sort this new queue before adding
            // it to the main queue, or add it to the main queue before
            // sorting the whole queue
            if (mode==hill_climbing) {
              sort(new_wave.begin(),new_wave.end());
              while(!new_wave.empty()){
                 q.push_front(new_wave.back( ));
                 new_wave.pop_back( );
              }
            }
            else {
              while(!new_wave.empty()){
                 q.push_front(new_wave.back( ));
                 new_wave.pop_back( );
              }
              sort(q.begin(),q.end());
            }
         }
     }

Branch and Bound methods break the set of feasible solutions into smaller subsets, and find for each subset a lower bound (as high as possible) on the function one's trying to maximise in the hope of isolating the optimal solution. These subsets often take the form of subtrees. If the subsets are well chosen, and if the bounds are easy to calculate, these methods can greatly reduce the number of nodes that need to be explicitly searched.

Many tree traversal strategies require the program to remember previous nodes in the path so that backtracking is possible. This (and recursion) can use up a lot of memory. With threaded trees each node knows its sucessor - with a depth-first search parents lead to their first child which in turn leads to siblings. The last sibling leads to the next node to be visited. Maintaining such a tree requires more work when nodes are added or removed, and makes traversal less flexible but with each node having a successor, it becomes more possible to treat the Graph as a container on which iterators can operate.

Graphs

A tree is a special case of the more general graph (or net). Graphs have many uses from network analysis to spreadsheet operation. Graphs have vertices (or nodes) and edges (which can be one-way or undirected). Edges can have a cost associated with them, and the cost may depend on the direction taken along the edge. Popular graph-related puzzles include

Finding whether one can traverse all edges without traversing any twice (an Euler Path).
The Travelling Salesman Problem - how to visit all the nodes at a minimal cost
Finding the shortest (least cost) path between 2 vertices
Finding the "minimal spanning tree" - finding a tree (with the least-cost edges) that includes all nodes

More formally, a graph is a pair (V,E), where V is a finite set and E is a binary relation on V. V is called a vertex set whose elements are called vertices. E is a collection of edges, where an edge is a pair (u,v) with u,v in V. In a directed graph, edges are ordered pairs, connecting a source vertex to a target vertex. In an undirected graph edges are unordered pairs and connect the two vertices in both directions, hence in an undirected graph (u,v) and (v,u) are two ways of writing the same edge.

If some edge (u,v) is in graph , then vertex v is adjacent to vertex u. In a directed graph, edge (u,v) is an out-edge of vertex u and an in-edge of vertex v. In an undirected graph edge (u,v) is incident on vertices u and v. In a directed graph, the number of out-edges of a vertex is its out-degree and the number of in-edges is its in-degree. For an undirected graph, the number of edges incident to a vertex is its degree.

A path is a sequence of edges in a graph such that the target vertex of each edge is the source vertex of the next edge in the sequence. If there is a path starting at vertex u and ending at vertex v we say that v is reachable from u. A path is simple if none of the vertices in the sequence are repeated. A path is a cycle if the first and last vertex in the path are the same. A graph with no cycles is acyclic.

Graph Traversal

Graphs can be traversed much as trees can (depth-first, breadth-first, etc), but care must be taken not to get stuck in a loop - trees by definition don't have cycles, and in a tree there's always only one path from the root to a node whereas in a graph there may be many paths between any pair of nodes.

The following directed graph has 6 nodes. The Dijkstra-Moore algorithm can be used to find the shortest path from a given node to each of the others as long as costs aren't negative. A breadth-first search from the first node radiates out across the graph. If it reaches an unvisited node, it sets its distance from the first node to be the distance of the previous node from the first node, plus the length of the edge from the previous node. If a node's already been visited, it will already have been given a distance from the first node, but this might not be the shortest distance - the current path might be shorter. So a comparison is done, and the node's distance is set to the minimum.

For example, in the graph below, suppose that A was the source node. In the first stage, node B's distance would be set to 7 and node C's distance to 1. During the second stage, routes from nodes B and C would be explored. When node B is reached from node C, the distance is 1+5, so node B's distance from A is reset to 6 from its original 7. As the algorithm progresses, node B's distance will later be reset to 5, which is the final value for node B's distance. The algorithm terminates after 5 stages because by then all of the routes will have been traversed - routes can't have more than number_of_vertices - 1 edges.

Implementing Graph Algorithms

Graphs provide useful source material for practising C++ - inheritance provides a useful mechanism for integrating node, edge and graph structures, and the Standard Library takes away much of the chore. Some issues to bear in mind are

You could start by writing some tree-traversal code to deal with the earlier examples, but don't forget that trees are a subset of graphs and that your code should be generalisable.
Think about how the tree/graph information will be set-up. If files are to be read in, what format should the files have?
As with the standard containers and algorithms, it might be worth trying to separate the graph information from the operations to be performed on the graph.

See the References below for a list of books and online courses. Below is one way that graph algorithms could be implemented, showing sample input and output.

A graph is a Standard Library vector of vertices. Each vertex has an Standard Library vector of edges. The vertices and edges can have associated information (cost, names, etc).

For flexibility the details of the search task are separate from the configuration of the graph.

enum ordering {preorder, inorder, postorder};
enum search_mode {breadth_first, depth_first, hill_climbing, best_first, Astar, shortest_paths, min_spanning_tree};
class Search_Description
{
public:
  ordering order;
  search_mode searching;
  string startnode;
  string goalnode;
  bool printnode;        // whether the visited node-name is printed
};

The graph and search specifications can be read from a file or be specified interactively. The minimum-distance problem above can be set up using the following file
```
A B 7 anon 
A C 1 anon 
B D 4 anon
B F 1 anon
C B 5 anon
C E 2 anon
C F 7 anon
E B 2 anon
E D 5 anon
F E 3 anon 
SEARCHMODE shortest
STARTNODE  A
```
Most of these lines specify directed edges with weights (the "anon" means that in this case the edges don't have names). Vertices are automatically created on demand.

When this file is fed into the program the output is

B is 5 from the source.
Shortest path is A to C to E to B
C is 1 from the source.
Shortest path is A to C
D is 8 from the source.
Shortest path is A to C to E to D
F is 6 from the source.
Shortest path is A to C to E to B to F
E is 3 from the source.
Shortest path is A to C to E

The program works with trees too. The hill climbing problem above can be set up with the following file

A B 1 anon
B C 1 anon
C D 1 anon
A E 1 anon
E F 1 anon
NODEVALUE A 0
NODEVALUE B 2
NODEVALUE C 0
NODEVALUE D 1
NODEVALUE E 1
NODEVALUE F 3
// For best-first change the next line to  SEARCHMODE bestfirst
SEARCHMODE hill
STARTNODE A
GOALNODE F

Graph member functions include

shortestpaths( const string & startnodename) - lists shortest paths from given node.
min_spanning_tree() - prints the edges of the minimum weight spanning tree
astarsearch(const string & startName, const string & goalnode) - Uses an A* search to find the goal node. Each node is assumed to have been given x and y coordinates (by using the NODECOORDS directive in the input file). The admissable heuristic used is distance travelled so far + euclidean distance from current node to specified goal node
cyclecheck() - checks if the graph contains a cycle
traverse(const search_mode & traverse_mode, const string & sourceName, const string &goalnode) - performs bread-first and depth-first traversals
informedsearch(const search_mode &,const string & sourceName, const string &goalnode) - does hill-climbing and best-first.

Graphs can also be created programatically rather than being read in from a file. Here's an example which creates a 6-node graph then works out the shortest paths from node "B" -

#include "graphs.cc"

int main( int argc, char *argv[ ] )
{
    Graph g;
    Search_Description s;
    vector<string> nodenames;

    nodenames.push_back("A");
    nodenames.push_back("B");
    nodenames.push_back("C");
    nodenames.push_back("D");
    nodenames.push_back("E");
    nodenames.push_back("F");

    for (int i=0;i<6;i++)
       for (int j=i+1;j<6;j++)
          g.addEdge(nodenames[i],nodenames[j],i+j,"anon");

    g.shortestpaths("B");
    return 0;
}

Exercises

Design (without writing code) a class for binary trees, describing the variables and member functions.

References

Note that the STL (Standard Template Library) has been subsumed into C++'s Standard Library.

Books

"Designing Components with the C++ STL: A New Approach to Programming", 2/e, Ulrich Breymann, Publisher: Addison Wesley, ISBN: 0-201-67488-2. In section III (BEYOND THE STL: COMPONENTS AND APPLICATIONS) it develops a Graph class along with some algorithms.
"Data Structures with STL", Murray and Pappas, Publisher: Prentice Hall, ISBN: 0-13-028927-2. Ends with graphs, looking at Depth-First and Breadth-First searching, Floyd's Algorithm Implementation with STL Vectors, etc.
"Graphs and Algorithms", Gondran and Minoux, Wiley, 1984. (in the CUED library). Maths.
"Data Structures via C++", Berman, A.M., OUP, 1997. (in the CUED library). No STL, but has trees and graphs.

Code and documentation

Theory - Algorithms - Complexity
The Boost Graph Library - a sophisticated STL-based library complete with documentation
Graph examples (Zeus-Framework)
Graphviz - Graph Visualization Software
http://www.mrs.umn.edu/~mcphee/Courses/1996_97/Spring/CSci_3190/Assignments/Graph/
http://www-ca.informatik.uni-tuebingen.de/dagstuhl/gpdag_28.html
http://www.bioinformatics.ucla.edu/events/202/background.html
http://www.cs.hope.edu/~alganim/ccaa/graph.html
http://www.student.seas.gwu.edu/~idsv/
"A Language and a Program for stating and solving combinatorial problems", Laurière, J.L., Artificial Intelligence, Vol. 10, n. 1, pp. 29-127 (1978) - describing ALICE, a language and system that can solve a wide range of problems ("Missionaries and Cannibals", "8 Queens problem", "Travelling Salesman Problem", puzzles like "Donald+Gerald=Robert").
See CUED's 3I1 lecture notes.

Appendix

red-black trees are used internally by the Standard Library as a basis for the associative container classes.
A red-black tree is a binary search tree with one extra attribute for each node: the colour, which is either red or black. It has the following red-black properties:
- Every node is either red or black.
- Every leaf is black.
- If a node is red, then both its children are black.
- Every simple path from a node to a descendant leaf contains the same number of black nodes.
Such trees can remain balanced - and thus guarantee O(logn) search times - in a dynamic environment. Or more importantly, since any tree can be re-balanced - but at considerable cost - can be re-balanced in O(logn) time.
AVL trees are a kind of balanced binary search trees, allowing for logarithmic time element find, insert, and delete operations. AVL trees are thoroughly discussed in D.E. Knuth's The Art of Computer Programming Vol. III, 6.2.3, pp 465ff. (from http://www.enteract.com/~bradapp/ftp/src/libs/C++/AvlTrees.html)
An AVL tree is a special type of binary tree that is always "partially" balanced. The criteria that is used to determine the "level" of "balanced-ness" is the difference between the heights of subtrees of a root in the tree. In an AVL tree the difference between the height of the right and left subtrees (or the root node) is never more than one. Whenever an item is inserted or deleted, a check is made to see if the tree has become unbalanced. If it has, balance is restored by performing a set of manipulations (called "rotations") on the tree. These rotations come in two flavors: single rotations and double rotations (and each flavor has its corresponding "left" and "right" versions).

Department of Engineering

IT Services