Data Structures and C++
Introduction to the STL
Many books exist dealing with sorting and organising data. With the emergence of languages that support data abstraction, people tried to generalise data structures. The Smalltalk language introduced some useful ideas. These have been developed and brought into the mainstream with C++'s STL (Standard Template Library) The STL includes containers that hold data. Containers are grouped into 3 types
 Sequential  the order of the elements matter (e.g. vector)
 Associative  the order of the elements doesn't matter(e.g. set)
 Adapters  containers that are modifications of other containers (e.g. stack)
These containers have many member functions in common. For example, s.size() gives the number of elements in the container s if s is a set, a list or even a string of characters. The STL also includes algorithms that operate on the containers. To be of maximum use these algorithms need to be able to operate on as many types of containers as possible, so each container has member functions that give algorithms access to the data. For example, s.begin() will always return an Iterator (like a pointer) to the first element of a container s. By hiding the implementation details of containers in this way, highlevel generic algorithms can be written which will not only work with the standard containers but also with any new containers that a programmer might write.
The STL doesn't contain a tree container, but if you wrote one so that it had the standard hooks that algorithms require, then algorithms like copy etc. should work with it. Clearly a tree container would require some special routines that other containers wouldn't support, but C++ makes it easy to add functionality in, building on what you already have. Before we look at trees, let's see how to use some simpler containers.
Queues
#include <deque> #include <algorithm> #include <iostream> using namespace std; int main() { deque<int> Q; // An empty, elastic, container ready to hold ints. // A deque is a doubleended queue. Queues (unlike lists) are // data structures where you wouldn't expect to insert new items // anywhere else but at the ends Q.push_back(0); // Many containers have a push_back // routine. It's a way to add elements // to a container in a safe way  the // container will grow if it's not big enough. Q.push_back(2); Q.push_back(1); // the next line is a bit of magic you'll have to take on trust // for now  it "copies" (i.e. prints) the elements of Q to // the output stream. A simple "for" loop could be used instead // but it's tempting to avoid using "for" altogether. copy(Q.begin(), Q.end(), ostream_iterator<int>(cout, " ")); // Output: 0 2 1 cout <<endl; // Now let's sort! sort(Q.begin(), Q.end()); // the arguments we've given to sort mean that the whole queue // will be sorted, and that the result will be copied back to // the original queue. sort uses the default sorting criterium // for integers, and can use the swap function of Q to switch // elements around. copy(Q.begin(), Q.end(), ostream_iterator<int>(cout, " ")); // Output: 0 1 2 cout <<endl; }
To create a queue of strings rather than a queue of integers, we don't need to change much. This ease of coping with different data types will be useful when we come to create trees.
#include <deque> #include <string> // added so that we can use strings #include <algorithm> #include <iostream> using namespace std; int main() { deque<string> Q; // changed Q.push_back("standard"); Q.push_back("template"); Q.push_back("library"); // the next line has been changed to support strings rather than integers copy(Q.begin(), Q.end(), ostream_iterator<string>(cout, " ")); cout <<endl; // Now let's sort! sort(Q.begin(), Q.end()); // sort uses the default sorting criterium for strings  alphabetical // the next line has been changed to support strings rather than integers copy(Q.begin(), Q.end(), ostream_iterator<string>(cout, " ")); cout <<endl; }
As well as being able to have queues of standard items like strings and integers, we can also invent our own types and have queues of those. The main problem is that when we sort we have to specify the sort criteria for our own types. It's sufficient to specify what the '<' operator should do for your own type. Here's an example /p>
#include <deque> #include <algorithm> #include <iostream> using namespace std; class simple { public: int value; // redefine the '<' operator, so when 2 simple objects are // compared, the 'value' fields of the objects are compared. const bool operator<(const simple &rhs) const { return this>value < rhs.value; } }; int main() { deque<simple> Q; // An empty, elastic, container. simple s; s.value=0; Q.push_back(s); s.value=2; Q.push_back(s); s.value=1; Q.push_back(s); for(int i=0;i<Q.size(); i++){ cout << Q[i].value; cout << " "; } cout <<endl; // Output: 0 2 1 // Now let's sort! sort uses the '<' operator that we created in the 'simple' class sort(Q.begin(), Q.end()); for(int i=0;i<Q.size(); i++){ cout << Q[i].value; cout << " "; } cout <<endl; // Output: 0 1 2 }
Trees
There are many types of tree structures (threaded trees, Splay trees, 23 trees, etc). The choice depends on many factors  how dynamic the data is, whether bestcase or worstcase behaviour is the most important, whether you want to avoid recursion, etc. Trees are often a natural choice (e.g. when writing a chess program) but they are also effective in many situations where data that's received sequentially can be beneficially stored in a more organised fashion  e.g. parsing mathematical expressions, parsing sentences, reading words that later need to be searched, or retrieved in alphabetical order.
Binaries trees (where each node has at most 2 children) are commonly used  not because they most closely model reality, but because they're computationally and mathematically more easy to deal with.
Suppose that a program were to be given 10,000 words (nonunique) and had to list the unique words used. Suppose also that memory was at a premium, so that storing every nonunique word is to be avoided. Each time a word is read the program needs to check if the word's new and store it only if it is. If the words are stored in an array alphabetically, searching will be faster, but adding a word to the start of the array will be slow because subsequent words will have to be shifted.
Using a binary tree will speed things up. Each time a word is read, the program starts at the top of the tree. If the node is empty, the word is stored there, otherwise the left or right branch is taken according to whether the word comes before or after the word stored in the node. For example, suppose the first 5 words were "standard", "template", "library", "data" and "structures". "standard" would be stored in the top node. "template" would be stored in the rightchild node of the top node. "library" would be stored in the leftchild node of the top node.
standard / \ library template
"data" comes before "standard" so the left link will be taken. "data" comes before "library", so "data" will be stored in the leftchild node of "library". The final tree looks like this
standard / \ library template / / data structures
If this tree remains balanced, then it could fit 10,000 words within 15 layers, so a wordsearch would require only 15 comparisons at most. Insertions (new nodes) are cheap too. Contrast this with the worst case array scenario, where 10,000 comparisons might need to be made before a search is complete!
However, if the words are supplied in alphabetical order, this treebased method suffers  each new word will add a rightnode, deepening the tree each time, making searching as slow as with arrays. Various techniques exist to combat this. The benefits of these techniques are so enormous that simple binary trees aren't used in big programs.
 redblack trees are used internally by the STL as a basis for the associative
container classes.
(from http://swww.ee.uwa.edu.au/~plsd210/ds/red_black.html)
A redblack tree is a binary search tree with one extra attribute for each node: the colour, which is either red or black. It has the following redblack properties: Every node is either red or black.
 Every leaf is black.
 If a node is red, then both its children are black.
 Every simple path from a node to a descendant leaf contains the same number of black nodes.

AVL trees are a kind of balanced binary search trees, allowing for logarithmic time element find, insert, and delete operations. AVL trees are thoroughly
discussed in D.E. Knuth's The Art of Computer Programming Vol. III, 6.2.3, pp 465ff.
(from
http://www.enteract.com/~bradapp/ftp/src/libs/C++/AvlTrees.html)
An AVL tree is a special type of binary tree that is always "partially" balanced. The criteria that is used to determine the "level" of "balancedness" is the difference between the heights of subtrees of a root in the tree. In an AVL tree the difference between the height of the right and left subtrees (or the root node) is never more than one. Whenever an item is inserted or deleted, a check is made to see if the tree has become unbalanced. If it has, balance is restored by performing a set of manipulations (called "rotations") on the tree. These rotations come in two flavors: single rotations and double rotations (and each flavor has its corresponding "left" and "right" versions).
Tree Traversal
There are 2 basic approaches to bruteforce tree searching  breadthfirst and depthfirst. With breadthfirst you start at the top node then visit all its childen before visiting its children's childen  i.e. you explore the tree layer by layer. With depthfirst searching you keep going down, backtracking when you can go no further. The purpose of the search and the shape of the tree affect the choice of approach. For example, if you're playing chess and you don't have much time to think, it's better to do a breadthfirst search rather than looking at one move (which might be bad anyway) at great depth.
With either method there's a choice of ways to order operations on the node.
 preorder  node, leftbranch, rightbranch
 inorder  leftbranch, node, rightbranch
 postorder  leftbranch, rightbranch, node
As an example consider the following tree, which represents a mathematical expression.
 / \ * * / \ / \ 2 + 3  / \ / \ x 5 y 4
If you do a depthfirst search, processing each leftbranch and rightbranch of a node before printing the node's contents you'll get
2 x 5 + * 3 y 4  * 
(check to see if you agree with this!) which is the expression in Reverse Polish notation (used by older calculators and the postscript language) but if you deal with nodes differently (still depthfirst search, but this time processing the leftbranch, printing the contents of the node, then processing the rightbranch) you'll get something much closer to the normal representation.
The depthfirst algorithm can be described more formally as follows
 Form one element queue Q consisting of the root node
 Until the Q is empty or the goal has been reached, determine if the first element in Q is the goal.
 If it is, do nothing
 If is isn't, remove the first element from Q and add the first element's children, if any, to the front of the queue (so that children are searched before siblings)
 If the goal is reached, success; else failure
Try this algorithm by hand on the following tree
A / \ B C / \ / \ D E F G / \ / \ H I J K
The STL makes it relatively easy to translate this algorithm into code nearly linebyline. If each vertex has a list of subvertices called child then
list<Vertex > q; q.push_back( start );
creates and initialises a list, and
while( !q.empty( ) ){ Vertex v = q.front( ); // get the first item if (!v.goal_reached()) { q.pop_front( ); // remove it from the list for( int i = 0; i < v.child.size( ); i++ ) q.push_front(v.child[i]); // push each child to the front of the queue } }
modifies the list.
It's sometimes possible to sort this list of nodes to be visited so that the most promising nodes are visited first.
 Hill climbing  This method tries the locallybest candidate. It's like trying to find the highest peak in the highlands by always taking the steepest slope from your present location. In terms of the algorithm above, it means sorting the children before adding them to the front of the queue.
 Bestfirst  With this method, it's not just the children which are sorted; the whole queue is sorted.
Suppose you started at point A of this 2D landscape.
 With the Hill climbing algorithm you'd first go to B (the highest available point) then C then D, before backtracking to A and going to E then F.
 With the Bestfirst algorithm you'd first go to B, but then you'd go to E then F.
The code for these methods follows the stages of the algorithm closely. We've seen earlier how sort can be used to sort a list of data structures like Vertex
deque<Vertex> q; deque<Vertex> new_wave; while( !q.empty( ) ){ Vertex v = q.front( ); // get the first item if (!v.goal_reached()) { q.pop_front( ); // remove it from the list for( int i = 0; i < v.child.size( ); i++ ) new_wave.push_front(v.child[i]); // push each child onto a queue // Now decide whether to sort this new queue before adding // it to the main queue, or add it to the main queue before // sorting the whole queue if (mode==hill_climbing) { sort(new_wave.begin(),new_wave.end()); while(!new_wave.empty()){ q.push_front(new_wave.back( )); new_wave.pop_back( ); } } else { while(!new_wave.empty()){ q.push_front(new_wave.back( )); new_wave.pop_back( ); } sort(q.begin(),q.end()); } } }
Branch and Bound methods break the set of feasible solutions into smaller subsets, and find for each subset a lower bound (as high as possible) on the function one's trying to maximise in the hope of isolating the optimal solution. These subsets often take the form of subtrees. If the subsets are well chosen, and if the bounds are easy to calculate, these methods can greatly reduce the number of nodes that need to be explicitly searched.
Many tree traversal strategies require the program to remember previous nodes in the path so that backtracking is possible. This (and recursion) can use up a lot of memory. With threaded trees each node knows its sucessor  with a depthfirst search parents lead to their first child which in turn leads to siblings. The last sibling leads to the next node to be visited. Maintaining such a tree requires more work when nodes are added or removed, and makes traversal less flexible but with each node having a successor, it becomes more possible to treat the Graph as an STL container on which iterators can operate.
Graphs
A tree is a special case of the more general graph (or net). Graphs have many uses from network analysis to spreadsheet operation. Graphs have vertices (or nodes) and edges (which can be oneway or undirected). Edges can have a cost associated with them, and the cost may depend on the direction taken along the edge. Popular graphrelated puzzles include
 Finding whether one can traverse all edges without traversing any twice (an Euler Path).
 The Travelling Salesman Problem  how to visit all the nodes at a minimal cost
 Finding the shortest (least cost) path between 2 vertices
 Finding the "minimal spanning tree"  finding a tree (with the leastcost edges) that includes all nodes
More formally, a graph is a pair (V,E), where V is a finite set and E is a binary relation on V. V is called a vertex set whose elements are called vertices. E is a collection of edges, where an edge is a pair (u,v) with u,v in V. In a directed graph, edges are ordered pairs, connecting a source vertex to a target vertex. In an undirected graph edges are unordered pairs and connect the two vertices in both directions, hence in an undirected graph (u,v) and (v,u) are two ways of writing the same edge.
If some edge (u,v) is in graph , then vertex v is adjacent to vertex u. In a directed graph, edge (u,v) is an outedge of vertex u and an inedge of vertex v. In an undirected graph edge (u,v) is incident on vertices u and v. In a directed graph, the number of outedges of a vertex is its outdegree and the number of inedges is its indegree. For an undirected graph, the number of edges incident to a vertex is its degree.
A path is a sequence of edges in a graph such that the target vertex of each edge is the source vertex of the next edge in the sequence. If there is a path starting at vertex u and ending at vertex v we say that v is reachable from u. A path is simple if none of the vertices in the sequence are repeated. A path is a cycle if the first and last vertex in the path are the same. A graph with no cycles is acyclic.
Graph Traversal
Graphs can be traversed much as trees can (depthfirst, breadthfirst, etc), but care must be taken not to get stuck in a loop  trees by definition don't have cycles, and in a tree there's always only one path from the root to a node whereas in a graph there may be many paths between any pair of nodes.
The following directed graph has 6 nodes. The DijkstraMoore algorithm can be used to find the shortest path from a given node to each of the others as long as costs aren't negative. A breadthsize search from the first node radiates out across the graph. If it reaches an unvisited node, it sets its distance from the first node to be the distance of the previous node from the first node, plus the length of the edge from the previous node. If a node's already been visited, it will already have been given a distance from the first node, but this might not be the shortest distance  the current path might be shorter. So a comparison is done, and the node's distance is set to the minimum.
For example, in the graph below, suppose that A was the source node. In the first stage, node B's distance would be set to 7 and node C's distance to 1. During the second stage, routes from nodes B and C would be explored. When node B is reached from node C, the distance is 1+5, so node B's distance is reset to 6 from its original 7. As the algorithm progresses, node B's distance will later be reset to 5, which is the final value for node B's distance. The algorithm terminates after 5 stages because by then all of the routes will have been traversed  routes can't have more than number_of_vertices  1 edges.
Implementing Graph Algorithms
Graphs provide useful source material for practising C++  inheritance provides a useful mechanism for integrating node, edge and graph structures, and the STL takes away much of the chore. See the References below for a list of books and online courses. Below is one way that graph algorithms could be implemented
A graph is an STL vector of vertices. Each vertex has an STL vector of edges. For flexibility the details of the search task are separate from the configuration of the graph.
enum ordering {preorder, inorder, postorder}; enum search_mode {breadth_first, depth_first, hill_climbing, best_first, Astar, shortest_paths, min_spanning_tree}; class Search_Description { public: ordering order; search_mode searching; string startnode; string goalnode; bool printnode; // whether the visited nodename is printed };
The vertices and edges can have associated information (cost, names, etc). The graph and search specifications can be read from a file or be specified interactively. The minimumdistance problem above can be set up using the following file
A B 7 anon A C 1 anon B D 4 anon B F 1 anon C B 5 anon C E 2 anon C F 7 anon E B 2 anon E D 5 anon F E 3 anon SEARCHMODE shortest STARTNODE A
Most of the lines specify directed edges with weights (the "anon" means that the edges don't have names). Vertices are automatically created on demand. The output is
B is 5 from the source. Shortest path is A to C to E to B C is 1 from the source. Shortest path is A to C D is 8 from the source. Shortest path is A to C to E to D F is 6 from the source. Shortest path is A to C to E to B to F E is 3 from the source. Shortest path is A to C to E
The hill climbing problem above can be set up with the following file
A B 1 anon B C 1 anon C D 1 anon A E 1 anon E F 1 anon NODEVALUE A 0 NODEVALUE B 2 NODEVALUE C 0 NODEVALUE D 1 NODEVALUE E 1 NODEVALUE F 3 // For bestfirst change the next line to SEARCHMODE bestfirst SEARCHMODE hill STARTNODE A GOALNODE F
Graph member functions include
 shortestpaths( const string & startnodename)  lists shortest paths from given node.
 min_spanning_tree()  prints the edges of the minimum weight spanning tree
 astarsearch(const string & startName, const string & goalnode)  Uses an A* search to find the goal node. Each node is assumed to have been given x and y coordinates (by using the NODECOORDS directive in the input file). The admissable heuristic used is distance travelled so far + euclidean distance from current node to specified goal node
 cyclecheck()  checks if the graph contains a cycle
 traverse(const search_mode & traverse_mode, const string & sourceName, const string &goalnode)  performs breadfirst and depthfirst traversals
 informedsearch(const search_mode &,const string & sourceName, const string &goalnode)  does hillclimbing and bestfirst.
Graphs can also be created programatically. Here's an example 
#include "graphs.cc" int main( int argc, char *argv[ ] ) { Graph g; Search_Description s; vector<string> nodenames; nodenames.push_back("A"); nodenames.push_back("B"); nodenames.push_back("C"); nodenames.push_back("D"); nodenames.push_back("E"); nodenames.push_back("F"); for (int i=0;i<6;i++) for (int j=i+1;j<6;j++) g.addEdge(nodenames[i],nodenames[j],i+j,"anon"); g.shortestpaths("B"); return 0; }
Exercises
 Design (without writing code) a class for binary trees, describing the variables and member functions.
References
Books
 "Designing Components with the C++ STL: A New Approach to Programming", 2/e, Ulrich Breymann, Publisher: Addison Wesley, ISBN: 0201674882. In section III (BEYOND THE STL: COMPONENTS AND APPLICATIONS) it develops a Graph class along with some algorithms.
 "Data Structures with STL", Murray and Pappas, Publisher: Prentice Hall, ISBN: 0130289272. Ends with graphs, looking at DepthFirst and BreadthFirst searching, Floyd's Algorithm Implementation with STL Vectors, etc.
 "Graphs and Algorithms", Gondran and Minoux, Wiley, 1984. (in the CUED library). Maths.
 "Data Structures via C++", Berman, A.M., OUP, 1997. (in the CUED library). No STL, but has trees and graphs.
Code and documentation
 The Boost Graph Library  a sophisticated STLbased library complete with documentation
 http://www.mrs.umn.edu/~mcphee/Courses/1996_97/Spring/CSci_3190/Assignments/Graph/
 http://wwwca.informatik.unituebingen.de/dagstuhl/gpdag_28.html
 http://www.bioinformatics.ucla.edu/events/202/background.html
 http://www.cs.hope.edu/~alganim/ccaa/graph.html
 http://www.student.seas.gwu.edu/~idsv/
 "A Language and a Program for stating and solving combinatorial problems", Laurière, J.L., Artificial Intelligence, Vol. 10, n. 1, pp. 29127 (1978)  describing ALICE, a language and system that can solve a wide range of problems ("Missionaries and Cannibals", "8 Queens problem", "Travelling Salesman Problem", puzzles like "Donald+Gerald=Robert").
 See CUED's E6 lecture notes.
See Also
 From An interview with
A. Stepanov  cowriter of the STL
Question: Could you explain to a modest C++ programmer what Generic Programming is, what is the relation of Generic Programming with C++ and STL, and how did you come to use Generic Programming in a C++ context?Answer: Generic programming is a programming method that is based in finding the most abstract representations of efficient algorithms. That is, you start with an algorithm and find the most general set of requirements that allows it to perform and to perform efficiently. The amazing thing is that many different algorithms need the same set of requirements and there are multiple implementations of these requirements. The analogous fact in mathematics is that many different theorems depend on the same set of axioms and there are many different models of the same axioms. Abstraction works! Generic programming assumes that there are some fundamental laws that govern the behavior of software components and that it is possible to design interoperable modules based on these laws. It is also possible to use the laws to guide our software design. STL is an example of generic programming. C++ is a language in which I was able to produce a convincing example.
Question: I think STL and Generic Programming mark a definite departure from the common C++ programming style, which I find is almost completely derived from SmallTalk. Do you agree?
Answer: Yes. STL is not object oriented. I think that object orientedness is almost as much of a hoax as Artificial Intelligence.