Department of Engineering

IT Services

    A case history: linkscan

The value of many C++ features only becomes clear when you write big programs. In March 1999 I wrote a program called linkscan. It's big enough to illustrate useful programming technique and conceptually simple enough to serve as a teaching aid. I wrote it in C (2000 lines) then rewrote it in C++ to compare the 2 languages.

linkscan checks a tree of HTML files to see if the links they contain are still valid and to check whether all the files are linked to. While it's doing this it gathers statistics. Documentation and sample output is available online.

The program needed to be

  • Portable - it's required to work on different types of Unix (at least!)
  • Efficient with time and memory - able to cope with a tree containing thousands of files
  • Customisable - flexible enough to suit various needs

Classes

Often the nouns that are used to describe what the program does correspond to the Classes. In this case, links and files are the prime candidates. It's suggested that

  • there should be a 1 to 1 correspondence between programming entities and real-world concepts
  • relationships between objects should be shown explicitly by inheritance or membership.
  • an ``is-a'' relationship should be modelled by inheritance, a ``has-a'' relationship by composition, a ``uses-a'' relation by a function call, and a ``knows-a'' by a pointer.

I wondered whether to have separate structures for internal links/files and external files. They share many fields. All the files have corresponding URLs, but not all the URLs have corresponding files, which suggests that the file class shown be derived from the URL class.

Rather than each file object containing the full name of the file, I decided that it should contain the basename plus a pointer to a directory object. This saves lots of space in situations where many files are in one directory.

So in the end I decided that URL is-a File which in turn has-a Directory. The report is the result of performing an operation on the tree and depends on the configuration file.

The file objects are going to be kept in a non-sorted order. In the C version when some have been selected to be printed out in a list, the filenames are copied into a temporary array of structs and sorted using qsort(). This isn't type-safe, and the comparison function won't be inlined as it is called by address. So in the C++ version I added the selected items to maps or multimaps.

Code Organisation

As usual, declarations should go in *.h files and definitions in *.cc files. In the C++ version I created extra *.h files to improve the code structure. Note also that the source code of templates needs to be available to any files than instantiates templates. New compilers let you use export in a source file to make a template definition available elsewhere as long as the declaration is in an include file.

Each class should have a print member function, replacing the original C functions to print names.

Testing

I created some text files (some with no links, etc) and directories (some empty). Having a configuration file was useful for testing as well as for future users.

Comparison of the C and C++ versions

The C version had 4 source files (about 40 routines) and one include file. It defined structures for 3 mappings, and structures for directory and file information. In the C++ version the mappings were done using the Standard Library and routines like printURL became member functions. String manipulation was simplified (and, for concatenation at least, probably speeded up) by using strings.

The C++ binary is 3 times the size of the C binary.