*  Using the Optimiser, Recoding, using the `Fast *
        *  Malloc' library and Profiling.                 *

                         ****** -O? *******
  There may be no way to "fix" the optimizer, but rather just choose
the most appropriate optimization for the function.  Lex and yacc outputs
are particularly prone to optimization time warps; so specify +O1 for them.

                         ******* Recoding ********

Techniques used are:  loop unrolling, localized use of register
variables, count down loops, and shift and modulo operators instead
of arithmetic operators.

Some general notes:

There are two major ways to unroll a loop with a variable remainder.
The method used here is to use two loops, one for the remainder and
one for the bulk.  The other method is to jump into a loop inside a
switch (legal but bizarre.)  The second method has the advantage of
not needing the "short" loop for the remainder.  However timing studies
show no real difference between the two methods.  The reason is that
the overhead of setting up the switch balances off with the higher cost
of the short loop.

Given two loops, one which is unrolled by a constant, and a short loop
handling the remainder, a loop unrolling factor of 8 is optimal under
most circumstances.

A count down loop saves several instructions in a CISC machine,
provided that (a) the loop variable is a register variable, and (b)
the optimation option is selected.  The savings are due to (1) the
comparison is against 0 rather than a constant, (2) CISC machines
typically have a branch on result of comparison, (3) optimization
and register declaration avoids a store of the loop variable.  For
some compilers there is also a savings by placing the calculation
of the limit in the first clause of the loop when the limit is a
complex expression.  It should be noted that the savings are only
signifigant in tight loops.

A count down loop should have one of the following two forms:

	for (index=(expression)+1;--index>0;) {body}
	for (index=(expression);--index>=0;) {body}

Plum Hall uses the first form; I use and recommend the second as
being clearer.  The second form will fail (infinite loop) if the
index is unsigned.  However the savings will be lost in any case
if the index is unsigned.  These forms should be used as is; the
loop optimization is lost if you don't predecrement in the test.

The macros do not include or exploit an alignment test on the grounds
that (a) the resulting code is not portable, (b) the overhead of the
tests removes the savings in short and medium length copies, (c)
many aligned copies can be caught by passing the proper type, and
(d) I didn't feel like coding it.

              ******** MALLOC ********

Some machines (yours included) have an alternate malloc library available.
If you use it you may see your application run twice as fast (no kidding).

With some of the older versions of malloc, all allocated blocks remained
linked in a single linear linked list.  Each time a new block is allocated,
all previously-allocated blocks are checked to see if they are now free
and big enough to satisfy the request.  X, and the intrinsics, and Motif,
all allocate lots and lots and Lots and LOTS of little chunks of memory.
Not only does that make the list very long, but think of how it affects
paging performance to fondle 98% of your data pages every other millisecond!

A/UX also suffers from an old malloc (although their alternate malloc
library is buggy).  Sun machines seem to have a decent standard malloc.

                        ****** PROF ********

"prof" should work pretty well if you compile all your sources with
"cc -p" and if you also LINK with "cc -p" in order to get function
counts for library routines.  If you have a good version of "prof"
(found on fewer and fewer systems as time goes on), you can obtain
a nice plot of an integral histogram over code space, which is very
handy for spotting "hot spots" where a lot of execution time is spent.

"Real time" depends on system load and other external factors, so it
is a poor indicator unless you are able to run your application on
an entirely idle system (no background daemons, spoolers, etc.).
"User time" is time spent executing in "user mode", which means within
the application proper, exclusive of system calls.  "System time" is
that spent by the operating system kernel on behalf of your process as
the result of system calls that the process makes.

The times vary slightly due to clock sampling (quantization) errors
and inherent inaccuracies in assigning time for certain system
services.  Note also that if the process invoked subprocesses, "time"
will include the subprocess times if and only if the subprocesses
terminate before the top-level process.

There are many methods for improving code efficiency.  Sometimes the
algorithm being used is not the most efficient method of solving the
problem, and changing to a better algorithm could speed things up
far more than any amount of low-level tweaking.  If the overall
algorithms are optimal, then once you have located the bottleneck(s)
via profiling you can consider detailed low-level ways to speed them
up.  There are many standard tricks, but sometimes it is possible to
dream up a new way to improve speed.  Jon Bentley's books are a good
starting point for learning such methods.

gprof is very good -- not only does it show you which routines are
gobbling CPU, but where they're called from so you can identify why
they're gobbling CPU.  (As in, "who is calling malloc() a million times
and why?")  It also produces a flat prof(1) style profile.  [For those
with systems without gprof, gcc supports it, and the GNU binutils come
with gprof]

Paul Haahr posted a lovely tool called lcomp to comp.sources.unix back
in October 1989 (Volume 20, Issue 82).  It runs on Vaxen and Sun3s
(should work on most of 68k family with pcc compilers) that does
instruction counting (based on Peter Weinberger's paper on "Cheap
dynamic instruction counting" in Bell Labs Tech J, OCt 84 -- the gold
book) so you can see how often each line of source was executed.  Suns
also have something called tcov that does something similar -- I
remember trying it and preferring lcomp.

gprof gets you down to the function level, lcomp helps peer inside the
function. One can always emulate lcomp by sticking in counters by hand
in the appropriate places.

Working out the order of a program's run time is a little difficult
for large programs.  One can try to approximate times for typical
programs by finding the most executed code, (gprof is a great help)
and working out rough relationships for the input and the time taken.
It's a good way to figure out what the code is doing and why it's
taking so long doing it.

Jon Bentley's "Writing efficient programs" (Prentice-Hall) is
recommended reading.  I liked Dennie Van Tassel's "Program Design,
Style, Efficiency and Testing" (?).  And you can always pick up the C
News paper from Usenix '87 (Winter?) or by anon. ftp from
cs.toronto.edu (doc/programming/cnews.{tbl.ms,doc,ps}) -- has some
interesting lessons on speeding up system programs.

If you work under UNIX, you may use cat, cp or dd to copy your
input file to /dev/null, resulting in a time I call "r_time".
Next take a file of the length of your output-file an do the
same, resulting in a time I call "r0_time". Finally just copy the
second file (not to /dev/null!), resulting in a time I call
"w_time". Now, the lowest bound - without any conversion -
for your problem is around "r_time + w_time - r0_time". (Here
I assumed that standard utilities like the one I mentioned
are well adapted to the system - eg. they use Standard-I/O
with a blocksize that matches the appropriate values for the
given file system - an assumption which may sometimes be wrong!)

Under UNIX prof is a good starting point. Note that prof gathers
the timing by statistical methods: At run-time in fixed intervals
the value of the program counter is sampled (only for profiled
programs, of course). This value is scaled and mapped into cells
of an array of counters, which is located in a special run-time
start-up module that is linked to your program if "-p" is specified
at this step. The special run-time start-up writes this array later
to the "mon.out" file. With the command prof the "mon.out" is mapped
mapped back to locations in the actual functions of the programs, using
the information of the name tables of "a.out" that are normally left
there by the linker for the purpose of debugging.

It's a little bit like if you stand up every sixty minutes from your
desk, walk to the window and make some notes of what you see outside.
After doing this for several weeks, your notes may be a good
approximation of what happens in the street outside your window.
But you may well have missed important events. Especially, if you
do this *exactly* every full hour, then noting the positions
of the hands of a clock out on the street will give a very wrong
impression of the outside world (if you base a description of the
outside world exclusively on your notes, not on common experience
with clocks :-)).

You should allways look to the results of prof with this in mind.
Chances are that very small functions are missed or timings are
mapped not to the true culprit, but in general you will get a good
approximation (Especially those "stroboscobic" effects I described at
the end of the last paragraph will occur very very very seldom.)

If you use prof you should also have a look at the number of calls
for every function. If you compile with the "-p" switch from the
source to the object, the compiler inserts a little extra code at
the start of each function to count the number of calls; you will
normally not have this extra code in the standard library, so that
you will see #calls-counts only for functions you've written yourself.
But on most systems there is a special compiled standard library
available. The usual name for this library is /lib/libc_p.a so you
can simply add "-lc_p" at the end of the command line in which you
call the link phase to produce the executable program, and all the
standard functions are taken from this special library instead from
the regular one.

As prof prints the number of calls for all functions, if a function seems
to consume much time but has a low number of calls (and vice versa)
this must not be wrong but should look suspicious to you and deserve
further investigation. Same if the order of the function in the source
file or the order in which the object files are given to the linker
change the timings substantially.

The user (and sys) time are also meassured with a (very similar)
statistical method as described above for the profiling (the only
difference is that no special run-time start-up module is required
as the program counter is not sampled; only the fact if the process
happens to be in "user-" or "system-mode" is noted with the time.)
The user(-mode) time accounts for the code *you* have written. You
have influence on this time by changing your code. The system time
accounts for the code the kernel spends for request from your
program(%). To change this time execute fewer system calls or make
the system calls have less work (or hire some people to rewrite
the kernal for you; as you need the kernal-source for this, you
should be prepared to have some $$ left to buy it from AT&T - I'd
say some hundredthousands should usually suffice :-)).

%: I'm aware that there is a small amount of error in this time due
to activity caused by interrupts. But this will only be of importance
in very special situations and need not be considered here.

	"Writing Efficient Programs" by Jon Bentley
	"Efficient C" by Thomas Plum and Jim Brodie


There are 3 levels of optimisation:-
	0 is always done.
	1 is selected using the +O1 compile flag.
	2 can be selected the  -O compile flag.

The optimiser assumes that only the current program will change
globals. If this isn't so you can use the `volatile' keyword to
alert the compiler to this fact. Eg
   volatile float fl
The compiler flag +OV makes all globals volatile, making the 
optimiser far more conservative.

Files or parts of files can be optimised at different levels using
#pragma OPT_LEVEL 1 (or 2)

By default, the compiler assumes that functions might alter global
variables. If you are sure that some functions don't alter variables, 
you can give the optimiser a hand by using
#pragma NO_SIDE_EFFECTS fn1 [,fn2]

NB: opted code on Release 6.5 and beyond assumes that fp regs %fp2-%fp7 
and %fpa3-%fpa15 aresaved/restored by the called routines.

These are the optimisations performed. The optimiser needs to be
multi-passing because often the more trivial operations provide
new opportunities for other operations. level 2 optimisation can
be time consuming, so only use it at pre-release stage. Level2
optimisations also increase the risk of bugs.   
      0				1			2
constant folding 	constant folding 	constant folding
simple reg assignment	simple reg assignment 	optimal reg assignment
                        peephole		peephole
						constant propagation
						deadcode elimination
						common sub expr elim
						deadstore elim
						copy propagation
						loop unroll

constant folding    : secs = 60*60      => secs=3600

peephole            : some standard short assembler code sequences compressed

constant propagation: a=1; b=a+1        => a=1; b=2;

deadcode elimination: code that can't be reached is removed. Some optimisers
                      remove code they think does nothing 
                      (eg for(i=0;i<10000;i++); ) but I think the HP C
                      optimiser is less severe.

common sub expr elim: a=x*y+z; b=x*y+w  => temp =x+y; a=temp+z; b=temp+w;

deadstore elim      : removes variables that aren't used

copy propagation    : {int a=3; b=a+1}  => {b=4}

loop unroll         : short loops are replaced by a list of explicit statements

better register use : rather than using the `register' hints in the
                      source, an analysis of the code is performed and
                      maximum use of registers made.