Fine Tune MPW
Volume Number: | | 6
|
Issue Number: | | 7
|
Column Tag: | | Mac Workshop
|
Fine Tuning Code With MPW
By Allen Stenger, Gardena, CA
Tuning with MPW
[Allen Stenger works on the F-15 radar software for Hughes Aircraft Co. His technical interests are software reliability, computer architecture, and computer languages. Programming the Macintosh has been his hobby for the past five years.]
The Macintosh Programmers Workshop documentation does a good job of describing how to use the performance analysis tools, but it does not explain what to do with the results. This article is a tutorial on how to tune a program with MPW, using the performance reports as a guide to which areas of the program are most in need of improvement.
For the example program we will use a dragon-curve drawer. This is an example in the Smalltalk books, and Listing 1 is a straightforward translation of the Smalltalk-80 program into SemperSoft Modula-2. This straightforward program takes about 8 seconds (477 ticks) to draw the dragon. This is not extremely slow, but one imagines that it could be done faster. In fact, we will be able to get a 16-times speedup with some simple transformations, and by being a little trickier we will get a 22-times speedup.
This is an artificial example, for a couple of reasons. First, for any imaginable use to which DrawDragon might be put, its present performance is adequate--there is no practical value to speeding it up. Second, the MPW tools would normally be applied to much larger programs, and are in some ways an overkill for this simple example.
Speculations
We will start out with some speculations on the causes of the slowness of the straightforward dragon. After doing the tuning we will revisit these and see how good our intuition was.
From inspecting the code, one might generate a list of possible problem areas such as the following:
Use of recursion. According to the folklore of programming, recursive programs are slow, and certainly the Dragon procedure seems to spend most of its time calling itself rather than doing any useful work.
Use of high-order language. Traditionally, in order to get the highest performance, one must write in assembly language. The SemperSoft compiler is a one-pass compiler, and presumably does little optimization.
Use of range-checking. This example is compiled with checks on, which in some programs can cause a significant slowdown.
Use of floating-point. This example was run on a Macintosh Plus, which has no floating-point hardware and does all floating point in software.
Graphics operations. These typically take a lot of CPU time. (There might not be much we can do about this one, since the programs whole purpose is to draw on the screen.)
Tuning -- Preparation
We can instrument the dragon program following the recipes in the MPW manual, Chapter 14 (the method is the same regardless of language, although the interface details vary slightly). Here we will only change the calling program, DrawDragon, and leave the other modules undisturbed. Since the performance measurements are done by random sampling of the program counter, the whole program will be instrumented merely by starting the measurement at the beginning of the run and ending it at the end - it is not necessary to modify each routine. The instrumented DrawDragon module is shown in Listing 2.
The performance analyzer yields an execution profile of the program, characterizing it by where it spends its time. It is usually only helpful to look at the top five routines - everything below that takes such a small amount of time that, even if we could eliminate a routine totally, it would have a negligible effect on the timing. Here, on our first run, we need only look at the #1 routine, since it takes 72.6% of the time (total time for the program is 493 ticks). This is the Pack4 routine, which is not in the Dragon program -- it is the SANE floating-point routines, as may be looked up in Inside Macintosh. So our first tuning step is to reduce the amount of time spent doing floating-point work.
Tuning -- First Pass
Some people (particularly FORTH fans) advocate doing no floating point at all, but using instead scaled integer arithmetic. For most programs, eliminating floating point altogether would be a drastic operation. For our little DrawDragon program it is not so bad, but lets see what else we could do first. Unfortunately the report does not break down the floating-point operations used, but by examining the code we see that each line drawn requires 2 integer-to-float conversions, 2 float-to-integer conversions, 3 floating-point multiplies, and 1 each sine and cosine. It seems likely that most of the floating-point time is spent on the sine and cosine, with the multiplies coming in a distant second. Further examination of the program reveals that sine and cosine are called with only 4 possible arguments (0, 90, 180, and 270 degrees), so this suggests the use of a look-up table storing the values of sine and cosine for these arguments.
Making this change and re-running the program and reports shows that this does cause a large speedup - the timing is now 86 ticks, or 5.7 times faster. The top item on the performance report is still Pack4 with 41.6%, so our next step is to try to eliminate still more floating-point operations.
Tuning -- Second Pass
After scrutinizing the improved program some more, we may remember that the reason we put in the sines and cosines in the first place was to handle drawing of lines at arbitrary angles. For the 4 angles that we actually use, the sines and cosines have integer values, so (for our particular case) there is no need for floating point anyway! We therefore make a simple re-coding of the Go procedure to do all calculations as integers. (Since the routine is now getting faster, there are not enough program-counter samples to get a good breakdown of where the time is spent, so we will also draw the dragon 10 times in a loop and divide by 10 to get the time per dragon.) Upon re-running this, we find that the time has decreased to 30 ticks (2.9 times faster), and the top 5 routines in the performance report are all QuickDraw routines and comprise 73.2% of the time. Further improvement seems to depend on inventing some way of drawing lines faster than QuickDraw can, which seems a dim hope.
Tuning -- Third Pass
Surprisingly, this is possible, in the special case we are dealing with (which has only horizontal and vertical lines). By trying some separate timing tests, it appears that our lines can be drawn in 1/3 less time by using FillRect rather than Line. Implementing this leads to a program which runs in 22 ticks (1.4 times faster), for an overall improvement over the straightforward Dragon of 22:1. This speedup requires changes to only one routine, Go. The revised Pen module is shown in Listing 3.
Speculations Inspected
The big winners among our speculations were floating point and graphics operations. Floating point took up most of the time in the original program, and by eliminating it we were able to get most of the speedup. Graphics operations take up most of the remaining time, and although we made some reduction in this, it seems unlikely that there is much more we can do. Possibly we could do something sneaky such as building the whole picture in memory without using graphics operations, then copying it to the screen with CopyBits; but this would be very complicated and perhaps not much of an improvement. (One of the hardest parts of tuning is knowing when to stop. Most programs can be improved indefinitely, but they gradually get more complicated and error-prone.)
The issue of recursion turned out to be a red herring. The total time spent in the Dragon procedure is only 3.9% of the total, and this includes the recursive calls. Recursion has a bad reputation, for two reasons. First is the bad recursion examples often given in textbooks (usually factorials or Fibonacci numbers), which can be done more simply and faster by iteration. Second, in some older languages and implementations, either there is no recursion in the language, so it must be simulated by the program (FORTRAN), or it is treated as a special case (JOVIAL, PL/I) and really is much slower because it requires a special dynamic allocation of variables rather than the normal static allocation. In more modern languages such as Pascal or Modula-2, all variables are dynamically allocated and recursive calls are no different (and therefore no slower) than other calls.
The impact of using a high-order language or of range checking varies a great deal depending on the program and on the language implementation. In this example most of the time was taken by things outside the generated code, which seems to be inherent in the particular problem being solved and not likely to be affected much by the implementation language. In the final version of the program, the Modula-2 routines take a total of only 13.3%, so re-writing in assembly or taking out the range checks could not possibly save more than this. The typical Macintosh application tends to consist mainly of calls to the Toolbox, interspersed with occasional calculations, so the timing of a tuned program tends to be dominated by the Toolbox time and does not depend greatly on the implementation language. For programs such as spreadsheets and compilers this is not true, since they really do spend a great deal of time on internal calculations, but it is true of most applications.
Conclusion
By applying the MPW performance analysis tools and making the improvements suggested by the program profile, we were able to speed up the dragon-drawer 22 times. The value of the execution profile is in telling us what not to look at. Most of a programs execution time is spent in a few very places, and we should concentrate our efforts on those few places. (In Quality Control this is called the principle of the vital few and the trivial many.) Profiles are more valuable for large programs, since there is more to ignore there. In any case, success in tuning depends on measurement.
For Further Reading
Jon Louis Bentley, Writing Efficient Programs. Prentice-Hall, 1982. An excellent book on tuning; of value both to the professional programmer and the hobbyist. Our Dragon example illustrates his principles Space-For-Time Rule 2 -- Store Precomputed Results (sine and cosine tables), and Procedure Rule 2 -- Exploit Common Cases (eliminate floating-point since not needed in our case, and use FillRect for faster horizontal and vertical line drawing). He gives many other principles which we did not have a chance to use. In the beginning of the book he goes through 10 steps of optimizing an Approximate Travelling-Salesman Tour to get an overall improvement of 17:1.
Adele Goldberg and David Robson, Smalltalk-80: The Language and Its Implementation. Addison-Wesley, 1983. The source of the Dragon program (on pp. 372-3).
Donald E. Knuth, An Empirical Study of FORTRAN Programs, Software--Practice and Experience, v. 1 (1971), pp. 105-133. Source of the term profile. This article is written from the viewpoint of a compiler developer, and considers several levels of optimization which can be applied to programs. Many examples of optimization.
Mike Morton, Faster Bitmap Rotation. MacTutor, v. 4 (1988), no. 11, pp. 86-90 (reprinted in The Definitive MacTutor, pp. 56-60). Another example of tuning, using only the total time of the routine as a measure and yielding a speedup of 3:1. Morton was careful to measure each attempted improvement to the routine; the exact method of measurement is not important, but you must measure.
Stephen Dubin, Thomas W. Moore, and Sheel Kishore, Using Regions in Medicine with C. MacTutor, v. 3 (1987), no. 10, pp. 27-31 (reprinted in The Essential MacTutor, pp. 146-150). Description of re-coding a routine to calculate the area of a region from Pascal or C to assembler, yielding a 1000:1 speedup.
James Plamondon, Finding The Area Of A Region in C (letter). MacTutor, v. 4 (1988), no. 4 (reprinted in The Definitive MacTutor, p. 699). Criticizes the previous reference for solving the wrong problem (i.e., working very hard to get a very good estimate of the area of a region drawn freehand). Gives a faster C routine for doing a less-accurate but adequate estimate. In most applications you have some leeway in the problem you solve, and you can use this to make the solution easier and faster. We did not take advantage of this in the DrawDragon program; one speedup we might have applied is to draw the line only one pixel wide (which is about twice as fast as drawing it 4 pixels wide). Bentley (reference above) also considers this in his Approximate Travelling-Salesman Tour -- since it is only an approximate solution anyway, there is little harm in going from floating-point to slightly less accurate scaled fixed point numbers.
Source Listings
Listing 1 - Straightforward Dragon
(******************************************************)
(* *)
(* file: DrawDragon.m *)
(* *)
(* Main program to test Dragon method. *)
(* *)
(* Written in SemperSoft Modula-2 v.1.1.2 *)
(* *)
(* Allen Stenger August 1989 *)
(* *)
(******************************************************)
MODULE DrawDragon;
FROM InOutIMPORT WriteLong, WriteString, Read;
FROM InsideMac IMPORT TickCount;
FROM DragonModuleIMPORT Dragon;
FROM PenIMPORT Home;
VAR
oldTime,
newTime: LONGINT;
ch : CHAR;
BEGIN
oldTime := TickCount();
Home;
Dragon( 8 );
newTime := TickCount();
WriteString( Run time is );
WriteLong( newTime - oldTime, 6 );
WriteString( -- press space to exit );
Read( ch );
END DrawDragon.
(******************************************************)
(* *)
(* file: DragonModule.d *)
(* *)
(* Implementation of Dragon method from *)
(* Goldberg and Robson, *)
(* Smalltalk-80: The Language and Its *)
(* Implementation, pp. 372-3. *)
(* *)
(* Written in SemperSoft Modula-2 v.1.1.2 *)
(* *)
(* Allen Stenger August 1989 *)
(* *)
(******************************************************)
DEFINITION MODULE DragonModule;
PROCEDURE Dragon( order : INTEGER );
END DragonModule.
(******************************************************)
(* file: DragonModule.m *)
(* Dragon method - see definition module for *)
(* description. *)
(* Written in SemperSoft Modula-2 v.1.1.2 *)
(* Allen Stenger August 1989 *)
(******************************************************)
IMPLEMENTATION MODULE DragonModule;
FROM PenIMPORT Go, Turn;
PROCEDURE Dragon( order : INTEGER );
BEGIN
IF order = 0
THEN
Go( 10 );
ELSE
IF order > 0
THEN
Dragon( order - 1 );
Turn( 90 );
Dragon( 1 - order );
ELSE
Dragon( -1 - order );
Turn( -90 );
Dragon( 1 + order );
END; (* IF *)
END; (* IF *)
END Dragon;
BEGIN
END DragonModule.
(******************************************************)
(* file: Pen.d *)
(* Implements some of the methods for the Pen class *)
(* in Smalltalk-80. *)
(* Written in SemperSoft Modula-2 v.1.1.2 *)
(* Allen Stenger August 1989 *)
(******************************************************)
DEFINITION MODULE Pen;
(*
Go in the current direction the specified distance
(units of pixels).
*)
PROCEDURE Go( distance : CARDINAL );
(*
Change the current direction by turning degrees
(positive degrees = clockwise).
*)
PROCEDURE Turn( degrees : INTEGER );
(*
Move to original pen position.
*)
PROCEDURE Home;
END Pen.
(******************************************************)
(* file: Pen.m *)
(* Pen methods - see definition module for *)
(* descriptions. *)
(* Written in SemperSoft Modula-2 v.1.1.2 *)
(* Allen Stenger August 1989 *)
(******************************************************)
IMPLEMENTATION MODULE Pen;
FROM Terminal IMPORT ClearScreen;
FROM InOutIMPORT Write;
FROM MathLib0 IMPORT sin, cos, entier, real, pi;
FROM InsideMac IMPORT Line, MoveTo, PenSize;
CONST
DegreesToRadians = pi / 180.;
(* conversion factor *)
VAR
currentDegrees : [0..359];
(* direction we are
facing -- 0 = right. *)
PROCEDURE Go( distance : CARDINAL );
VAR
currentRadians : REAL;
realDistance : REAL;
BEGIN
currentRadians := DegreesToRadians
* real( currentDegrees );
realDistance := real( distance );
Line( entier( realDistance
* cos( currentRadians ) ),
entier( realDistance
* sin( currentRadians ) )
);
END Go;
PROCEDURE Turn( degrees : INTEGER );
VAR
tempDegrees: INTEGER;
BEGIN
tempDegrees := INTEGER(currentDegrees) + degrees;
DEC( tempDegrees, 360 * ( tempDegrees DIV 360 ) );
IF tempDegrees < 0
THEN INC( tempDegrees, 360 );
END; (* IF *)
currentDegrees := tempDegrees;
END Turn;
PROCEDURE Home;
BEGIN
MoveTo( 110, 200 );
currentDegrees := 270; (* facing up *)
END Home;
BEGIN
ClearScreen;
Write( 0C ); (* graphics initialization kludge *)
PenSize( 4, 4 );
END Pen.
Listing 2 - Instrumented DrawDragon module
(******************************************************)
(* Main program to test Dragon method. *)
(* This version includes MPW performance analyzer *)
(* calls. *)
(* Written in SemperSoft Modula-2 v.1.1.2 *)
(* Allen Stenger August 1989 *)
(******************************************************)
MODULE DrawDragon;
FROM InOutIMPORT WriteLong, WriteString, Read;
FROM InsideMac IMPORT TickCount;
FROM PerformIMPORT InitPerf, PerfControl,
PerfDump, TermPerf,
TP2PerfGlobals;
FROM DragonModuleIMPORT Dragon;
FROM PenIMPORT Home;
VAR
oldTime,
newTime: LONGINT;
ch : CHAR;
thePerfGlobals : TP2PerfGlobals;
junk : BOOLEAN;
BEGIN
(* Initialize performance measurement *)
thePerfGlobals := NIL;
IF InitPerf(
thePerfGlobals, (* measurement block *)
20, (* sample interval *)
8,(* bucket size *)
TRUE, (* measure ROM *)
TRUE, (* measure application *)
CODE,(* resource type to
measure *)
0,(* ROM ID *)
, (* ROM name *)
FALSE, (* measure RAM misses *)
0,0,0 (* for RAM misses *)
)
THEN
ELSE WriteString( Initialization failed );
END; (* IF *)
(* Start performance measurement *)
junk := PerfControl( thePerfGlobals, TRUE );
oldTime := TickCount();
Home;
Dragon( 8 );
newTime := TickCount();
(* End performance measurement *)
IF 0 = PerfDump( thePerfGlobals,
DrawDragon.dump,(* dump file for
results *)
FALSE, (* histograms *)
0 (* histograms *)
)
THEN
ELSE WriteString( PerfDump failed );
END; (* IF *)
TermPerf( thePerfGlobals );
WriteString( Run time is );
WriteLong( newTime - oldTime, 6 );
WriteString( -- press space to exit );
Read( ch );
END DrawDragon.
Listing 3 - Tuned Pen module
(******************************************************)
(* file: Pen.m *)
(* Pen methods - see definition module for *)
(* descriptions. *)
(* This version contains improvements suggested by *)
(* running MPW performance analysis tools. *)
(* Written in SemperSoft Modula-2 v.1.1.2 *)
(* Allen Stenger August 1989 *)
(******************************************************)
IMPLEMENTATION MODULE Pen;
FROM Terminal IMPORT ClearScreen;
FROM InOutIMPORT Write;
FROM InsideMac IMPORT FillRect, SetRect;
FROM InsideMac IMPORT black, Rect;
VAR
currentDegrees : [0..359];
(* direction we are
facing -- 0 = right. *)
currentH,
currentV : CARDINAL;(* where situated *)
PROCEDURE Go( distance : CARDINAL );
CONST
LineSize = 4;
VAR
lineRect : Rect;
BEGIN
CASE currentDegrees OF
0:SetRect( lineRect, currentH, currentV,
currentH + distance + LineSize,
currentV + LineSize );
INC( currentH, distance ); |
90: SetRect( lineRect, currentH, currentV,
currentH + LineSize,
currentV + distance + LineSize
);
INC( currentV, distance ); |
180: SetRect( lineRect, currentH - distance,
currentV,
currentH + LineSize,
currentV + LineSize );
DEC( currentH, distance ); |
270: SetRect( lineRect, currentH,
currentV - distance,
currentH + LineSize,
currentV + LineSize );
DEC( currentV, distance );
END; (* CASE *)
FillRect( lineRect, black );
END Go;
PROCEDURE Turn( degrees : INTEGER );
VAR
tempDegrees: INTEGER;
BEGIN
tempDegrees := INTEGER(currentDegrees) + degrees;
DEC( tempDegrees, 360 * ( tempDegrees DIV 360 ) );
IF tempDegrees < 0
THEN INC( tempDegrees, 360 );
END; (* IF *)
currentDegrees := tempDegrees;
END Turn;
PROCEDURE Home;
BEGIN
currentH := 110;
currentV := 200;
currentDegrees := 270; (* facing up *)
END Home;
BEGIN
ClearScreen;
Write( 0C ); (* graphics initialization kludge *)
END Pen.