Compiler Compare
Volume Number: | | 5
|
Issue Number: | | 8
|
Column Tag: | | Jörg's Folder
|
Compiler Comparison
By Jörg Langowski, MacTutor Editorial Board
Note: Source code files accompanying article are located on MacTech CD-ROM or source code disks.
Code optimization
You will have noticed the change in the columns title: a recent reader survey has shown that Forth and Basic are the two languages that our readers would most like to see less of in MacTutor. Thats a shame, but were responsive (at least we try)
So, from now on my monthly column will have a wider scope. As you might have seen, I have very often used Forth as a vehicle to explain general concepts of Macintosh programming. Since many subscribers dont seem to be happy with Forth - in fact, people often have asked about things that had been explained in a Forth column, but they just hadnt read. The column about the Notification Manager in V5#6 is a good example: at the bottom of page 42, one of the ideas to be written was explained as: An example that places a Notification Manager request in low System memory, and starts a Timer routine ; this is just the example that was in the Forth Forum that covered the Notification Manager. Anyway, Ill try to use other vehicles to convey the message from now on. Such as assembly, or maybe even C. Mach2 Forth still is a very good assembly-language development system because of its interactivity. Of course, you cannot create complicated macros, or structures, and have to resort to Forth code for those purposes. Ill still inform you about interesting things I come across on the Forth scene, but this wont be an exclusive Forth column anymore. Emphasis will be on two things: basic system-level things such as drivers, trap patches, INITs, network, new managers as they come up; and - on the other side of the spectrum - object-oriented programming in C++.
Apple will Real Soon Now release a C++ under MPW, and well hopefully have a pre-release by the time you read this. C++ is a very interesting language, much more than a simple extension of C; reading Stroustrups book I got this feeling of yes, thats how one should have done it in the first place that I had 15 years ago when all I knew was Algol 60, and came across the description of Algol 68. Sadly enough, Algol 68 never really caught on; hopefully, C++ will. The C++ column will start with the next issue; including program examples if we get the pre-release soon enough, just dry swimming if not.
This month, well talk once more about one of my favorite subjects, number crunching, speed (or the lack of it), and intelligence in compilers (or the lack of it).
A matrix multiplication routine
Since I am doing more and more everyday computation (mostly Fortran) on the MacII, Im obviously interested in a good optimizing compiler. Now, a standard trick that every decent compiler should have in its repertoire is the elimination of constant expressions from loops, or assignment of array elements that are not dependent on the loop index to intermediate variables.
Imagine my surprise when I found out that (in Language Systems Fortran 1.2.1) I could speed up a loop that looked like this:
do k=1,n3
do i=1,n1
c(i,k) = 0x0
do j=1,n2
c(i,k) = c(i,k)+a(i,j)*b(j,k)
end do
end do
end do
by simply writing:
do k=1,n3
do i=1,n1
sum = 0x0
do j=1,n2
sum = sum+a(i,j)*b(j,k)
end do
c(i,k) = sum
end do
end do
Now, in undergraduate programming classes, years ago, we were actually taught to look for constant arithmetical or index expressions in loops and put them outside if possible. Today, almost everybody assumes that the compiler is smart enough to take care of that; incorrectly, as you see. To see how good the compilers available under MPW can do, I wrote a Fortran program (listing 1) that calls several versions of this matrix multiplication loop, written in Fortran (Lang. Sys. 1.2.1), Pascal (Apple 3.0 beta), and C (Apple 3.0 beta). Surprise: none of the compilers was good enough to move the indexing outside of the loop. The following table gives the results (Mac IIx):
Pascal, hand-optimized: 2.7667 seconds
C, register variables, hand-opt.: 3.4667 seconds
Pascal: 4.0333 seconds
Fortran, const. dimensions, opt=3: 4.5000 seconds
Fortran, hand-optimized, opt=3: 4.6500 seconds
C: 4.7167 seconds
Fortran, opt=3: 6.5167 seconds
Fortran, opt=0: 6.6167 seconds
A difference of more than a factor of 2 between the slowest Fortran and the fastest Pascal code. Apple Pascal lived up to its good reputation here, but even that could be improved a lot by eliminating the constant index expression.
Surprised, I ran the Fortran benchmark on a Microvax II, and found that even there some speed could be gained by hand-optimizing the code:
Fortran, plain: 3.6333 seconds
Fortran, hand-opt.: 3.2833 seconds
Fortran, const. dimensions: 3.1000 seconds
However, the difference between the machine-optimized and the hand-optimized version is not quite as big as for the MPW languages (15% for the VAX vs. 27-30% for MPW). If you compile the VAX code without optimization, you get a bigger difference (23%):
Fortran, plain: 6.2500 seconds
Fortran, hand-opt.: 4.8833 seconds
Fortran, const. dimensions: 4.8000 seconds
Therefore, take-home lesson one: dont take compiler optimizations for granted.
The machine code behind it
Benchmarks have been run on lots of different machines, using lots of different compilers. I was interested in how the code generated by the MPW compilers actually differed. A job for Nosy, and the results are shown in the last listing. Ive only printed the innermost loops. Dont be overwhelmed by the pile of assembly code, just note some important details.
First, for the loop optimization examples discussed here, there seems to be no tradeoff between code length and speed. On the contrary, the fastest code is also the shortest. On the other hand, there are some obvious pieces of code which are clearly redundant. The most blatant example is the Fortran-generated code at the end of the listing, where an index expression is recalculated that was actually in register A1 all the time! 14 extra lines of machine code on each pass through the loop will add up to quite some extra time lost. Another point is that Language System obviously has no great trust in the quality of the 68000/20/30, otherwise how can one explain that they repeat the EXT.L D2 instruction each time it occurs? To make sure it works at least once?
Language Systems Fortran makes other funny assumptions about the machine, for instance it seems to think there are only two floating point registers in the 68881, FP0 and FP7. I have looked at some code which had great potential for optimization by using enough FP registers. Language Systems is, however, known for its responsiveness towards customers, so I hope we wont have to wait too long until a well-optimized Fortran shows up.
Both Pascal and C like juggling floating point registers. Why generate (like Apples C):
FMOVE FP7,FP1
FADD FP0,FP1
FMOVE FP1,FP7
when a simple FADD FP0,FP7 would suffice? Eliminates two floating point instructions per loop. Pascal does
FADD FP7,FP0
FMOVE FP0,FP7
when a simple inversion of the operands
FADD FP0,FP7
would give the same result. One floating point instruction per loop eliminated. The timing difference between the Pascal and C routines is partly because of the one extra floating point instruction.
Last remark: I havent seen the Absoft MPW 3.0 Fortran yet. If anyone from Absoft is reading this, Id like an evaluation copy to run the same analysis (since you claim in your ads you have such a great optimizer). If I get enough other languages collected together, well have a follow-up on this article.
Next month
The MacHack is over (thanks, Aimée, Carol, and all the others, for organizing such a good meeting), and Ill tell you some of my impressions in the next column. Otherwise, well start with an introduction to C++; I hope the compiler will arrive here in time.
Listing 1: Matrix multiplication benchmark
!!S Main
program matrix
c
c Main program in Language Systems Fortran
c
c Some line breaks in the Fortran program are due to
c editing.
c
implicit none
integer i,j,ticks1,ticks2
extended a(50,50), b(50,50), c(50,50)
extended time1,time2
integer ticks
type *,Matrix multiplication benchmark
type *,------------------------------
type *
type *,This program compares the number crunching power
type *,of some of the popular MPW compilers.
type *,Written under MPW 3.0 by J. Langowski / MacTutor 1989'
type *
type *,Setting up 50x50 matrices...
ticks1 = ticks()
do i=1,50
do j=1,50
a(i,j) = (i-1) + j-1
b(j,i) = a(i,j)
end do
end do
ticks2 = ticks()
time1 = (ticks2-ticks1)/60.
type *
write (6,(f8.4, seconds for setting up matrices)) time1
ticks1 = ticks()
call mat_mult_for3(c,50,a,50,b,50,50,50,50)
ticks2 = ticks()
time1 = (ticks2-ticks1)/60.
type *
write (6,(f8.4, seconds for multiplying matrices,
* using FORTRAN routine, opt=3')) time1
type *,c(25,25) = ,c(25,25)
ticks1 = ticks()
call mat_mult_for(c,50,a,50,b,50,50,50,50)
ticks2 = ticks()
time1 = (ticks2-ticks1)/60.
type *
write (6,(f8.4, seconds for multiplying matrices,
* using FORTRAN routine, opt=0')) time1
type *,c(25,25) = ,c(25,25)
ticks1 = ticks()
call mat_mult_for1(c,50,a,50,b,50,50,50,50)
ticks2 = ticks()
time1 = (ticks2-ticks1)/60.
type *
write (6,(f8.4, seconds for multiplying matrices,
* using FORTRAN routine, hand-optimized)) time1
type *,c(25,25) = ,c(25,25)
ticks1 = ticks()
call mat_mult_for0(c,50,a,50,b,50,50,50,50)
ticks2 = ticks()
time1 = (ticks2-ticks1)/60.
type *
write (6,(f8.4, seconds for multiplying matrices,
* using FORTRAN routine, constant dimensions)) time1
type *,c(25,25) = ,c(25,25)
ticks1 = ticks()
call mat_mul_pas(c,%val(50),a,%val(50),b,%val(50),%val(50),%val(50),%val(50))
ticks2 = ticks()
time1 = (ticks2-ticks1)/60.
type *
write (6,(f8.4, seconds for multiplying matrices,
* using PASCAL routine)) time1
type *,c(25,25) = ,c(25,25)
ticks1 = ticks()
call mat_mul_pas_opt(c,%val(50),a,%val(50),b,%val(50),%val(50),%val(50),%val(50))
ticks2 = ticks()
time1 = (ticks2-ticks1)/60.
type *
write (6,(f8.4, seconds for multiplying matrices,
* using PASCAL routine, hand-optimized)) time1
type *,c(25,25) = ,c(25,25)
ticks1 = ticks()
call mat_mul_c(c,%val(50),a,%val(50),b,%val(50),%val(50),%val(50),%val(50))
ticks2 = ticks()
time1 = (ticks2-ticks1)/60.
type *
write (6,(f8.4, seconds for multiplying matrices,
* using C routine)) time1
type *,c(25,25) = ,c(25,25)
ticks1 = ticks()
call mat_mul_c_opt(c,%val(50),a,%val(50),b,%val(50),%val(50),%val(50),%val(50))
ticks2 = ticks()
time1 = (ticks2-ticks1)/60.
type *
write (6,(f8.4, seconds for multiplying matrices,
* using C routine, hand-optimized)) time1
type *,c(25,25) = ,c(25,25)
end
!!S Main
subroutine mat_mult_for3(c,nc,a,na,b,nb,n1,n2,n3)
c sets c=a.b
c na,nb,nc are first dimensions
c n1 n2 n3 are problem dimensions
c c is n1xn3
c a n1 n2
c b n2 n3
implicit none
integer na,nb,nc,n1,n2,n3
integer*2 i,j,k
extended c(nc,n3),a(na,n2),b(nb,n3)
do k=1,n3
do i=1,n1
c(i,k) = 0x0
do j=1,n2
c(i,k) = c(i,k)+a(i,j)*b(j,k)
end do
end do
end do
return
end
!!S Main
subroutine mat_mult_for1(c,nc,a,na,b,nb,n1,n2,n3)
c
csame as before, invariant matrix element eliminated from loop
c
implicit none
integer na,nb,nc,n1,n2,n3
integer*2 i,j,k
extended c(nc,n3),a(na,n2),b(nb,n3)
extended sum
do k=1,n3
do i=1,n1
sum = 0x0
do j=1,n2
sum = sum+a(i,j)*b(j,k)
end do
c(i,k) = sum
end do
end do
return
end
!!S Main
subroutine mat_mult_for0(c,nc,a,na,b,nb,n1,n2,n3)
c
csame as before, with constant dimensions
c
implicit none
integer na,nb,nc,n1,n2,n3
integer*2 i,j,k
extended c(50,50),a(50,50),b(50,50)
extended sum
do k=1,n3
do i=1,n1
sum = 0x0
do j=1,n2
sum = sum+a(i,j)*b(j,k)
end do
c(i,k) = sum
end do
end do
return
end
!!S Main
integer function ticks
ticks = long(362)
return
end
Listing 2 : non-optimized Fortran routine
!!S Main
subroutine mat_mult_for(c,nc,a,na,b,nb,n1,n2,n3)
c
cduplicate of mat_mul_for3 for compiling without optimization
c
implicit none
integer na,nb,nc,n1,n2,n3
integer*2 i,j,k
extended c(nc,n3),a(na,n2),b(nb,n3)
do k=1,n3
do i=1,n1
c(i,k) = 0x0
do j=1,n2
c(i,k) = c(i,k)+a(i,j)*b(j,k)
end do
end do
end do
return
end
Listing 3 : Pascal routine
{$S Main}
{$R-}
unit matmul;
interface
type matrix = array [1..50,1..50] of extended;
procedure mat_mul_pas
(var c : matrix; nc : longint;
var a : matrix; na : longint;
var b : matrix; nb : longint;
n1,n2,n3:longint);
procedure mat_mul_pas_opt
(var c : matrix; nc : longint;
var a : matrix; na : longint;
var b : matrix; nb : longint;
n1,n2,n3:longint);
implementation
procedure mat_mul_pas;
var
i,j,k:integer;
begin
for k:=1 to n3 do
for i:=1 to n1 do
begin
c[i,k] := 0;
for j:=1 to n2 do
c[i,k] := c[i,k]+a[i,j]*b[j,k];
end;
end;
procedure mat_mul_pas_opt;
var
i,j,k:integer; sum:extended;
begin
for k:=1 to n3 do
for i:=1 to n1 do
begin
sum := 0;
for j:=1 to n2 do
sum := sum+a[i,j]*b[j,k];
c[i,k] := sum;
end;
end;
end.
Listing 4 : C routine
pascal void mat_mul_c
(extended c[50][], long nc,
extended a[50][], long na,
extended b[50][], long nb,
long n1, long n2, long n3)
{
int i,j,k;
for ( k=1 ; k <= n3; k++ )
for ( i=1 ; i <= n1 ; i++ )
{
c[i][k] = 0.0;
for ( j=1 ; j <= n2 ; j++ )
c[i][k] = c[i][k]+a[i][j]*b[j][k];
}
}
pascal void mat_mul_c_opt
(extended c[50][], long nc,
extended a[50][], long na,
extended b[50][], long nb,
long n1, long n2, long n3)
{
register int i,j,k;
register extended sum;
for ( k=1 ; k <= n3; k++ )
for ( i=1 ; i <= n1 ; i++ )
{
sum = 0.0;
for ( j=1 ; j <= n2 ; j++ )
sum = sum+a[i][j]*b[j][k];
c[i][k] = sum;
}
}
Listing 5 : inner loops compared, Nosy-disassembled
pascal, optimized
lan_3 MOVEA.L param2(A6),A0
MOVE D6,D0
MULS #$258,D0
MOVE D5,D1
MULS #12,D1
ADD D1,D0
MOVEA.Lparam3(A6),A1
MOVE D5,D1
MULS #$258,D1
MOVE D7,D2
MULS #12,D2
ADD D2,D1
LEA -$264(A0),A0
FMOVE.X0(A0,D0.W),FP0
LEA -$264(A1),A0
FMUL.X 0(A0,D1.W),FP0
FADD FP7,FP0 ; could use
FMOVE FP0,FP7 ; FADD FP0,FP7 here
ADDQ #1,D5
BVS.S lan_5
lan_4 CMP.W van_1(A6),D5
BLE lan_3
c, optimized
lar_1 MOVE.LD7,D0
MOVE.L D0,D1
MULU #12,D0
SWAP D1
MULU #12,D1
SWAP D1
CLR D1
ADD.L D1,D0
ADD.L D6,D0
MOVE.L D5,D1
MOVE.L D1,D2
MULU #12,D1
SWAP D2
MULU #12,D2
SWAP D2
CLR D2
ADD.L D2,D1
ADD.L D7,D1
FMOVE.X 0(A3,D0.L),FP0
FMUL.X 0(A4,D1.L),FP0
FMOVE FP7,FP1
FADD FP0,FP1
FMOVE FP1,FP7
ADDQ.L #1,D7
lar_2 CMP.L D7,D4
BGE lar_1
pascal, plain
lam_3 MOVEA.L param2(A6),A0
MOVE D6,D0
MULS #$258,D0
MOVE D5,D1
MULS #12,D1
ADD D1,D0
MOVEA.L param3(A6),A1
MOVE D5,D1
MULS #$258,D1
MOVE D7,D2
MULS #12,D2
ADD D2,D1
LEA -$264(A0),A0
FMOVE.X 0(A0,D0.W),FP0
LEA -$264(A1),A0
FMUL.X 0(A0,D1.W),FP0
MOVE D6,D0
MULS #$258,D0
MOVE D7,D1
MULS #12,D1
ADD D1,D0
LEA -$264(A4),A0
FADD.X 0(A0,D0.W),FP0
MOVE D6,D0
MULS #$258,D0
MOVE D7,D1
MULS #12,D1
ADD D1,D0
LEA -$264(A4),A0
FMOVE.X FP0,0(A0,D0.W)
ADDQ #1,D5
BVS.S lam_5
lam_4 CMP.W vam_1(A6),D5
BLE lam_3
c, plain
lao_3 MOVE.LD5,D0
MOVE.L D0,D1
MULU #12,D0
SWAP D1
MULU #12,D1
SWAP D1
CLR D1
ADD.L D1,D0
ADD.L D6,D0
MOVE.L D7,D1
MOVE.L D1,D2
MULU #12,D1
SWAP D2
MULU #12,D2
SWAP D2
CLR D2
ADD.L D2,D1
ADD.L D6,D1
MOVEA.L param3(A6),A0
MOVE.L D5,D2
MOVE.L D2,D3
MULU #12,D2
SWAP D3
MULU #12,D3
SWAP D3
CLR D3
ADD.L D3,D2
ADD.L D7,D2
FMOVE.X 0(A4,D1.L),FP0
FMUL.X 0(A0,D2.L),FP0
FADD.X 0(A3,D0.L),FP0
MOVE.L D5,D0
MOVE.L D0,D1
MULU #12,D0
SWAP D1
MULU #12,D1
SWAP D1
CLR D1
ADD.L D1,D0
ADD.L D6,D0
FMOVE.X FP0,0(A3,D0.L)
ADDQ.L #1,D7
lao_4 CMP.L D7,D4
BGE lao_3
Fortran, optimized
lah_3 MOVE-172(A6),D2
EXT.L D2
EXT.L D2
SUB.L -142(A6),D2
MULS.L #12,D2
MOVE.L D2,D0
MOVE -170(A6),D2
EXT.L D2
EXT.L D2
SUB.L -130(A6),D2
MULS.L -134(A6),D2
ADD.L D0,D2
MOVEA.L 32(A6),A0
ADDA.L D2,A0
FMOVE.X (A0),FP7
MOVE -170(A6),D2
EXT.L D2
EXT.L D2
SUB.L -118(A6),D2
MULS.L #12,D2
MOVE.L D2,D1
MOVE -168(A6),D2
EXT.L D2
EXT.L D2
SUB.L -106(A6),D2
MULS.L -110(A6),D2
ADD.L D1,D2
MOVEA.L 24(A6),A1
ADDA.L D2,A1
FMUL.X (A1),FP7
FADD.X -94(A6),FP7
FMOVE.X FP7,-94(A6)
ADDQ #1,-170(A6)
SUBQ.L #1,D5
BGT lah_3
Fortran, plain
lae_3 MOVE-164(A6),D2
EXT.L D2
EXT.L D2
SUB.L -134(A6),D2
MULS.L #12,D2
MOVE.L D2,D0
MOVE -162(A6),D2
EXT.L D2
EXT.L D2
SUB.L -122(A6),D2
MULS.L -126(A6),D2
ADD.L D0,D2
MOVEA.L 32(A6),A0
ADDA.L D2,A0
FMOVE.X (A0),FP7 ; get a(i,k)
MOVE -162(A6),D2
EXT.L D2
EXT.L D2
SUB.L -110(A6),D2
MULS.L #12,D2
MOVE.L D2,D1
MOVE -160(A6),D2
EXT.L D2
EXT.L D2
SUB.L -98(A6),D2
MULS.L -102(A6),D2
ADD.L D1,D2
MOVEA.L 24(A6),A1
ADDA.L D2,A1
FMUL.X (A1),FP7 ; multiply by b(i,k)
MOVE -164(A6),D2
EXT.L D2
EXT.L D2
SUB.L -158(A6),D2
MULS.L #12,D2
MOVE.L D2,D1
MOVE -160(A6),D2
EXT.L D2
EXT.L D2
SUB.L -146(A6),D2
MULS.L -150(A6),D2
ADD.L D1,D2
MOVEA.L 40(A6),A1
ADDA.L D2,A1
FADD.X (A1),FP7 ; add c(i,k)
MOVE -164(A6),D2; this
EXT.L D2; whole
EXT.L D2; stuff
SUB.L -158(A6),D2; is
MULS.L #12,D2 ;
MOVE.L D2,D1 ; R
MOVE -160(A6),D2; E
EXT.L D2; D
EXT.L D2; U
SUB.L -146(A6),D2; N
MULS.L -150(A6),D2; D
ADD.L D1,D2 ; A
MOVEA.L 40(A6),A1; N
ADDA.L D2,A1 ; T !!!!
FMOVE.X FP7,(A1) ; put back c(i,k)
ADDQ #1,-162(A6)
SUBQ.L #1,D5
BGT lae_3