Volume Number: 8
Issue Number: 7
Column Tag: Jörg's Folder
FORTRAN Comparisons
The sequel.
By Jörg Langowski, MacTutor Regular Contributing Author
After my September column on numerical precision in the two competing FORTRAN compilers from Absoft and Language Systems, we received an irate comment from an Absoft user who felt that the comparison was not fair and that I hadn't compared both products to their maximum advantage. An excerpt (the less irate part of it) follows; the author wished that his name not be divulged:
Mr. Langowski runs his benchmarks using opt=3 for Language Systems but only basic optimization for Absoft. This is patently unfair. Absoft has loop unrolling and subroutine inclusion options that he ignores. These greatly speed up the benchmarks. He also penalizes Absoft by including the -e option.
I have run similar comparisons with Absoft armed in double precision mode, so all arithmetic is comparable in accuracy. For the Whetstones, on an FX I get over 6000 with Absoft, which is better than what he gets with a Quadra 700. With a Quadra 950, I get over 13,800 KWhets from Absoft, while I get roughly 5600 from Language Systems.
I should point out that the Absoft manual discusses the problem of extended arithmetic and comparisons, and notifies you of the calls to arm the FPU. I agree that Language Systems has better support for things like Apple Events etc., but let's make sure comparisons are fair. I should also mention that Absoft routinely compiles my code, at all comparable levels of optimization, faster than does Language Systems.
In fact, to recommend only one of the compilers was admittedly a little too strong. Let me start with the conclusion of this column: both the Absoft and LS Fortran compilers are really very good products, each with its advantages and disadvantages. We contacted Absoft, who had been suspiciously silent throughout all this, and asked them to express their views. We received a very constructive letter that gives a lot of insight into the tradeoffs that compiler makers have to deal with. Here it is:
Dear Mr. Langowski,
We would like to take this opportunity to comment on your September column in MacTutor magazine. It was stated in the article that Absoft has recently announced MacFortran II version 3.1. We actually began shipping version 3.1.2 of MacFortran II in October of 1991. In addition to the enhancements to the user interface, this version includes a code generator which takes advantage of the new 68040 floating point instructions, and a math library for FORTRAN intrinsic functions which is based on the Motorola transcendental function library intended for use with 68040-based machines. [In fact, I had used 3.1.2 in the tests; 3.1 was a typo. Sorry. -JL]
A significant portion of the article discusses the Paranoia program which, when compiled with Absoft MacFortran 3.1.2, with and without optimizations, produces a lot of error messages typical of floating point implementations where roundoff is not handled correctly. The method developed in the article to achieve a diagnostic free result by turning off optimization and using the -e option to prevent the compiler from maintaining variables in registers is not the solution we would have chosen or recommended. The Paranoia program can also be successfully negotiated by simply setting the rounding precision of the floating point unit to the precision of the benchmark. This procedure is described on pages 5-13 through 5-15 of the Porting Code chapter of our manual. It has the advantage of not obviating optimization and allows the compiler to maintain values in registers while still performing rounding to the width of the variable. However, this still leaves the question of whether it is valid for a compiler to maintain values in registers as long as possible. We feel that it is, although we do recognize that there are circumstances where control over the side effects of this optimization must be made available to the programmer. In particular, we provide several options and mechanisms to assist in the development of numerically sensitive programs on machines where the register file is wider than main storage or where fast, but not necessarily IEEE conforming instructions are present. The MC68040 provides single and double precision rounded basic operations whose use can improve performance at the cost of extended precision intermediates. In addition, the VOLATILE statement (a VAX extension) allows control over individual variables.
To further illustrate the situation, consider the following program:
      a = 1.0
      b = 3.0
      c = a/b
      if (c .eq. a/b) print *,'equal'
      end
With or without optimization, the MacFortran II compiler generates a program that correctly prints the string 'equal'. On the other hand, under the same conditions, the Language Systems FORTRAN compiler produces a program that is silent. Does this indicate problems with the arithmetic generated by the Language Systems compiler? An inspection of the generated code clearly shows no errors. The Language Systems compiler exhibits seemingly anomalous behavior precisely because it does not maintain the variable c in a register; its precision is truncated to 32 bits when it is stored in memory, but the comparison with the reloaded variable is made against the full 96-bit result of the division. The Language Systems -ansi switch will generate code which will compare successfully, but I am certain they would not recommend indiscriminate use of the option. What this example (and Paranoia) does point out are the problems that a programmer might encounter on a machine where the width of the register file is greater than that of main storage [my underline - JL]. The Microsoft compiler for Intel-based computers will also fail on this example if optimization is turned off (Intel floating point units are 80 bits wide).
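[A note from me: the VOLATILE statement mentioned above gives you this control variable by variable. Here is a minimal sketch, assuming the VAX-style syntax; with c declared VOLATILE, the Absoft-compiled program should stay silent, just like the Language Systems one. -JL]

C     Sketch (assuming the VAX-style VOLATILE syntax): VOLATILE
C     forces c to live in memory, so the stored 32-bit value is
C     compared against the full-precision quotient and the test fails
      real a, b, c
      volatile c
      a = 1.0
      b = 3.0
      c = a/b
      if (c .eq. a/b) print *,'equal'
      end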
The section of the article which describes the results of the speed tests begins with the cautionary remark "we should therefore use the Absoft compiler at least with the -e option, and maybe also drop the optimizations." We urge you to reconsider your conclusions, as there is a large body of problems that is not sensitive to environments where greater precision is maintained in registers than in memory. To relegate these programs to slower-than-optimal performance achieves no useful end. Numerically sensitive programs that explore the boundaries of precision are often better served by setting the floating point unit to the rounding state they expect.
We noticed that several recommended options were not used when running the benchmarks. In particular, subroutine folding and loop unrolling. Although we have not used the Whetstone benchmark for comparison with our competitors on the Macintosh for over a year, it can dramatically demonstrate some of the performance benefits of certain optimization techniques. The real advantage of subroutine inlining or folding is typically not the elimination of the call-return sequence, but rather the opportunities for further optimizations that it exposes to the compiler. This is in fact what happens when the P3 subroutine is folded in the Whetstone benchmark. The compiler is able to determine that the loop is completely invariant and can set the result values without performing a single loop iteration. As small encapsulated functions dictated by modern programming paradigms become more commonplace, this optimization technique will yield even greater performance improvements.
When the -O option (basic optimizations) is used, innermost loops which consist of a single executable statement are automatically unrolled, as is indicated on page 4-16 of our manual. This is the case in the saxpy subroutine in the Linpack benchmark. As instruction and data cache sizes become larger, multiple execution units (super-scalar processors) are introduced, and register files are expanded, loop unrolling becomes a very powerful optimization. It allows a compiler to maintain more values in registers, schedule code for the various execution units, and group data loads and stores in an attempt to minimize memory traffic.
When comparing the capabilities of two different compilers on the same piece of hardware, we feel that they each should be shown off to their greatest advantage.
Sincerely,
Peter A. Jacobson
Thank you very much for taking the time to reply. It was an oversight that I had not looked into changing the precision of the FPU for running Paranoia. When you set the FPU to single precision, the generated code does in fact pass all the numerical accuracy tests (for an example of how to do this, see Listing 1).
It is probably a question of philosophy how to handle the situation when the register precision of the FPU differs from that of the memory variables. It is true that if you keep intermediate values in registers during subexpression evaluation, you may get results that differ slightly from those produced by the same set of Fortran instructions when the intermediate results have been stored in memory. The speed advantage of using the internal registers to their full extent then comes with the need to control the FPU precision yourself in numerically sensitive parts of your code: if such a piece of code is written using 32-bit single precision variables, you should set the FPU to single precision rounding, and reset it to its original state after you're done with that particular routine. I suppose that is not too much work when you are developing, e.g., a fast math library.
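To make that concrete, here is a minimal sketch of the save/set/restore pattern, using the same getfpcontrol/setfpcontrol calls and hex masks as Listing 1 below (the subroutine crunch is just a hypothetical stand-in for your numerically sensitive code):

C     Sketch: run single-precision-sensitive code with the FPU
C     rounding set to single precision, then restore the old state
      INTEGER*4 old, new, getfpcontrol
      INTEGER*4 RPCLEAR, RPSINGLE
      PARAMETER (RPCLEAR  = z'FFFFFF3F')
      PARAMETER (RPSINGLE = z'00000040')
      old = getfpcontrol()              ! remember the control word
      new = (old .and. RPCLEAR) .or. RPSINGLE
      call setfpcontrol(new)            ! round to single precision
      call crunch()                     ! the sensitive routine (yours)
      call setfpcontrol(old)            ! restore the original rounding
      end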
But let's look at the optimization issue, which is where the two compilers really differ. In order to get a fair comparison of the general-purpose quality of the code produced by both compilers, I chose those optimization levels on both systems that gave the best results on the Linpack benchmark (and incidentally also on a Monte-Carlo and a Brownian Dynamics simulation, real-world problems that I am currently dealing with). Those settings were, for LS Fortran, opt=3; for Absoft, basic optimizations, no subroutine folding, and loop unrolling level 2 or none at all (no big difference).
First of all - why is loop unrolling not an advantage in the Linpack benchmark? (Quote from my V7#1 column: "Note that the loop unrolling slows down execution here, because the time-consuming calculations are already unrolled in the source code of the Linpack package.") One example is given in Listing 2: saxpy, one of the subroutines called by the Linpack benchmark, which computes a constant times a vector plus a vector. This routine is already loop-unrolled, as are all the routines of the BLAS (basic linear algebra subprograms), which is actually the package of routines tested by Linpack. Therefore, if you add the loop unrolling option, all you do is check again for all the exceptions on the indices (are they divisible by 2, 4, etc.) that are already handled in the Fortran code itself. By doing so, you create more overhead, and loop unrolling actually slows down execution.
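To see where the duplicated overhead comes from, compare a plain loop with the shape a two-way compiler unroll would give it - a schematic sketch, not actual compiler output; the unrolled form needs the same kind of clean-up test for leftover iterations that saxpy in Listing 2 already codes by hand:

c     Schematic sketch: the saxpy kernel, plain and two-way unrolled,
c     to show the clean-up test an unroll must add
      subroutine sketch(n, da, dx, dy)
      real da, dx(n), dy(n)
      integer i, m, n
c     plain loop
      do 10 i = 1, n
         dy(i) = dy(i) + da*dx(i)
   10 continue
c     the same loop, two-way unrolled: a clean-up step handles odd n -
c     the kind of index check saxpy already contains in its source
      m = mod(n, 2)
      if (m .ne. 0) dy(1) = dy(1) + da*dx(1)
      do 20 i = m + 1, n, 2
         dy(i)     = dy(i)     + da*dx(i)
         dy(i + 1) = dy(i + 1) + da*dx(i + 1)
   20 continue
      return
      end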
As far as I understood the documentation, the basic optimizations in Absoft were already doing all the things that LS Fortran does at opt=3. Absoft did not get any faster on Linpack by including more options; thus, I thought it fair to compare the compilers at those levels.
If you look at the figures in the September 1992 issue, you see that Absoft keeps a noticeable speed advantage over LS Fortran. However, you will never gain a factor of 2 or more, as a Whetstone figure of 13000 KWhets on a Quadra - with loop unrolling and subroutine inlining turned on - might erroneously suggest. Typically, Absoft's speed advantage is of the order of 20-30% over LS Fortran, up to 40% if you are compiling really badly written spaghetti code, because this is where Absoft's global optimizer really shines.
The Whetstone benchmark is one that you can speed up enormously by subroutine folding. But this tells you more about the Whetstone benchmark than about compilers. Whetstone (see the example in Listing 2) is organized as a large number of small subroutines that are called over and over again from tight loops. Thus, its result very much reflects the efficiency of subroutine calling, and the ability of a global optimizer to see where part or all of a subroutine becomes redundant. Inlining the subroutine P3 in the example causes one of the timing loops of the benchmark to be eliminated. Now, if you want to test the execution speed of a routine by calling it a zillion times from a tight loop, and the compiler reduces that zillion down to one because it is intelligent enough to see that you are just repeating yourself - which is nonsense from its point of view - then the compiler has defeated your timing code. (In this case, the machine has become more intelligent than the programmer; now, that's something we all hate, don't we?)
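To illustrate, once P3 is folded into module 8 (see Listing 2 below), the timing loop effectively becomes the fragment that follows; the optimizer can then analyze the whole loop body at once and, per Absoft's letter, determine the final values of X, Y and Z without executing a single iteration:

C     Module 8 with P3 inlined - a sketch of what folding exposes;
C     T and T2 come from COMMON /B/, as in Listing 2
      DO 80 I = 1, N8
         X = T*(X + Y)
         Y = T*(X + Y)
         Z = (X + Y)/T2
   80 CONTINUE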
If you have adopted a programming style where you break down your code into very small units, as you do in Forth, global optimizing with subroutine inlining might give you an advantage. However, typical Fortran code (unlike Forth) tends to be organized into larger entities - in a well-written, well-structured program, maybe one 60-line page per subroutine. Here, subroutine inlining will no longer do you much good; it'll only increase code size. Even if you get a speed advantage, it will be of the order of a few percent, contrary to what the Whetstone figures might suggest.
As to Absoft's comment that loop unrolling becomes a very powerful optimization as new processors with super-scalar architecture are introduced: that is right, but even the 68040 is not there yet. Absoft's speed advantage might really become significant when the compiler is run on other processors. We'll review the situation when new Macintoshes come out (and I finally upgrade from my Mac II).
Let's go on from the execution speed difference to other aspects of the two compilers. The reader's letter mentioned that he got faster compilation times with the Absoft compiler on all code he tried, at all optimization levels. Now, I don't know on what system he got those figures; on my Mac II, I have used MacFortran II v3.1.2 and LS Fortran 3.0. In my hands, compilation is always significantly faster under LS Fortran: e.g., the whole compile/link cycle for the Paranoia code takes 5:04 minutes under Absoft Fortran on a Mac II, and 2:57 minutes under LS Fortran on the same machine.
After all these ruminations, the basic conclusion still holds: if you need the absolute highest execution speed (then you are probably working on a Quadra 950, too), the Absoft compiler will offer you the fastest-running code, by a 20-40% margin. When you need a Macintosh user interface to your Fortran code quickly, and want support for inter-process communication, the Language Systems implementation is far superior, but your execution times will be tens of percentage points longer.
In fact, I spent some time splitting up a program of mine and compiling the time-sensitive code with Absoft Fortran using the Pascal-calling option, and the main code for the Mac interface with LS Fortran. Then I tried to link it all together. Unfortunately, this approach failed, because of naming ambiguities in the Fortran libraries. It would be really nice if the two compilers could be made to work together in this way.
Language Systems, in the meantime, has sent me some more information on AppleEvents in Fortran and another product announcement (I wish I'd get as much news from Absoft, by the way, especially on their announced AppleEvents support!):
It turns out that there are some workarounds to the limitations you cited, already included in v3.0:
1) The sending of misc/'dosc' events is supported by the F_SendScript call, and documented in Tech Note TNF36. This is a convenient way to send text-based data. We are working on some sample source code for receiving 'dosc' events; I'll send it to you when it's ready.
2) Here is a trick you can use to send any chunk of data with the 'dosc' event. By calling F_SendEvent with 'dosc' as the first argument, the third argument can be any handle, as long as it is less than 32K and passed by value. This works because F_SendScript packs the char or string array into a handle and then calls F_SendEvent.
By the way, char or string variables work just as well as arrays in the F_SendScript call. You just have to set the linecount to 1.
A free upgrade to FORTRAN version 3.0.1 just became available. This version contains the new call F_SetAETimeOut(), which you can use to change the AppleEvent timeout from its default of 60 seconds.
This version contains a few other goodies (including support for the PROFF tool in MPW 3.2.3) and a bunch of minor bug fixes. The FORTRAN upgrade is bundled with an MPW 3.2.3 upgrade for $35.
In October we plan to introduce two new FORTRAN-related products:
TSI Graphics
A general purpose plotting library (developed in cooperation with a third party) that integrates nicely with our output window and menus.
FORTRAN Debugger
A debugging package that includes a lint checker, a special version of the FORTRAN runtime library, and Apple's new source-level debugger (SourceBug). The lint checker performs extensive argument checking, including mode (%val or %ref) and numeric type, on all toolbox calls.
[As you read this, I'll probably already have tested those products and be getting ready for a review. I'm especially looking forward to the debugger.]
MOPS news
The MOPS user community is growing, as you can see from the next letter. However, I might have to be a little more explicit about how to get and use the MOPS system.
ATTN: Jörg Langowski via MacTutor (please forward)
Fm: Kent Hoxsey
Dear Jörg,
I have just received (and devoured) my latest issue of MacTutor, with your update on MOPS, and I was wondering if you could help me with a couple of questions about learning FORTH.
Over the past few months, I have developed a growing interest in FORTH, through discussions on the language, articles about programmers using it, and research into the language itself. In fact, I even bought some back issues of MacTutor from a used-book dealer to get some of your old columns.
I am very interested in learning to program with MOPS, and would appreciate any guidance or recommendations on how to get started. It seems to me that I need a compiler/development system, a copy of MOPS, and a good FORTH book.
The copy of MOPS and the FORTH book I can get relatively easily, especially if you can recommend a good book title. However, I have not been able to find anyone who can tell me where to get a copy of NEON. Do you know where it is available? Will MOPS work with any other FORTH compiler?
I have Michael Hore's address from MacTutor, but would be very interested in his Internet address if you have it. (Forgive my naivete, but I do not know how an AppleLink user can send mail to someone on CompuServe.)
Thank you for any assistance you can offer; I look forward to further columns on FORTH development.
Regards, Kent
Caledonia Software, 215 Caledonia St., Sausalito, CA 94965
(415) 289-2555, Fax (415) 289-2557, Alink: RUDEDOG
Dear Kent,
First, don't worry about getting NEON; MOPS is a fully stand-alone development system, with its own Forth compiler and everything - only its philosophy and much of the higher-level code are based on NEON.
You can download MOPS from oddjob.uchicago.edu by ftp; it is in the /pub/Yerk directory. Yerk, by the way, is the successor of NEON, now in the public domain, maintained by Bob Loewenstein, an astronomer at Yerkes Observatory at the University of Chicago. Yerk and Mops are very similar, the main differences being that Yerk kept the old NEON Forth kernel while Mops has its own, much faster one, and that the class structure and definitions have changed a little in Mops relative to Yerk/NEON. You can download Yerk from the same place. [There is also the MacForth commercial compiler from Creative Solutions that you might want to look into. - Ed.]
For a good set of Forth books, I recommend Starting Forth and Thinking Forth by Leo Brodie, published by Prentice-Hall; they can be found through any good computer bookstore.
You can send mail to CompuServe users from AppleLink just as you would to Internet addresses: the Internet mail address of a CompuServe user - let's say, with ID 98765,432 - would be 98765.432@compuserve.com.@INTERNET#. Note that the two numbers of the CompuServe ID in the Internet address are separated by a period, not by a comma. Michael Hore's Internet address is: mikeh@kralizec.zeta.org.au.
Hope this helps,
Jörg
I should also say that MOPS has meanwhile reached version 2.2; here is a short message from Mike that came in recently:
Hello again Joerg,
Mops 2.2 is now released, and is available for ftp from the usual place at oddjob.uchicago.edu courtesy of Bob Loewenstein.
I've made a lot of improvements to the Scroller class; in particular, we now have the panorama idea implemented. Also, the tutorial is now complete, as is the first part of the rest of the manual.
I haven't updated the user list lately, but there are several new users on CompuServe and GEnie (I've uploaded it to both those places).
The new MacTutor is great. [Thank you! - JL]
Cheers, Mike.
Listing 1: Setting FPU precision in Absoft Fortran
C     Set rounding precision for Absoft Fortran
      INTEGER*4 i, getfpcontrol
      INTEGER*4 RPCLEAR
      PARAMETER (RPCLEAR = z'FFFFFF3F')
      INTEGER*4 RPSINGLE
      PARAMETER (RPSINGLE = z'00000040')
      i = getfpcontrol()       ! get current control word
      i = (i .and. RPCLEAR)    ! clear current rounding precision
      i = (i .or. RPSINGLE)    ! or in new precision
      call setfpcontrol(i)     ! set new control word
      end
Listing 2: Example subroutines from the Linpack and Whetstone benchmarks
(Whetstone, main program)
C     MODULE 8: PROCEDURE CALLS
      X = 1.0
      Y = 1.0
      Z = 1.0
      DO 80 I = 1, N8
         CALL P3(X, Y, Z)
   80 CONTINUE

      SUBROUTINE P3(X, Y, Z)
      COMMON /B/ T, T2
      X = T*(X + Y)
      Y = T*(X + Y)
      Z = (X + Y)/T2
      RETURN
      END
(saxpy subroutine from the Linpack benchmark)
      subroutine saxpy(n, da, dx, incx, dy, incy)
c
c     constant times a vector plus a vector.
c     uses unrolled loops for increments equal to one.
c     jack dongarra, linpack, 3/11/78.
c
      real dx(1), dy(1), da
      integer i, incx, incy, ix, iy, m, mp1, n
c
      if (n .le. 0) return
      if (da .eq. 0.0e0) return
      if (incx .eq. 1 .and. incy .eq. 1) go to 20
c
c     code for unequal increments or equal increments
c     not equal to 1
c
      ix = 1
      iy = 1
      if (incx .lt. 0) ix = (-n+1)*incx + 1
      if (incy .lt. 0) iy = (-n+1)*incy + 1
      do 10 i = 1, n
         dy(iy) = dy(iy) + da*dx(ix)
         ix = ix + incx
         iy = iy + incy
   10 continue
      return
c
c     code for both increments equal to 1
c
c     clean-up loop
c
   20 m = mod(n, 4)
      if (m .eq. 0) go to 40
      do 30 i = 1, m
         dy(i) = dy(i) + da*dx(i)
   30 continue
      if (n .lt. 4) return
   40 mp1 = m + 1
      do 50 i = mp1, n, 4
         dy(i)     = dy(i)     + da*dx(i)
         dy(i + 1) = dy(i + 1) + da*dx(i + 1)
         dy(i + 2) = dy(i + 2) + da*dx(i + 2)
         dy(i + 3) = dy(i + 3) + da*dx(i + 3)
   50 continue
      return
      end