TweetFollow Us on Twitter

Fast Square Root Calc

Volume Number: 14 (1998)
Issue Number: 1
Column Tag: Assembler Workshop

Fast Square Root Calculation

by Guillaume Bédard, Frédéric Leblanc, Yohan Plourde and Pierre Marchand, Québec, Canada

Optimizing PowerPC Assembler Code to beat the Toolbox

Introduction

The calculation of the square root of a floating-point number is a frequently encountered task. However, the PowerPC processors don't have a square root instruction. The implementation presented here performs the square root of a double-precision number over the full range of representation of the IEEE 754 standard for normalized numbers (from 2.22507385851E-308 to 1.79769313486E308) with an accuracy of 15 or more decimal digits. It is very fast, at least six times faster than the Toolbox ROM call.

Theory of Operation

A floating point number has three components: a sign, a mantissa and an exponential part. For example, the number +3.5 x 10^4 (35 000) has a plus sign, a mantissa of 3.5 and an exponential part of 104. The mantissa consists of an integer part and a fractional part f.

A double precision number in IEEE 754 format has the same components: a sign bit s, an 11-bit exponent e and a 52 bit fraction f. The exponential part is expressed in powers of 2 and the exponent is biased by adding 1023 to the value of e. The mantissa is normalized to be of the form 1.f. Since the integer part of the normalized mantissa is always 1, it doesn't have to be included in the representation. The number is thus represented as follows: (-1)s x 1.f x 2^e+1023.

For example, the number 5.0 can be expressed in binary as 101.0, which means 101.0 x 2^0, which in turn is equal to 1.010 x 2^2, obtained by dividing the mantissa by 4 and multiplying 2^0 by 4. Therefore, the normalized mantissa is 1.010 and the exponent 2. Fraction f is then .01000000.... The biased exponent is obtained by adding 1023 to e and is 1025, or 10000000001 in binary. The double precision IEEE representation of 5.0 is finally:

-----------------------------------------------------------------------------
|  0 | 10000000001 | 01000000000000000000000000000000000000000000000000000  |
-----------------------------------------------------------------------------

or 4014000000000000 in hexadecimal notation for short.

First Approximation

Given this representation, a first approximation to the square root of a number is obtained by dividing the exponent by 2. If the number is an even power of 2 such as 16 or 64, the exact root is obtained. If the number is an odd power of 2 such as 8 or 32, 1/SQRT(2) times the square root is obtained. In general, the result will be within a factor SQRT(2) of the true value.

Refining the Approximation

The Newton-Raphson method is often used to obtain a more accurate value for the root x of a function f(x) once an initial approximation x0 is given:

[1]

This becomes, in the case of the square root of n, = x2 - n:where f(x)

[2]

An excellent approximation to the square root starting with the initial approximation given above is obtained within 5 iterations using equation [2]. This algorithm is already pretty fast, but its speed is limited by the fact that each iteration requires a double-precision division which is the slowest PowerPC floating-point instruction with 32 cycles on the MPC601 (Motorola, 1993).

Eliminating Divisions

Another approach is to use equation [1] with the function.

In this case, equation [1] becomes:

[3]

There is still a division by n, but since n is constant (it's the original number whose root we want to find), it can be replaced by multiplying by 1/n, which can be calculated once before the beginning of the iteration process. The five 32-cycle divisions are thus replaced by this single division followed by 5 much faster multiplications (5 cycles each). This approach is approximately three times faster than the preceding one. However, care must be taken for large numbers since the term in x02 can cause the operation to overflow.

Use of a table

Finally, an approach that is even faster consists in using a table to obtain a more accurate first approximation. In order to do so, the range of possible values of fraction f (0 to ~1) is divided into 16 sub-ranges by using the first 4 bits of f as an index into a table which contains the first two coefficients of the Taylor expansion of the square root of the mantissa (1.0 to ~2) over that sub-range.

The Taylor expansion is given in general by:

[4]

the first two terms of which yield, in the case where f(x) = SQRT(x):

[5]

The square root of x is thus approximated by 16 straight-line segments. The table therefore contains the values of

A =

and B =

for each of the 16 sub-ranges as shown in Figure 1. This first approximation gives an accuracy of about 1.5 %.

Figure 1. Approximation by straight line segment.

To reach the desired accuracy of 15 digits, equation [2] is applied twice to the result of equation [5]. To avoid having to perform two divisions by repeating the iteration, the two iterations are folded together as follows, which contains only one division:

and

[6]

In order to perform these calculation, the exponent of x and n is reduced to -1 (1022 biased), so that floating-point operations apply only to the values of the mantissa and don't overflow if the exponent is very large. The value of these numbers will therefore be in the range 0.5 to 1.0 since the mantissa is in the range 1.0 to 2.0. If the original exponent was odd, the mantissa is multiplied by SQRT(2) before applying equation [6].

Finally, the original exponent divided by two is restored at the end.

The Code

The SQRoot function shown in Listing 1 has been implemented in CodeWarrior C/C++ version 10.

Listing 1: SQRoot.c

// On entry, fp1 contains a positive number between 2.22507385851E-308
// and 1.79769313486E308. On exit, the result is in fp1.

asm long double SQRoot(long double num);   // prototype

float Table[35] = {
0.353553390593, 0.707106781187, 0.364434493428, 0.685994340570,
0.375000000000, 0.666666666667, 0.385275875186, 0.648885684523,
0.395284707521, 0.632455532034, 0.405046293650, 0.617213399848,
0.414578098794, 0.603022689156, 0.423895623945, 0.589767824620,
0.433012701892, 0.577350269190, 0.441941738242, 0.565685424949,
0.450693909433, 0.554700196225, 0.459279326772, 0.544331053952,
0.467707173347, 0.534522483825, 0.475985819116, 0.525225731439,
0.484122918276, 0.516397779494, 0.492125492126, 0.508000508001,
1.414213562373, 0.000000000000, 0.000000000000 };

asm long double Sqrt(long double num) {

   lwz   r3,Table(rtoc)      // address of Table[]
   lhz   r4,24(sp)           // load
                             // Sign(1)+Exponent(11)+Mantissa(4)
   andi.   r5,r4,0xF         // keep only Mantissa(4)
   ori   r5,r5,0x3FE0        // exponent = -1+BIAS = 1022
   sth   r5,24(sp)           // save reduced number

   rlwinm   r5,r5,3,25,28    // take 8*Mantissa(4) as index
   lfd   fp1,24(sp)          // load reduced number
   lfsux   fp4,r5,r3         // load coefficient A
   lfs   fp5,4(r5)           // load coefficient B
   lfs   fp3,128(r3)         // load SQRT(2)
   fmr   fp2,fp1             // copy reduced number
   rlwinm.   r5,r4,31,18,28  // divide exponent by 2
   beq   @@2                 // if (exponent == 0) then done

   fmadd   fp2,fp2,fp5,fp4   // approximation SQRT(x) = A + B*x
   andi.   r4,r4,0x10        // check if exponent even
   beq   @@1                 // if (exponent even) do iteration
   fmul   fp2,fp2,fp3        // multiply reduced number by SQRT(2)
   fadd   fp1,fp1,fp1        // adjust exponent of original number

@@1:   fadd   fp3,fp2,fp2    // 2*x
   fmul   fp5,fp2,fp1        // x*n
   fadd   fp3,fp3,fp3        // 4*x
   fmadd   fp4,fp2,fp2,fp1   // x*x + n
   fmul   fp5,fp3,fp5        // 4*x*x*n
   fmul   fp6,fp2,fp4        // denominator = x*(x*x + n)
   fmadd   fp5,fp4,fp4,fp5   // numerator = (x*x + n)*(x*x + n) +
                             // 4*x*x*n
   fdiv   fp1,fp5,fp6        // double precision division
   andi.   r5,r5,0x7FF0      // mask exponent 
   addi   r5,r5,0x1FE0       // rectify new exponent

@@2:   sth   r5,132(r3)      // save constant C (power of 2) 
   lfd   fp2,132(r3)         // load constant C
   fmul   fp1,fp1,fp2        // multiply by C to replace exponent
   blr                       // done, the result is in fp1
}

Performance

The code presented above runs in less than 100 cycles, which means less than 1 microsecond on a 7200/75 Power Macintosh and is more than six times faster than the ROM code. The code could be modified to make use of the floating reciprocal square root estimate instruction (frsqrte) that is available on the MPC603 and MPC604 processors, and which has an accuracy of 5 bits. It is not available on the MPC601, however. The method used here could also be used to evaluate other transcendental functions.

Performance was measured by running the code a thousand times and calling a simple timing routine found in (Motorola, 1993), that we called myGetTime(). It uses the real-time clock of the MPC 601 processor (RTCU and RTCL registers) and is shown in Listing 2. The routine would have to be modified to run on MPC603 or MPC604 processors, since they don't have the same real-time clock mechanism.

The code doesn't support denormalized numbers (below 2.22507385851E-308). This could easily be implemented albeit at the cost of a slight reduction in performance.

Listing 2: myGetTime.c

asm long myGetTime()
   {
lp:    mfspr   r4,4           // RTCU
   mfspr   r3,5               // RTCL
   mfspr   r5,4               // RTCU again
    cmpw      r4,r5           // if RTCU has changed, try again
    bne      lp
    rlwinm   r3,r3,25,7,31    // shift right since bits 25-31 are
                              // not used
    blr                       // the result is in r3. 1 unit is
                              // worth 128 ns.
   }

To run the code, a very simple interface using the SIOUX library is provided in Listing 3.

Listing 3: main.c

#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <fp.h>

void main()
   {
   long double   num, num2;
   long startTime, endTime, time;
   short i;


   do {
   printf("%2s","> ");           // caret
   scanf("%Lf",&num);           // read long double
   if (num < 0.0) num = 0.0;     // replace by 0.0 if negative
   startTime = myGetTime();
   for (i = 0; i < 1000; i++)    // repeat 1000 times
   num2 = SQRoot(num);              // call our function
   endTime = myGetTime();
   time = endTime - startTime;
   if (num > 1e-6 && num < 1e7)
      printf("%7s%Lf\n","root = ",num2);   // show result
   else
      printf("%7s%Le\n","root = ",num2);
   printf("%7s%d\n","time = ", time);      // show elapsed time
   }
   while (1);                              // repeat until Quit
   }

References

PowerPC 601 RISC Microprocessor User's Manual, Motorola MPC601UM/AD Rev 1, 1993.


The first three authors are undergraduate students in Computer Science at Université Laval in Québec, Canada. This work was done as an assignment in a course on Computer Architecture given by the fourth author.

 

Community Search:
MacTech Search:

Software Updates via MacUpdate

Daylite 6.7.3 - Dynamic business organiz...
Daylite helps businesses organize themselves with tools such as shared calendars, contacts, tasks, projects, notes, and more. Enable easy collaboration with features such as task and project... Read more
VirtualBox 6.0.10 - x86 virtualization s...
VirtualBox is a family of powerful x86 virtualization products for enterprise as well as home use. Not only is VirtualBox an extremely feature rich, high performance product for enterprise customers... Read more
BetterTouchTool 3.149 - Customize multi-...
BetterTouchTool adds many new, fully customizable gestures to the Magic Mouse, Multi-Touch MacBook trackpad, and Magic Trackpad. These gestures are customizable: Magic Mouse: Pinch in / out (zoom)... Read more
Dropbox 77.4.131 - Cloud backup and sync...
Dropbox is an application that creates a special Finder folder that automatically syncs online and between your computers. It allows you to both backup files and keeps them up-to-date between systems... Read more
MainStage 3 3.4.3 - Live performance too...
Apple MainStage makes it easy to bring to the stage all the same instruments and effects that you love in your recording. Everything from the Sound Library and Smart Controls you're familiar with... Read more
Paperless 3.0.6 - $69.95
Paperless is a digital documents manager. Remember when everyone talked about how we would soon be a paperless society? Now it seems like we use paper more than ever. Let's face it - we need and we... Read more
TextMate 2.0.rc.29 - Code/markup editor...
TextMate is a versatile plain text editor with a unique and innovative feature set which caused it to win an Apple Design Award for Best Mac OS X Developer Tool in August 2006 A rapidly growing... Read more
Notability 4.0.4 - Note-taking and annot...
Notability is a powerful note-taker to annotate documents, sketch ideas, record lectures, take notes and more. It combines, typing, handwriting, audio recording, and photos so you can create notes... Read more
CleanMyMac X 4.4.4 - Delete files that w...
CleanMyMac makes space for the things you love. Sporting a range of ingenious new features, CleanMyMac lets you safely and intelligently scan and clean your entire system, delete large, unused files... Read more
Geekbench 4.4.0 - Measure processor and...
Geekbench provides a comprehensive set of benchmarks engineered to quickly and accurately measure processor and memory performance. Designed to make benchmarks easy to run and easy to understand,... Read more

Latest Forum Discussions

See All

TEPPEN guide - Tips and tricks for new p...
TEPPEN is a wild game that nobody asked for, but I’m sure glad it exists. Who would’ve thought that a CCG featuring Capcom characters could be so cool and weird? In case you’re not completely sure what TEPPEN is, make sure to check out our review... | Read more »
Dr. Mario World guide - Other games that...
We now live in a post-Dr. Mario World world, and I gotta say, things don’t feel too different. Nintendo continues to squirt out bad games on phones, causing all but the most stalwart fans of mobile games to question why they even bother... | Read more »
Strategy RPG Brown Dust introduces its b...
Epic turn-based RPG Brown Dust is set to turn 500 days old next week, and to celebrate, Neowiz has just unveiled its biggest and most exciting update yet, offering a host of new rewards, increased gacha rates, and a brand new feature that will... | Read more »
Dr. Mario World is yet another disappoin...
As soon as I booted up Dr. Mario World, I knew I wasn’t going to have fun with it. Nintendo’s record on phones thus far has been pretty spotty, with things trending downward as of late. [Read more] | Read more »
Retro Space Shooter P.3 is now available...
Shoot-em-ups tend to be a dime a dozen on the App Store, but every so often you come across one gem that aims to shake up the genre in a unique way. Developer Devjgame’s P.3 is the latest game seeking to do so this, working as a love letter to the... | Read more »
Void Tyrant guide - Guildins guide
I’ve still been putting a lot of time into Void Tyrant since it officially released last week, and it’s surprising how much stuff there is to uncover in such a simple-looking game. Just toray, I finished spending my Guildins on all available... | Read more »
Tactical RPG Brown Dust celebrates the s...
Neowiz is set to celebrate the summer by launching a 2-month long festival in its smash-hit RPG Brown Dust. The event kicks off today, and it’s divided into 4 parts, each of which will last two weeks. Brown Dust is all about collecting, upgrading,... | Read more »
Flappy Royale is an incredibly clever ta...
I spent the better part of my weekend playing Flappy Royale. I didn’t necessarily want to. I just felt like I had to. It’s a hypnotic experience that’s way too easy to just keep playing. | Read more »
Void Tyrant guide - General tips and tri...
Void Tyrant is a card-based dungeon-crawler that doesn’t fit in the mold of other games in the genre. Between the Blackjack-style combat and strange gear system alone, you’re left to your own devices to figure out how best to use everything to your... | Read more »
Webzen’s latest RPG First Hero is offici...
You might be busy sending your hulking Dark Knight into the midst of battle in Webzen’s other recent release: the long-anticipated MU Origin 2. But for something a little different, the South Korean publisher has launched First Hero. Released today... | Read more »

Price Scanner via MacPrices.net

Amazon drops prices, now offers clearance 13″...
Amazon has new dropped prices on clearance 13″ 2.3GHz Dual-Core non-Touch Bar MacBook Pros by $200 off Apple’s original MSRP, with prices now available starting at $1099. Shipping is free. Be sure to... Read more
2018 15″ MacBook Pros now on sale for $500 of...
Amazon has dropped prices on select clearance 2018 15″ 6-Core MacBook Pros to $500 off Apple’s original MSRP. Prices now start at $1899 shipped: – 2018 15″ 2.2GHz Touch Bar MacBook Pro Silver: $1899.... Read more
Price drop! Clearance 12″ 1.2GHz Silver MacBo...
Amazon has dropped their price on the recently-discontinued 12″ 1.2GHz Silver MacBook to $849.99 shipped. That’s $450 off Apple’s original MSRP for this model, and it’s the cheapest price available... Read more
Apple’s 21″ 3.0GHz 4K iMac drops to only $936...
Abt Electronics has dropped their price on clearance, previous-generation 21″ 3.0GHz 4K iMacs to only $936 shipped. That’s $363 off Apple’s original MSRP, and it’s the cheapest price we’ve seen so... Read more
Amazon’s Prime Day savings on Apple 11″ iPad...
Amazon has new 2018 Apple 11″ iPad Pros in stock today and on sale for up to $250 off Apple’s MSRP as part of their Prime Day sale (but Prime membership is NOT required for these savings). These are... Read more
Prime Day Apple iPhone deal: $100 off all iPh...
Boost Mobile is offering Apple’s new 2018 iPhone Xr, iPhone Xs, and Xs Max for $100 off MSRP. Their discount reduces the cost of an Xs to $899 for the 64GB models and $999 for the 64GB Xs Max. Price... Read more
Clearance 13″ 2.3GHz Dual-Core MacBook Pros a...
Focus Camera has clearance 2017 13″ 2.3GHz/128GB non-Touch Bar Dual-Core MacBook Pros on sale for $169 off Apple’s original MSRP. Shipping is free. Focus charges sales tax for NY & NJ residents... Read more
Amazon Prime Day deal: 9.7″ Apple iPads for $...
Amazon is offering new 9.7″ WiFi iPads with Apple Pencil support for $80-$100 off MSRP as part of their Prime Day sale, starting at only $249. These are the same iPads found in Apple’s retail and... Read more
Amazon Prime Day deal: 10% (up to $20) off Ap...
Amazon is offering discounts on new 2019 Apple AirPods ranging up to $20 (10%) off MSRP as part of their Prime Day sales. Shipping is free: – AirPods with Charging Case: $144.99 $15 off MSRP –... Read more
Amazon Prime Day deal: $50-$80 off Apple Watc...
Amazon has Apple Watch Series 4 and Series 3 models on sale for $50-$80 off Apple’s MSRP as part of their Prime Day deals with prices starting at only $199. Choose Amazon as the seller rather than a... Read more

Jobs Board

*Apple* Systems Architect/Engineer, Vice Pre...
…its vision to be the world's most trusted financial group. **Summary:** Apple Systems Architect/Engineer with strong knowledge of products and services related to Read more
*Apple* Graders/Inspectors (Seasonal/Hourly/...
…requirements. #COVAentryleveljobs ## Minimum Qualifications Some knowledge of agricultural and/or the apple industry is helpful as well as the ability to comprehend, Read more
Best Buy *Apple* Computing Master - Best Bu...
**710003BR** **Job Title:** Best Buy Apple Computing Master **Job Category:** Store Associates **Location Number:** 000171-Winchester Road-Store **Job Description:** Read more
Best Buy *Apple* Computing Master - Best Bu...
**709786BR** **Job Title:** Best Buy Apple Computing Master **Job Category:** Sales **Location Number:** 000430-Orange Park-Store **Job Description:** **What does a Read more
Geek Squad *Apple* Master Consultation Agen...
**709918BR** **Job Title:** Geek Squad Apple Master Consultation Agent **Job Category:** Services/Installation/Repair **Location Number:** 000106-Palmdale-Store Read more
All contents are Copyright 1984-2011 by Xplain Corporation. All rights reserved. Theme designed by Icreon.