TweetFollow Us on Twitter

Fast Square Root Calc

Volume Number: 14 (1998)
Issue Number: 1
Column Tag: Assembler Workshop

Fast Square Root Calculation

by Guillaume Bédard, Frédéric Leblanc, Yohan Plourde and Pierre Marchand, Québec, Canada

Optimizing PowerPC Assembler Code to beat the Toolbox

Introduction

The calculation of the square root of a floating-point number is a frequently encountered task. However, the PowerPC processors don't have a square root instruction. The implementation presented here performs the square root of a double-precision number over the full range of representation of the IEEE 754 standard for normalized numbers (from 2.22507385851E-308 to 1.79769313486E308) with an accuracy of 15 or more decimal digits. It is very fast, at least six times faster than the Toolbox ROM call.

Theory of Operation

A floating point number has three components: a sign, a mantissa and an exponential part. For example, the number +3.5 x 10^4 (35 000) has a plus sign, a mantissa of 3.5 and an exponential part of 104. The mantissa consists of an integer part and a fractional part f.

A double precision number in IEEE 754 format has the same components: a sign bit s, an 11-bit exponent e and a 52 bit fraction f. The exponential part is expressed in powers of 2 and the exponent is biased by adding 1023 to the value of e. The mantissa is normalized to be of the form 1.f. Since the integer part of the normalized mantissa is always 1, it doesn't have to be included in the representation. The number is thus represented as follows: (-1)s x 1.f x 2^e+1023.

For example, the number 5.0 can be expressed in binary as 101.0, which means 101.0 x 2^0, which in turn is equal to 1.010 x 2^2, obtained by dividing the mantissa by 4 and multiplying 2^0 by 4. Therefore, the normalized mantissa is 1.010 and the exponent 2. Fraction f is then .01000000.... The biased exponent is obtained by adding 1023 to e and is 1025, or 10000000001 in binary. The double precision IEEE representation of 5.0 is finally:

-----------------------------------------------------------------------------
|  0 | 10000000001 | 01000000000000000000000000000000000000000000000000000  |
-----------------------------------------------------------------------------

or 4014000000000000 in hexadecimal notation for short.

First Approximation

Given this representation, a first approximation to the square root of a number is obtained by dividing the exponent by 2. If the number is an even power of 2 such as 16 or 64, the exact root is obtained. If the number is an odd power of 2 such as 8 or 32, 1/SQRT(2) times the square root is obtained. In general, the result will be within a factor SQRT(2) of the true value.

Refining the Approximation

The Newton-Raphson method is often used to obtain a more accurate value for the root x of a function f(x) once an initial approximation x0 is given:

[1]

This becomes, in the case of the square root of n, = x2 - n:where f(x)

[2]

An excellent approximation to the square root starting with the initial approximation given above is obtained within 5 iterations using equation [2]. This algorithm is already pretty fast, but its speed is limited by the fact that each iteration requires a double-precision division which is the slowest PowerPC floating-point instruction with 32 cycles on the MPC601 (Motorola, 1993).

Eliminating Divisions

Another approach is to use equation [1] with the function.

In this case, equation [1] becomes:

[3]

There is still a division by n, but since n is constant (it's the original number whose root we want to find), it can be replaced by multiplying by 1/n, which can be calculated once before the beginning of the iteration process. The five 32-cycle divisions are thus replaced by this single division followed by 5 much faster multiplications (5 cycles each). This approach is approximately three times faster than the preceding one. However, care must be taken for large numbers since the term in x02 can cause the operation to overflow.

Use of a table

Finally, an approach that is even faster consists in using a table to obtain a more accurate first approximation. In order to do so, the range of possible values of fraction f (0 to ~1) is divided into 16 sub-ranges by using the first 4 bits of f as an index into a table which contains the first two coefficients of the Taylor expansion of the square root of the mantissa (1.0 to ~2) over that sub-range.

The Taylor expansion is given in general by:

[4]

the first two terms of which yield, in the case where f(x) = SQRT(x):

[5]

The square root of x is thus approximated by 16 straight-line segments. The table therefore contains the values of

A =

and B =

for each of the 16 sub-ranges as shown in Figure 1. This first approximation gives an accuracy of about 1.5 %.

Figure 1. Approximation by straight line segment.

To reach the desired accuracy of 15 digits, equation [2] is applied twice to the result of equation [5]. To avoid having to perform two divisions by repeating the iteration, the two iterations are folded together as follows, which contains only one division:

and

[6]

In order to perform these calculation, the exponent of x and n is reduced to -1 (1022 biased), so that floating-point operations apply only to the values of the mantissa and don't overflow if the exponent is very large. The value of these numbers will therefore be in the range 0.5 to 1.0 since the mantissa is in the range 1.0 to 2.0. If the original exponent was odd, the mantissa is multiplied by SQRT(2) before applying equation [6].

Finally, the original exponent divided by two is restored at the end.

The Code

The SQRoot function shown in Listing 1 has been implemented in CodeWarrior C/C++ version 10.

Listing 1: SQRoot.c

// On entry, fp1 contains a positive number between 2.22507385851E-308
// and 1.79769313486E308. On exit, the result is in fp1.

asm long double SQRoot(long double num);   // prototype

float Table[35] = {
0.353553390593, 0.707106781187, 0.364434493428, 0.685994340570,
0.375000000000, 0.666666666667, 0.385275875186, 0.648885684523,
0.395284707521, 0.632455532034, 0.405046293650, 0.617213399848,
0.414578098794, 0.603022689156, 0.423895623945, 0.589767824620,
0.433012701892, 0.577350269190, 0.441941738242, 0.565685424949,
0.450693909433, 0.554700196225, 0.459279326772, 0.544331053952,
0.467707173347, 0.534522483825, 0.475985819116, 0.525225731439,
0.484122918276, 0.516397779494, 0.492125492126, 0.508000508001,
1.414213562373, 0.000000000000, 0.000000000000 };

asm long double Sqrt(long double num) {

   lwz   r3,Table(rtoc)      // address of Table[]
   lhz   r4,24(sp)           // load
                             // Sign(1)+Exponent(11)+Mantissa(4)
   andi.   r5,r4,0xF         // keep only Mantissa(4)
   ori   r5,r5,0x3FE0        // exponent = -1+BIAS = 1022
   sth   r5,24(sp)           // save reduced number

   rlwinm   r5,r5,3,25,28    // take 8*Mantissa(4) as index
   lfd   fp1,24(sp)          // load reduced number
   lfsux   fp4,r5,r3         // load coefficient A
   lfs   fp5,4(r5)           // load coefficient B
   lfs   fp3,128(r3)         // load SQRT(2)
   fmr   fp2,fp1             // copy reduced number
   rlwinm.   r5,r4,31,18,28  // divide exponent by 2
   beq   @@2                 // if (exponent == 0) then done

   fmadd   fp2,fp2,fp5,fp4   // approximation SQRT(x) = A + B*x
   andi.   r4,r4,0x10        // check if exponent even
   beq   @@1                 // if (exponent even) do iteration
   fmul   fp2,fp2,fp3        // multiply reduced number by SQRT(2)
   fadd   fp1,fp1,fp1        // adjust exponent of original number

@@1:   fadd   fp3,fp2,fp2    // 2*x
   fmul   fp5,fp2,fp1        // x*n
   fadd   fp3,fp3,fp3        // 4*x
   fmadd   fp4,fp2,fp2,fp1   // x*x + n
   fmul   fp5,fp3,fp5        // 4*x*x*n
   fmul   fp6,fp2,fp4        // denominator = x*(x*x + n)
   fmadd   fp5,fp4,fp4,fp5   // numerator = (x*x + n)*(x*x + n) +
                             // 4*x*x*n
   fdiv   fp1,fp5,fp6        // double precision division
   andi.   r5,r5,0x7FF0      // mask exponent 
   addi   r5,r5,0x1FE0       // rectify new exponent

@@2:   sth   r5,132(r3)      // save constant C (power of 2) 
   lfd   fp2,132(r3)         // load constant C
   fmul   fp1,fp1,fp2        // multiply by C to replace exponent
   blr                       // done, the result is in fp1
}

Performance

The code presented above runs in less than 100 cycles, which means less than 1 microsecond on a 7200/75 Power Macintosh and is more than six times faster than the ROM code. The code could be modified to make use of the floating reciprocal square root estimate instruction (frsqrte) that is available on the MPC603 and MPC604 processors, and which has an accuracy of 5 bits. It is not available on the MPC601, however. The method used here could also be used to evaluate other transcendental functions.

Performance was measured by running the code a thousand times and calling a simple timing routine found in (Motorola, 1993), that we called myGetTime(). It uses the real-time clock of the MPC 601 processor (RTCU and RTCL registers) and is shown in Listing 2. The routine would have to be modified to run on MPC603 or MPC604 processors, since they don't have the same real-time clock mechanism.

The code doesn't support denormalized numbers (below 2.22507385851E-308). This could easily be implemented albeit at the cost of a slight reduction in performance.

Listing 2: myGetTime.c

asm long myGetTime()
   {
lp:    mfspr   r4,4           // RTCU
   mfspr   r3,5               // RTCL
   mfspr   r5,4               // RTCU again
    cmpw      r4,r5           // if RTCU has changed, try again
    bne      lp
    rlwinm   r3,r3,25,7,31    // shift right since bits 25-31 are
                              // not used
    blr                       // the result is in r3. 1 unit is
                              // worth 128 ns.
   }

To run the code, a very simple interface using the SIOUX library is provided in Listing 3.

Listing 3: main.c

#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <fp.h>

void main()
   {
   long double   num, num2;
   long startTime, endTime, time;
   short i;


   do {
   printf("%2s","> ");           // caret
   scanf("%Lf",&num);           // read long double
   if (num < 0.0) num = 0.0;     // replace by 0.0 if negative
   startTime = myGetTime();
   for (i = 0; i < 1000; i++)    // repeat 1000 times
   num2 = SQRoot(num);              // call our function
   endTime = myGetTime();
   time = endTime - startTime;
   if (num > 1e-6 && num < 1e7)
      printf("%7s%Lf\n","root = ",num2);   // show result
   else
      printf("%7s%Le\n","root = ",num2);
   printf("%7s%d\n","time = ", time);      // show elapsed time
   }
   while (1);                              // repeat until Quit
   }

References

PowerPC 601 RISC Microprocessor User's Manual, Motorola MPC601UM/AD Rev 1, 1993.


The first three authors are undergraduate students in Computer Science at Université Laval in Québec, Canada. This work was done as an assignment in a course on Computer Architecture given by the fourth author.

 

Community Search:
MacTech Search:

Software Updates via MacUpdate

MYStuff Pro 2.1.4 - Create inventories f...
MYStuff Pro is one of the most flexible ways to create detail-rich inventories for your home or small business. Add items to MYStuff by dragging and dropping existing information, uploading new... Read more
Day One 6.1 - Maintain a daily journal.
Day One is an easy, great-looking way to use a journal / diary / text-logging application. Day One is well designed and extremely focused to encourage you to write more through quick Menu Bar entry,... Read more
Vivaldi 3.7.2218.55 - An advanced browse...
Vivaldi is a browser for our friends. We live in our browsers. Choose one that has the features you need, a style that fits and values you can stand by. From the look and feel, to how you interact... Read more
Macs Fan Control 1.5.9 - Monitor and con...
Macs Fan Control allows you to monitor and control almost any aspect of your computer's fans, with support for controlling fan speed, temperature sensors pane, menu-bar icon, and autostart with... Read more
Dragon Dictate 6.0 - Premium voice-recog...
With Dragon Dictate speech recognition software, you can use your voice to create and edit text or interact with your favorite Mac applications. Far more than just speech-to-text, Dragon Dictate lets... Read more
OmniFocus 3.11.7 - GTD task manager with...
OmniFocus is an organizer app. It uses projects to organize tasks naturally, and then add tags to organize across projects. Easily enter tasks when you’re on the go, and process them when you have... Read more
rekordbox 6.5.1.0009 - Professional DJ m...
rekordbox is the best way of preparing and managing your tracks, be it at home, in the studio, or even on the plane! It allows you to import music from other music-management software using the... Read more
1Password 7.8.1 - Powerful password mana...
1Password is a password manager that uniquely brings you both security and convenience. It is the only program that provides anti-phishing protection and goes beyond password management by adding Web... Read more
Ableton Live 10.1.35 - Record music usin...
Ableton Live lets you create and record music on your Mac. Use digital instruments, pre-recorded sounds, and sampled loops to arrange, produce, and perform your music like never before. Ableton Live... Read more
Microsoft OneNote 16.48 - Free digital n...
OneNote is your very own digital notebook. With OneNote, you can capture that flash of genius, that moment of inspiration, or that list of errands that's too important to forget. Whether you're at... Read more

Latest Forum Discussions

See All

Pokemon Masters EX's latest update...
Two new Sync Pairs have made their way into Pokemon Masters today. Both pairs hail from the Alola region, Elio & Popplio and Selene & Rowlet. Their arrival coincides with an event called Trials on the Isle. [Read more] | Read more »
Shrouded Citadel: navigate your escape i...
Having been cooped up over the past 12 months due to winter and covid, Pifer is encouraging gamers to start enjoying the great outdoors again with its recently launched AR adventure epic, Shrouded Citadel. | Read more »
Moonlight Sculptor is an upcoming MMORPG...
Kakao Games and XL Games – who you might be familiar with from their previous game ArcheAge – have announced that their MMORPG Moonlight Sculptor is now available to pre-order for iOS and Android devices. Moonlight Sculptor has previously launched... | Read more »
MU Archangel is now open for pre-registr...
MU Archangel is now open for pre-registration in Southeast Asia following its massive success in other territories. Players from Singapore, Thailand, Malaysia, Indonesia, and the Philippines (except Vietnam) can now join in on the fun by applying... | Read more »
Compete, a new social media app you can...
Whoever told you you can’t get rich making videos has obviously never heard of Compete, Competitive Media Technologies Limited’s hot new social media app where you can rake in all the dough just by doing what you love. Video monetization that... | Read more »
Bethesda has released a new DOOM mobile...
Bethesda Softworks has released a new DOOM game out of the blue exclusively for mobile devices. It’s called Mighty DOOM and is currently only available as an early access title on Android but will be expanding to more users in the future. [Read... | Read more »
Anagraphs is a word puzzle game with a t...
Cinq-Mars Media has released its word puzzle game Anagraphs for iOS and Android devices. The game released last week after a short delay in getting it onto the appropriate platforms. [Read more] | Read more »
These are the top 5 best iPhone games li...
Fortnite has been the big hitter in mobile gaming this year, and it's not hard to see why. Thanks to some excellent marketing, and a polished experience that almost anyone can enjoy, it's really taken the App Store by storm. But there are other... | Read more »
The top 5 best iPhone games like Pokemon...
Pokemon GO is still the, if you'll excuse the pun, go-to game if you want some AR action on your phone. But it's not the only choice out there, and if you've got a hankering for something a bit different, then your eyes might already have started... | Read more »
The top 5 best iPhone games like Starcra...
Starcraft sits at the top of the RTS tree for a number of very good reasons. It also isn't on mobile, again, for a number of very good reasons. But that doesn't mean you can't find a way to indulge your sci-fi, competitive, massive, or engaging RTS... | Read more »

Price Scanner via MacPrices.net

Roundup of Today’s Best 13″ M1 MacBook Pro De...
Apple resellers are offering sale prices on Apple’s M1-powered 13″ MacBook Pro ranging up to $230 off MSRP. Here’s where to pick one up today, and as always, keep an eye on our 13″ MacBook Pro Price... Read more
Roundup of Today’s Best M1 Mac mini Prices an...
Apple resellers are offering discounts on new M1 Mac minis ranging up to $140 off MSRP this week, with prices starting at only $589. These are all the same Mac minis sold by Apple in their retail and... Read more
New at Verizon: Apple iPhone SE for free with...
Verizon is offering the 64GB Apple iPhone SE for free for customers opening a new line of service with a Verizon Unlimited plan. Offer is valid for a limited time. Price is credited monthly over a 24... Read more
B&H is offering clearance prices on lefto...
Apple reseller B&H Photo has clearance 2020 13″ 1.4GHz Intel-based MacBook Pros on sale today for $200-$300 off Apple’s original MSRP with prices starting at only $1099. Expedited shipping is... Read more
Roundup of Today’s Best MacBook Deals: M1 Mac...
Apple resellers are offering sale prices on Apple’s M1-powered 13″ MacBook Airs ranging up to $190 off MSRP. Here’s where to pick one up today, and as always, keep an eye on our 13″ MacBook Air Price... Read more
Apple AirPods Pro drop to new low price of on...
Amazon has Apple’s AirPods Pro on sale today for a new low price of only $197 shipped. That’s $52 off MSRP and the lowest price currently available for a set of AirPods Pro from any Apple reseller.... Read more
Apple restocks clearance 13″ Intel-based MacB...
Apple has clearance, Certified Refurbished, 2020 13″ Intel-based MacBook Airs available starting at only $809 and up to $280 off original MSRP. Each MacBook features a new outer case, comes with a... Read more
OWC drops prices on 2020 Intel multi-core Mac...
Other World Computing has clearance 2020 Intel-based Mac minis on sale starting at only $499. Both 4-core and 6-core models are in stock today. These are new, unopened, factory-sealed minis: – 3.6GHz... Read more
Save $50 off Apple’s 10.9″ iPad Air today at...
B&H Photo has new 10.9″ Apple iPad Airs in stock and on sale today for up to $50 off MSRP. Expedited shipping is free to most addresses in the US. Note that some sale prices may be restricted to... Read more
Rare Apple sale: Get a HomePod mini for $10 o...
Apple reseller Expercom has the Space Gray HomePod mini on sale today for $89 shipped. Their price is $10 off Apple’s MSRP, and it’s currently the only sale price available for a HomePod mini among... Read more

Jobs Board

Systems Architect, *Apple* Production Engin...
…package beginning on your first day? If so, we hope you'll keep reading! The Apple Sales Engineering and account team is looking for a stellar presales engineer with Read more
*Apple* / Macintosh / ADM Systems Administra...
…Administration **Duties and Responsibilities** + Configure and maintain the client's Apple Device Management (ADM) solution. The current solution is JAMF supporting Read more
*Apple* Mobility Specialist - Best Buy (Unit...
**800895BR** **Job Title:** Apple Mobility Specialist **Job Category:** Store Associates **Store Number or Department:** 001776-Woodmore Towne Centre-Store **Job Read more
Geek Squad Advanced Repair *Apple* Professi...
**802113BR** **Job Title:** Geek Squad Advanced Repair Apple Professional **Job Category:** Store Associates **Store Number or Department:** 000399-Wausau-Store Read more
*Apple* Mobility Specialist - Best Buy (Unit...
**802109BR** **Job Title:** Apple Mobility Specialist **Job Category:** Store Associates **Store Number or Department:** 001540-Tuscaloosa-Store **Job Description:** Read more
All contents are Copyright 1984-2011 by Xplain Corporation. All rights reserved. Theme designed by Icreon.