TweetFollow Us on Twitter

Fast Square Root Calc

Volume Number: 14 (1998)
Issue Number: 1
Column Tag: Assembler Workshop

Fast Square Root Calculation

by Guillaume Bédard, Frédéric Leblanc, Yohan Plourde and Pierre Marchand, Québec, Canada

Optimizing PowerPC Assembler Code to beat the Toolbox

Introduction

The calculation of the square root of a floating-point number is a frequently encountered task. However, the PowerPC processors don't have a square root instruction. The implementation presented here performs the square root of a double-precision number over the full range of representation of the IEEE 754 standard for normalized numbers (from 2.22507385851E-308 to 1.79769313486E308) with an accuracy of 15 or more decimal digits. It is very fast, at least six times faster than the Toolbox ROM call.

Theory of Operation

A floating point number has three components: a sign, a mantissa and an exponential part. For example, the number +3.5 x 10^4 (35 000) has a plus sign, a mantissa of 3.5 and an exponential part of 104. The mantissa consists of an integer part and a fractional part f.

A double precision number in IEEE 754 format has the same components: a sign bit s, an 11-bit exponent e and a 52 bit fraction f. The exponential part is expressed in powers of 2 and the exponent is biased by adding 1023 to the value of e. The mantissa is normalized to be of the form 1.f. Since the integer part of the normalized mantissa is always 1, it doesn't have to be included in the representation. The number is thus represented as follows: (-1)s x 1.f x 2^e+1023.

For example, the number 5.0 can be expressed in binary as 101.0, which means 101.0 x 2^0, which in turn is equal to 1.010 x 2^2, obtained by dividing the mantissa by 4 and multiplying 2^0 by 4. Therefore, the normalized mantissa is 1.010 and the exponent 2. Fraction f is then .01000000.... The biased exponent is obtained by adding 1023 to e and is 1025, or 10000000001 in binary. The double precision IEEE representation of 5.0 is finally:

-----------------------------------------------------------------------------
|  0 | 10000000001 | 01000000000000000000000000000000000000000000000000000  |
-----------------------------------------------------------------------------

or 4014000000000000 in hexadecimal notation for short.

First Approximation

Given this representation, a first approximation to the square root of a number is obtained by dividing the exponent by 2. If the number is an even power of 2 such as 16 or 64, the exact root is obtained. If the number is an odd power of 2 such as 8 or 32, 1/SQRT(2) times the square root is obtained. In general, the result will be within a factor SQRT(2) of the true value.

Refining the Approximation

The Newton-Raphson method is often used to obtain a more accurate value for the root x of a function f(x) once an initial approximation x0 is given:

[1]

This becomes, in the case of the square root of n, = x2 - n:where f(x)

[2]

An excellent approximation to the square root starting with the initial approximation given above is obtained within 5 iterations using equation [2]. This algorithm is already pretty fast, but its speed is limited by the fact that each iteration requires a double-precision division which is the slowest PowerPC floating-point instruction with 32 cycles on the MPC601 (Motorola, 1993).

Eliminating Divisions

Another approach is to use equation [1] with the function.

In this case, equation [1] becomes:

[3]

There is still a division by n, but since n is constant (it's the original number whose root we want to find), it can be replaced by multiplying by 1/n, which can be calculated once before the beginning of the iteration process. The five 32-cycle divisions are thus replaced by this single division followed by 5 much faster multiplications (5 cycles each). This approach is approximately three times faster than the preceding one. However, care must be taken for large numbers since the term in x02 can cause the operation to overflow.

Use of a table

Finally, an approach that is even faster consists in using a table to obtain a more accurate first approximation. In order to do so, the range of possible values of fraction f (0 to ~1) is divided into 16 sub-ranges by using the first 4 bits of f as an index into a table which contains the first two coefficients of the Taylor expansion of the square root of the mantissa (1.0 to ~2) over that sub-range.

The Taylor expansion is given in general by:

[4]

the first two terms of which yield, in the case where f(x) = SQRT(x):

[5]

The square root of x is thus approximated by 16 straight-line segments. The table therefore contains the values of

A =

and B =

for each of the 16 sub-ranges as shown in Figure 1. This first approximation gives an accuracy of about 1.5 %.

Figure 1. Approximation by straight line segment.

To reach the desired accuracy of 15 digits, equation [2] is applied twice to the result of equation [5]. To avoid having to perform two divisions by repeating the iteration, the two iterations are folded together as follows, which contains only one division:

and

[6]

In order to perform these calculation, the exponent of x and n is reduced to -1 (1022 biased), so that floating-point operations apply only to the values of the mantissa and don't overflow if the exponent is very large. The value of these numbers will therefore be in the range 0.5 to 1.0 since the mantissa is in the range 1.0 to 2.0. If the original exponent was odd, the mantissa is multiplied by SQRT(2) before applying equation [6].

Finally, the original exponent divided by two is restored at the end.

The Code

The SQRoot function shown in Listing 1 has been implemented in CodeWarrior C/C++ version 10.

Listing 1: SQRoot.c

// On entry, fp1 contains a positive number between 2.22507385851E-308
// and 1.79769313486E308. On exit, the result is in fp1.

asm long double SQRoot(long double num);   // prototype

float Table[35] = {
0.353553390593, 0.707106781187, 0.364434493428, 0.685994340570,
0.375000000000, 0.666666666667, 0.385275875186, 0.648885684523,
0.395284707521, 0.632455532034, 0.405046293650, 0.617213399848,
0.414578098794, 0.603022689156, 0.423895623945, 0.589767824620,
0.433012701892, 0.577350269190, 0.441941738242, 0.565685424949,
0.450693909433, 0.554700196225, 0.459279326772, 0.544331053952,
0.467707173347, 0.534522483825, 0.475985819116, 0.525225731439,
0.484122918276, 0.516397779494, 0.492125492126, 0.508000508001,
1.414213562373, 0.000000000000, 0.000000000000 };

asm long double Sqrt(long double num) {

   lwz   r3,Table(rtoc)      // address of Table[]
   lhz   r4,24(sp)           // load
                             // Sign(1)+Exponent(11)+Mantissa(4)
   andi.   r5,r4,0xF         // keep only Mantissa(4)
   ori   r5,r5,0x3FE0        // exponent = -1+BIAS = 1022
   sth   r5,24(sp)           // save reduced number

   rlwinm   r5,r5,3,25,28    // take 8*Mantissa(4) as index
   lfd   fp1,24(sp)          // load reduced number
   lfsux   fp4,r5,r3         // load coefficient A
   lfs   fp5,4(r5)           // load coefficient B
   lfs   fp3,128(r3)         // load SQRT(2)
   fmr   fp2,fp1             // copy reduced number
   rlwinm.   r5,r4,31,18,28  // divide exponent by 2
   beq   @@2                 // if (exponent == 0) then done

   fmadd   fp2,fp2,fp5,fp4   // approximation SQRT(x) = A + B*x
   andi.   r4,r4,0x10        // check if exponent even
   beq   @@1                 // if (exponent even) do iteration
   fmul   fp2,fp2,fp3        // multiply reduced number by SQRT(2)
   fadd   fp1,fp1,fp1        // adjust exponent of original number

@@1:   fadd   fp3,fp2,fp2    // 2*x
   fmul   fp5,fp2,fp1        // x*n
   fadd   fp3,fp3,fp3        // 4*x
   fmadd   fp4,fp2,fp2,fp1   // x*x + n
   fmul   fp5,fp3,fp5        // 4*x*x*n
   fmul   fp6,fp2,fp4        // denominator = x*(x*x + n)
   fmadd   fp5,fp4,fp4,fp5   // numerator = (x*x + n)*(x*x + n) +
                             // 4*x*x*n
   fdiv   fp1,fp5,fp6        // double precision division
   andi.   r5,r5,0x7FF0      // mask exponent 
   addi   r5,r5,0x1FE0       // rectify new exponent

@@2:   sth   r5,132(r3)      // save constant C (power of 2) 
   lfd   fp2,132(r3)         // load constant C
   fmul   fp1,fp1,fp2        // multiply by C to replace exponent
   blr                       // done, the result is in fp1
}

Performance

The code presented above runs in less than 100 cycles, which means less than 1 microsecond on a 7200/75 Power Macintosh and is more than six times faster than the ROM code. The code could be modified to make use of the floating reciprocal square root estimate instruction (frsqrte) that is available on the MPC603 and MPC604 processors, and which has an accuracy of 5 bits. It is not available on the MPC601, however. The method used here could also be used to evaluate other transcendental functions.

Performance was measured by running the code a thousand times and calling a simple timing routine found in (Motorola, 1993), that we called myGetTime(). It uses the real-time clock of the MPC 601 processor (RTCU and RTCL registers) and is shown in Listing 2. The routine would have to be modified to run on MPC603 or MPC604 processors, since they don't have the same real-time clock mechanism.

The code doesn't support denormalized numbers (below 2.22507385851E-308). This could easily be implemented albeit at the cost of a slight reduction in performance.

Listing 2: myGetTime.c

asm long myGetTime()
   {
lp:    mfspr   r4,4           // RTCU
   mfspr   r3,5               // RTCL
   mfspr   r5,4               // RTCU again
    cmpw      r4,r5           // if RTCU has changed, try again
    bne      lp
    rlwinm   r3,r3,25,7,31    // shift right since bits 25-31 are
                              // not used
    blr                       // the result is in r3. 1 unit is
                              // worth 128 ns.
   }

To run the code, a very simple interface using the SIOUX library is provided in Listing 3.

Listing 3: main.c

#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <fp.h>

void main()
   {
   long double   num, num2;
   long startTime, endTime, time;
   short i;


   do {
   printf("%2s","> ");           // caret
   scanf("%Lf",&num);           // read long double
   if (num < 0.0) num = 0.0;     // replace by 0.0 if negative
   startTime = myGetTime();
   for (i = 0; i < 1000; i++)    // repeat 1000 times
   num2 = SQRoot(num);              // call our function
   endTime = myGetTime();
   time = endTime - startTime;
   if (num > 1e-6 && num < 1e7)
      printf("%7s%Lf\n","root = ",num2);   // show result
   else
      printf("%7s%Le\n","root = ",num2);
   printf("%7s%d\n","time = ", time);      // show elapsed time
   }
   while (1);                              // repeat until Quit
   }

References

PowerPC 601 RISC Microprocessor User's Manual, Motorola MPC601UM/AD Rev 1, 1993.


The first three authors are undergraduate students in Computer Science at Université Laval in Québec, Canada. This work was done as an assignment in a course on Computer Architecture given by the fourth author.

 

Community Search:
MacTech Search:

Software Updates via MacUpdate

Final Cut Pro 10.6.4 - Professional vide...
Redesigned from the ground up, Final Cut Pro combines revolutionary video editing with a powerful media organization and incredible performance to let you create at the speed of thought.... Read more
iMovie 10.3.4 - Edit personal videos and...
With a streamlined design and intuitive editing features, iMovie lets you create Hollywood-style trailers and beautiful movies like never before. Browse your video library, share favorite moments,... Read more
Motion 5.6.2 - Create and customize Fina...
Motion is designed for video editors, Motion 5 lets you customize Final Cut Pro titles, transitions, and effects. Or create your own dazzling animations in 2D or 3D space, with real-time feedback as... Read more
iMazing 2.15.8 - Complete iOS device man...
iMazing is the world’s favourite iOS device manager for Mac and PC. Millions of users every year leverage its powerful capabilities to make the most of their personal or business iPhone and iPad.... Read more
VueScan 9.7.90 - Scanner software with a...
VueScan is a scanning program that works with most high-quality flatbed and film scanners to produce scans that have excellent color fidelity and color balance. VueScan is easy to use, and has... Read more
Compressor 4.6.2 - Adds power and flexib...
Compressor adds power and flexibility to Final Cut Pro X export. Customize output settings, work faster with distributed encoding, and tap into a comprehensive set of delivery features. Features:... Read more
Capture One 15.3.2.11 - RAW workflow sof...
Capture One is a professional RAW converter offering you ultimate image quality with accurate colors and incredible detail from more than 400 high-end cameras - straight out of the box. It offers... Read more
Vivaldi 5.4.2753.28 - An advanced browse...
Vivaldi is a browser for our friends. We live in our browsers. Choose one that has the features you need, a style that fits and values you can stand by. From the look and feel, to how you interact... Read more
Parallels Desktop 18.0.0 - Run Windows a...
Parallels allows you to run Windows and Mac applications side by side. Choose your view to make Windows invisible while still using its applications, or keep the familiar Windows background and... Read more
TechTool Pro 16.0.1 - Hard drive and sys...
TechTool Pro has long been one of the foremost utilities for keeping your Mac running smoothly and efficiently. With the release of this version, it has become more proficient than ever. Main... Read more

Latest Forum Discussions

See All

Turn-Based RPG ‘Avatar: Generations’ Sof...
Square Enix London Mobile, Navigator Games, and Paramount Consumer Products just announced that the turn-based RPG Avatar: Generations based on Nickelodeon’s Avatar: The Last Airbender is soft launching this month for mobile. Avatar: Generations is... | Read more »
Tower of Fantasy launches today and brin...
Level Infinite and Hotta Studio have announced the release of their very ambitious looking shared open world MMORPG Tower of Fantasy. With its cross-platform functionality between PC and mobile, it looks to be one to roll the dice on and enjoy at... | Read more »
‘Genshin Impact’ Version 3.0 Gets a New...
After HoYoverse released Genshin Impact (Free) version 2.8 on all platforms, the company has slowly been teasing the major upcoming 3.0 update. This update features the Sumeru region with many characters. While details on the update including a... | Read more »
Out Now: ‘Tower of Fantasy’, ‘Tightrope...
Each and every day new mobile games are hitting the App Store, and so each week we put together a big old list of all the best new releases of the past seven days. Back in the day the App Store would showcase the same games for a week, and then... | Read more »
SwitchArcade Round-Up: ‘Book Quest’, ‘Cl...
Hello gentle readers, and welcome to the SwitchArcade Round-Up for August 10th, 2022. In today’s article, we’ve got a little news about an update to a game I really like, a few new releases to summarize, and some sales to look at. A bit of a quiet... | Read more »
‘Pine Tar Poker’ is an Otherworldly Poke...
Developer BJ Malicoat, who put out the well-received and former Apple Game of the Day pick Downwordly in June of last year, is back working on another mobile game project called Pine Tar Poker, and it has caught my attention. Why? Because it’s a... | Read more »
Darkness Rises celebrates four years of...
Four years of uptime for a mobile game is akin to eternity, and this is exactly the milestone that Darkness Rises has reached. It is important for developers to keep updating to keep the game fresh, and NEXON has announced a massive anniversary... | Read more »
Keep Your Smatphone’s Case On When Using...
The original Gamevice was born as a sort of offshoot of the weird Wikipad gaming tablet/controller/hybrid thing way back in 2014. Interestingly, the first Gamevice controller for iOS only supported the iPad mini and launched in 2015, with versions... | Read more »
SwitchArcade Round-Up: A ‘Splatoon 3’ Ni...
Hello gentle readers, and welcome to the SwitchArcade Round-Up for August 9th, 2022. In today’s article, we’ve got some news about a Splatoon 3 Nintendo Direct, a review of QUByte’s Thunderbolt Collection, a single new release summary, and the usual... | Read more »
Orangepixel’s Pacifist Survival Game ‘Re...
Back in June we learned that long-time mobile developer Orangepixel, who also makes games for PC and consoles (including the Atari VCS!), would be bringing the unique survival game Residual to mobile devices sometime this year. Originally launched... | Read more »

Price Scanner via MacPrices.net

Apple has 24-inch M1 iMacs available starting...
Apple has 24-inch M1 iMacs with M1 CPUs (8-core CPU/7-core GPU) available today in their Certified Refurbished store for $1099 shipped. Their price is $200 off standard MSRP. Each iMac is in like-new... Read more
13″ M1 MacBook Airs in stock today for $799,...
QuickShip Electronics has open-box return 13″ M1 MacBook Airs in stock and on sale for $200 off MSRP on their eBay store right now, each with free express delivery. According to QuickShip, “The item... Read more
In stock today: Mac Studio models for up to $...
Apple retailer Expercom has Mac Studio models in stock today and on sale for up to $400 off Apple’s MSRP, depending on configuration. Their prices are the lowest price available for a Mac Studio from... Read more
Mac mini with M1 CPU and 512GB of storage on...
Amazon has the M1 Mac mini with a 512GB SSD in stock today on sale for $749.99 including free shipping. Their price is $150 off Apple’s MSRP, and it’s the lowest price available for this... Read more
Need a Mac or iPad for school? Get a free App...
Apple’s Back to School promotion for 2022 continues to run through September 26, 2022. As part of this promotion, Apple will include a free $150 Apple Gift Card with the purchase of any MacBook Air,... Read more
Apple Watch SE on sale for $50 off MSRP
Amazon has Apple Watch SE GPS models on sale for $50 off MSRP for a limited time, each including free shipping. Their prices are the lowest currently available for SE Watches: – 40mm Apple Watch SE... Read more
Save $310 on a 14″ 24-core GPU M1 Max MacBook...
Save $310 on 14″ MacBook Pros with 24-core M1 Max processors at Apple (32GB RAM/1TB SSD) with these Certified Refurbished models in stock today for $2789 in Space Gray or Silver colors. Regular price... Read more
14″ M1 Pro MacBook Pros available today at Ap...
Apple has Certified Refurbished standard-configuration 14″ MacBook Pros with M1 Pro CPUs available today for up to $250 off original MSRP, starting at $1799. Each model features a new outer case,... Read more
13″ MacBook Air with M2 CPU, in Starlight, on...
Apple retailer Expercom has the new Starlight 13″ MacBook Air with an M2 CPU (8GB RAM/256GB SSD) on sale for $1135.05, shipped, through August 12, 2022. Their price is $64 off Apple’s MSRP, and it’s... Read more
14″ M1 Pro MacBook Pro with 1TB SSD on sale f...
Expercom is offering a $200 instant discount on the 14″ M1 Pro MacBook Pro with a 1TB SSD through August 12, 2022. Their discount reduces the price of this configuration to $1999 shipped — the lowest... Read more

Jobs Board

Solutions Engineering Manager - *Apple* - S...
…in our Hardware and Advanced Solutions group leading and developing our Apple technical practice to increase revenue and profitability. The ideal candidate would Read more
Operations Associate - *Apple* Blossom Mall...
Operations Associate - Apple Blossom Mall Location:Winchester, VA, United States (https://jobs.jcp.com/jobs/location/191170/winchester-va-united-states) - Apple Read more
Cashier - *Apple* Blossom Mall - JCPenney (...
Cashier - Apple Blossom Mall Location:Winchester, VA, United States (https://jobs.jcp.com/jobs/location/191170/winchester-va-united-states) - Apple Blossom Mall Read more
Omnichannel Associate - *Apple* Blossom Mal...
Omnichannel Associate - Apple Blossom Mall Location:Winchester, VA, United States (https://jobs.jcp.com/jobs/location/191170/winchester-va-united-states) - Apple Read more
Sephora Beauty Advisor - *Apple* Blossom Ma...
Sephora Beauty Advisor - Apple Blossom Mall Location:Winchester, VA, United States (https://jobs.jcp.com/jobs/location/191170/winchester-va-united-states) - Apple Read more
All contents are Copyright 1984-2011 by Xplain Corporation. All rights reserved. Theme designed by Icreon.