
Digital Media Boost With the Intel Core Duo Processor

Volume Number: 22 (2006)
Issue Number: 8
Column Tag: Performance Optimization

Extracting maximum performance from your applications

by Ron Wayne Green and Ganesh Rao

Introduction

This is the second part of a three-part series on the most effective techniques for optimizing applications for Intel(R) Core(TM) Duo processor-based Apple Macintosh computers. Part one introduced the key aspects of the Intel Core Duo processor and exposed the architectural features for which tuning is most important. It also presented a data-driven performance methodology that uses the software development tools available on an Intel processor-based Apple Macintosh to highlight tuning and optimization opportunities. This article introduces the Intel(R) Digital Media Boost technology of the Intel Core Duo processor, its capabilities, and how a programmer can exploit this computing power. The final part, to come in a future MacTech issue, will take readers to the next level of optimization: taking advantage of both execution cores in the Intel Core Duo processor.

In this article, we examine the Intel Digital Media Boost enhancements to the Streaming SIMD Extension (SSE) features of the Intel Core Duo processor. We also describe how to direct the Intel compilers to leverage these features for optimal application performance. Finally, we will examine inhibitors to the use of these advanced hardware features and how to remove some of these inhibitors. Examples will be illustrated with C++ and Fortran code snippets.

Goal: Integer and Floating Point Calculations

Before we dive into the details of SSE and Intel Digital Media Boost, let's understand our goals for this article. Reviewing our high-level diagram of the Intel Core Duo processor, Figure 1, we see that the processor has two cores. Each core is a full-featured, traditional CPU which includes registers, an instruction pipeline and execution unit, and advanced integer and floating point arithmetic units. Our goal for this article is to focus on a single core (either one, as they are equivalent) and look at the hardware provided to accelerate integer and floating point calculations.



Figure 1: Intel(R) Core(TM) Duo processor architecture

SIMD: A methodology for performing calculations in Parallel

Single-Instruction, Multiple Data (SIMD) is a methodology for performing the same mathematical operation on every element of a data set. Imagine that you have 1,000 elements in each of two rank-1 arrays, or vectors in mathematical terms, and you wish to add the corresponding elements of the arrays. Let's call the operand arrays A and B, and we wish to store the sums in a third array, C, as shown below:

real, dimension(1000) :: a, b, c
do  i=1,1000
   c(i) = a(i) + b(i)
end do

Or the equivalent loop expressed in Fortran 90 array syntax:

c = a + b

For this example, the "Multiple Data" in the term SIMD are the 1,000 items in each array or vector. The "Single Instruction" is the addition operation that we wish to perform on the elements of A and B. With an infinite hardware budget, we could store the A operands in 1,000 registers within the core, store the B operands in another 1,000 registers, feed the registers in parallel into 1,000 addition units which perform the add operation, and finally feed the results of the 1,000 addition units in parallel to 1,000 storage registers - all of this in one instruction cycle. Ah, idealism! In reality, the transistor budget of current-generation silicon does not allow parallelism on this scale. One also has to remember that any registers provided to user processes have to be saved and later restored during context switching. Our goal is simple: we want to load our operands into register files, use parallel arithmetic units to operate on the operands, and feed the results into registers or to memory.
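As a rough sketch of what a scaled-down version of this looks like in C, assuming a compiler that ships the SSE intrinsics header `<xmmintrin.h>`, the addition loop shown earlier can be written so that one instruction adds four floats at a time. The function name `add_arrays` is ours and purely illustrative; the scalar fallback is only there so the sketch still builds on a machine without SSE.

```c
#include <stddef.h>
#if defined(__SSE__)
#include <xmmintrin.h>   /* SSE intrinsics: __m128, _mm_add_ps, ... */
#endif

/* c[i] = a[i] + b[i] for i in [0, n). Hypothetical helper for illustration. */
void add_arrays(const float *a, const float *b, float *c, size_t n)
{
    size_t i = 0;
#if defined(__SSE__)
    /* Main loop: one packed add handles four floats at a time. */
    for (; i + 4 <= n; i += 4) {
        __m128 va = _mm_loadu_ps(a + i);   /* unaligned load of 4 floats */
        __m128 vb = _mm_loadu_ps(b + i);
        _mm_storeu_ps(c + i, _mm_add_ps(va, vb));
    }
#endif
    /* Remainder (or everything, when SSE is unavailable): one at a time. */
    for (; i < n; i++)
        c[i] = a[i] + b[i];
}
```

In practice you would rarely write this by hand; the point of the rest of this article is that the compiler can generate the equivalent instructions from the plain loop.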

There is another term we need to understand before proceeding. Vectorization or Vector Processing is a term that has been used in high performance computing for many years. It refers to a technique to load a set of registers, sometimes called a register file, with operands. After the operands are loaded, a single instruction is used to perform a mathematical operation on the operands. This differs from SIMD in that the mathematical operation specified by the single instruction is performed by sequentially streaming operands from the registers through the arithmetic unit and back into registers - usually with one mathematical operation per clock cycle. In SIMD a single instruction operates in parallel on the dataset. In vector processing, a single instruction operates on the operands in a register file in rapid sequence.

Inherent in vector processing is the assumption that vectors are large. Therefore, caching of this data should be avoided. If cache were used as a conduit between memory and the register file, accessing a large vector would quickly fill the cache and it would spill without reuse of any element of the vector. Thus, large vector streaming memory access patterns see the cache as nothing more than useless overhead. For vector processing, direct memory-to-register or cache-bypass techniques and instructions are typically used. These streaming instructions are part of the SSE instruction set.

SSE is a hybrid of a pure SIMD model and a pure vector model. SSE uses vectorization techniques to stream data directly between memory and the SSE registers (the Streaming component of Streaming SIMD Extensions). These SSE registers act as a register file. However, the SSE registers pack several operands into each 128-bit register and operate on them as a set in a data-parallel SIMD model (the SIMD portion of SSE). For the remainder of this article we will refer to the process of compiling code to take advantage of SSE as vectorization.
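To make the streaming side concrete, here is a minimal sketch, again assuming `<xmmintrin.h>` is available, that fills a large array using the SSE non-temporal (cache-bypass) store `_mm_stream_ps`. That intrinsic requires a 16-byte aligned address, so the hypothetical helper `fill_streaming` peels off scalar iterations until the pointer is aligned. Whether streaming stores actually help depends on the array size and access pattern; treat this as an illustration, not a recommendation.

```c
#include <stddef.h>
#include <stdint.h>
#if defined(__SSE__)
#include <xmmintrin.h>   /* _mm_set1_ps, _mm_stream_ps, _mm_sfence */
#endif

/* Fill dst[0..n) with value. Hypothetical helper for illustration. */
void fill_streaming(float *dst, float value, size_t n)
{
    size_t i = 0;
#if defined(__SSE__)
    /* Peel scalar iterations until dst + i sits on a 16-byte boundary. */
    while (i < n && ((uintptr_t)(dst + i) & 15u) != 0)
        dst[i++] = value;

    __m128 v = _mm_set1_ps(value);       /* four copies of value */
    for (; i + 4 <= n; i += 4)
        _mm_stream_ps(dst + i, v);       /* non-temporal store, bypasses cache */

    _mm_sfence();  /* order the streamed stores before any later stores */
#endif
    /* Remainder (or everything, when SSE is unavailable). */
    for (; i < n; i++)
        dst[i] = value;
}
```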

We need to stop at this point for an important consideration: these techniques are only efficient when an application has enough operands to make the setup costs worthwhile. Setup costs include the time to load the registers with the elements of A and B from memory and the time to unload the elements of C from registers to memory. Looking at the DO loop above, if there are only 5 iterations of the loop (operations on just 5 elements in each of A, B, and C) then the setup costs may exceed the speedup benefit of using the SIMD and vectorization techniques. A loop may also be a poor candidate if it contains too many instructions or conditionals that break down the vectorization within the loop. Loops with too few iterations, or with too many such expressions, are termed inefficient.

Streaming SIMD support in Intel Core Duo Processor

The Intel Core Duo processor supports SIMD and vectorization with dedicated registers, arithmetic hardware, SIMD mathematical instructions to operate on the data in the SSE registers, and streaming (cache bypass) memory load and store instructions. Each core of the Core Duo processor has its own dedicated SIMD hardware. Figure 2 illustrates the SSE hardware available in each of the two cores in the Intel Core Duo processor. This hardware, along with the instructions that drive this special-purpose arithmetic resource, is referred to as Streaming SIMD Extensions, or SSE. SSE was designed and has evolved to accelerate integer and floating-point calculations. And while the intent of SSE was to accelerate common media operations, these same mathematical and data movement operations are applicable to a wide range of applications in technical computing, finance, signal processing, graphics, and gaming, to name a few.



Figure 2: SSE registers and supported data types

SSE operands can be integers, from 1-byte through 8-byte types, both signed and unsigned. Floating point data is supported in 32-bit or 64-bit IEEE format. As shown in Figure 2, the SSE registers are 128 bits wide. Thus, these SSE supported data types are packed within the registers and operated upon in SIMD. Operations on the data can be addition, subtraction, multiplication, division, and, through vectorized library routines, some transcendental functions such as sine and cosine.
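The packing arithmetic is simple enough to sketch in a few lines of C. The helper name `sse_lanes` is ours; it just divides the 16-byte register width by the element size.

```c
#include <stddef.h>

#define SSE_REGISTER_BYTES 16   /* one SSE register is 128 bits wide */

/* Number of elements of the given size that fit in one SSE register.
   Returns 0 for sizes that do not divide the register evenly. */
size_t sse_lanes(size_t elem_bytes)
{
    if (elem_bytes == 0 || SSE_REGISTER_BYTES % elem_bytes != 0)
        return 0;
    return SSE_REGISTER_BYTES / elem_bytes;
}
```

So a register holds sixteen 1-byte integers, eight 2-byte integers, four 32-bit floats, or two 64-bit doubles, which is also the best-case number of elements a single packed instruction can process.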

Enabling Digital Media Boost Vectorization

The Intel(R) Fortran and C++ Compilers for Mac* OS allow the programmer to generate binaries that take full advantage of the Digital Media Boost technology. In fact, the Intel compilers will enable vectorization by default when the compiler is using optimization level 1 and above (compiler options -O1 through -O3). Let's look at how to enable vectorization with the Intel compiler from the Xcode environment. We assume that the reader has installed the Intel Fortran or C++ Compiler for Mac* OS and has read through the chapter Build Applications with Xcode in the Fortran or C++ Compiler Documentation. One suggestion: it is best to keep the optimization settings only in the Release configuration for the target(s), because optimization settings can adversely affect the ability to debug an application.

The first step to enable vectorization is to choose an optimization level of 1 or higher (compiler options -O1, -O2, or -O3). Highlight the target for your project, select Get Info from Action (see Figure 3). This brings up the Target Info window. Again, make sure you are working with the Release configuration for the Target.



Figure 3: Bring up Target Info

For the Collection pull-down, you have two choices. You can view all compiler settings by selecting the Intel(R) C++ (or Fortran) Compiler 9.1 collection (Figure 4). This gives you access to the entire set of compiler options for the Intel C++ or Fortran compiler. Or as another choice, you can select the General collection which is under the Intel C++ or Fortran compiler collection (Figure 5). This collection also has the Optimization settings.



Figure 4: All settings from the compiler collection

Choose any optimization other than None (-O0) and vectorization will be performed by the compiler. For the command line, simply use the compiler options -O1, -O2, or -O3 and you are now taking advantage of the SSE features of the Intel Core Duo processor. Or are you? The next logical question is "how do I know that the compiler vectorized my code?".



Figure 5: Optimization settings under General Collection

This brings us to the question of how we determine whether or not the compiler is vectorizing individual loops. The Intel compilers provide a vectorization report option that provides two kinds of information. First, the vectorization report will inform you which loops within your code are being vectorized. The end result of a vectorized loop is an instruction stream for that loop that contains SSE instructions. This is essential information to verify that the compiler is indeed vectorizing the loops that you expect it to vectorize. Second, and what we find critically important, the report tells you which loops the compiler did NOT vectorize and why. This information assists a programmer by highlighting the barriers that the compiler finds to vectorization.



Figure 6: Enabling the vectorization report

With the Intel compilers, one must enable the vectorization reporting mechanism. It is not enabled by default. Within the Xcode environment, the vectorization report is enabled by selecting one of the vector reports in the setting Vectorizer Diagnostic Report from the Diagnostics collection for the Target (Figure 6). The vectorization report is viewed in the Build Results window. The report follows the compilation for each source file, as shown in Figure 7.



Figure 7: Vectorization report

The vectorization report option, -vec-report=<n>, uses the argument <n> to specify the information presented; from no information at -vec-report=0 to very verbose information at -vec-report=5. The arguments to -vec-report are:

    n=0: No diagnostic information

    n=1: (Default) Loops successfully vectorized

    n=2: Loops successfully vectorized, plus loops not vectorized and the reason why not

    n=3: Adds dependency Information

    n=4: Reports only non-vectorized loops

    n=5: Reports only non-vectorized loops and adds dependency info

Inhibitors to vectorization

The Intel compilers attempt to vectorize loops within the code. However, not all loops can be vectorized. There are too many cases to list in the space of this article. We will examine a few common scenarios where the compiler cannot vectorize a loop.

Outer Loops: When there are nested loops, the vectorization is applied to the innermost loop. Outer loops are never vectorized, so you can expect -vec-report to identify these cases. This can be seen in the output of -vec-report=2 in the example below:

$ ifort -O3 -vec-report=2  -o md md.f
 ...
md.f(212) : (col. 7) remark: loop was not vectorized: not inner loop.
md.f(213) : (col. 9) remark: LOOP WAS VECTORIZED.
    ...
    212       do i = 1,np
    213         do j = 1,nd
    214           pos(j,i) = pos(j,i) + vel(j,i)*dt + 0.5*dt*dt*a(j,i)
    215           vel(j,i) = vel(j,i) + 0.5*dt*(f(j,i)*rmass + a(j,i))
    216           a(j,i) = f(j,i)*rmass
    217         enddo
    218       enddo

In this abbreviated example from a molecular dynamics code, we see from the vectorization report that only the inner loop, the do j=1,nd loop, is vectorized.
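One restructuring that is sometimes worth trying in a case like this, and which is our suggestion rather than something the report prescribes, is to collapse the nest into a single loop. The sketch below is a hypothetical C rendering of the update step. It assumes each array is stored contiguously with np*nd elements and no padding, and that the four arrays do not overlap (hence `restrict`). Because every element is updated independently, one flat loop visits exactly the same elements but gives the compiler a single, long inner loop to work with.

```c
#include <stddef.h>

/* Update np particles with nd coordinates each. All arrays hold np*nd
   contiguous floats. Hypothetical helper, names follow the listing above. */
void update_flat(float *restrict pos, float *restrict vel,
                 float *restrict a, const float *restrict f,
                 size_t np, size_t nd, float dt, float rmass)
{
    size_t n = np * nd;   /* total element count of the collapsed loop */
    for (size_t k = 0; k < n; k++) {
        pos[k] = pos[k] + vel[k] * dt + 0.5f * dt * dt * a[k];
        vel[k] = vel[k] + 0.5f * dt * (f[k] * rmass + a[k]);
        a[k]   = f[k] * rmass;
    }
}
```

This matters most when the inner trip count (nd here) is small, since a short inner loop leaves little work to amortize the setup costs discussed earlier.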

Data Dependencies: To be a candidate for vectorization, a loop cannot contain dependencies between loop iterations. Dependencies occur when a strict ordering of the iterations must be enforced. Consider the following loop:

void scale(float* z) {
 float A; int i;
 A = 42.0f;
 for ( i=1; i<10000; i++ )    /* start at 1 so z[i-1] stays in bounds */
     z[i] = A * z[i-1];  }    /* reads the value written one iteration earlier */

Which when compiled gives:

$ icc -O3 -vec-report=2 -c depend.c
depend.c(4) : (col. 2) remark: loop was not vectorized: existence of vector dependence.  

Examining this, we see that in order to calculate the value to store in z[i] we need to have already calculated the value for z[i-1]. This forces a strict, sequential ordering on the calculations. There are many other interesting cases to consider in dependency analysis, and the reader is encouraged to pursue this further by researching the references at the end of this article.
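For contrast, here is a small sketch that places the dependent loop next to a variant without a loop-carried dependence. Both function names are ours. The second function reads from a separate input array, and the C99 `restrict` qualifier is the programmer's promise that the two arrays do not overlap, which is the kind of information a compiler needs before it can treat the iterations as independent.

```c
#include <stddef.h>

/* Loop-carried dependence: z[i] needs the z[i-1] computed one
   iteration earlier, so the iterations must run in order. */
void scale_recurrence(float *z, size_t n, float A)
{
    for (size_t i = 1; i < n; i++)
        z[i] = A * z[i - 1];
}

/* No dependence between iterations: each output element reads only
   its own input element, and in/out are promised not to overlap. */
void scale_independent(float *restrict out, const float *restrict in,
                       size_t n, float A)
{
    for (size_t i = 0; i < n; i++)
        out[i] = A * in[i];
}
```

Only the second loop is a candidate for vectorization; the first computes a running product and has to be restructured algorithmically, not just recompiled.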

Function and Procedure calls: Another major inhibitor to vectorization is when the loop contains a function or procedure call. Consider this example:

      1 c   Pi:  Compute pi
      2 c
      3 c   Illustrates how to calculate the definite integral
      4 c   of a function f(x).
      5 c
      6 c   We integrate the function:
      7 c         f(x) = 4/(1+x**2)
      8 c   between the limits x=0 and x=1.
      9 c
     10 c   The result should approximate the value of pi.
     11 c   The method is the n-point rectangle quadrature rule.
     12         program computepi
     13         integer           n, i
     14         double precision  sum, pi, x, h, f
     15         external          f
     16         n = 1000000000
     17         h = 1.0d0/n
     18         sum = 0.0d0
     19         do 10 i = 1,n
     20            x = h*(i-0.5d0)
     21           sum = sum + f(x)
     22 10     continue
     23        pi = h*sum
     24        print *, 'pi is approximately : ', pi
     25        end

Within the do loop above, a function call to f(x) is made. In this example, the function f is in a separate source file. The code for f is as follows:

c   fx.f:  Integration function
      double precision function f(x)
      double precision x
      f = 4.0d0/(1.0d0+x*x)
      end

When we attempt to compile these two source files with -vec-report, we get the following:

$ ifort -O3 -vec-report=2 -o pi pi.f fx.f
pi.f(19) : ( col 12 ) remark: loop was not vectorized: contains unvectorizable statement at line 21

Looking at pi.f we see at line 19 there is a loop that is a candidate for vectorization. At line 21 we see the statement sum = sum + f(x). It is the call to the external function f(x) that is the issue. The external function may or may not contain data dependencies, thus the compiler makes the safe decision to not vectorize the loop.

When one sees function or procedure calls within loops as in this example, the next logical step is to attempt to inline the function call. Inlining the function allows the compiler to complete its dependency analysis and often allows vectorization of the loop. With the Intel compilers, options -ip and -ipo perform interprocedural optimizations. One of these optimizations is function inlining. -ip inlines functions or procedures, and performs related optimizations, within a single source file. -ipo is an advanced feature of the Intel compilers. With this option, the compilers are able to find inlining and optimization opportunities across source files, as in this example. In this case, fx.f is a separate file containing the function f(x). Compiling with -ipo gives:

$ ifort -O3 -ipo -vec-report=2 -o pi pi.f fx.f
IPO: performing multi-file optimizations
IPO: generating object file /tmp/ipo_ifort0FmkdQ.o
pi.f(19) : (col. 12) remark: LOOP WAS VECTORIZED.

The non-vectorized program ran in 40 seconds on an iMac with a 1.83 GHz Intel Core Duo processor. The vectorized version took 17 seconds. We need to point out that this was a very trivial case. Deeply nested and complex procedure call trees that are called from within a loop will almost certainly never be inlined.
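The same idea carries over to C. The sketch below is our own C rendering of the pi example, not the authors' code: the integrand is declared `static inline` in the same file as the loop, so a compiler can see its body without any cross-file analysis. Whether the summation loop is then actually vectorized depends on the compiler and its floating-point settings, since vectorizing a sum reorders the additions.

```c
/* Integrand 4/(1+x^2). Defined in the same file as the loop and marked
   static inline so the compiler can substitute its body at the call. */
static inline double f(double x)
{
    return 4.0 / (1.0 + x * x);
}

/* n-point midpoint rectangle rule on [0, 1]; the result approximates pi. */
double compute_pi(long n)
{
    double h = 1.0 / (double)n;   /* width of each rectangle */
    double sum = 0.0;
    for (long i = 1; i <= n; i++) {
        double x = h * ((double)i - 0.5);   /* midpoint of rectangle i */
        sum += f(x);
    }
    return h * sum;
}
```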

Ill-defined loops: Compilers must be able to identify a loop and be able to determine the number of iterations, or trip count. Here are some examples in C and Fortran of loops written in a form that obscures this:

 int i = 0;
 while (i < 100) {          /* trip count hidden in a while test */
     z[i] = x[i+1];
     i += 1;
 }

      I = 1
 100  CONTINUE
      Z(I) = X(I+1)
      I = I + 1
      IF ( I .LE. 100 ) GOTO 100
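A hedged sketch of the form that is easiest for a compiler to recognize: a counted `for` loop whose bounds are fixed before the loop starts and whose index changes only in the loop header. The function name `copy_shifted` and the `restrict` qualifiers (a promise that the arrays do not overlap) are our additions for illustration.

```c
#include <stddef.h>

/* z[i] = x[i+1] for i in [0, n). x must hold at least n+1 elements. */
void copy_shifted(float *restrict z, const float *restrict x, size_t n)
{
    /* Trip count n is known on entry and i is modified nowhere else. */
    for (size_t i = 0; i < n; i++)
        z[i] = x[i + 1];
}
```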

Branching outside of the loop: whenever a conditional branch can leave the loop before its final iteration, the trip count is no longer known in advance, and this can disqualify the loop as a candidate for vectorization:

for ( int i=0; i<100 ; i++ ) {
   z[i] = x[i+1];
   if ( z[i] == 0 ) exit(-1);
}
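One possible workaround, sketched below under our own assumptions, is to split the work in two: a straight-line copy loop with no exits, followed by a separate search loop that carries the early exit. Note that this changes behavior slightly. The original stops copying at the first zero, whereas this version copies everything and then reports where the first zero was, leaving the caller to decide whether to exit. The helper name `copy_then_check` is hypothetical.

```c
#include <stddef.h>

/* Copies x[1..n] into z[0..n-1], then returns the index of the first
   zero in z, or -1 if there is none. x must hold n+1 elements. */
long copy_then_check(float *restrict z, const float *restrict x, size_t n)
{
    /* No branches out of this loop: a candidate for vectorization. */
    for (size_t i = 0; i < n; i++)
        z[i] = x[i + 1];

    /* The early exit lives in its own loop. */
    for (size_t i = 0; i < n; i++)
        if (z[i] == 0.0f)
            return (long)i;
    return -1;
}
```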

Techniques to Improve Vectorization

We've already seen several techniques that improve vectorization. These include writing clearly defined loops that are easy for the compiler to recognize. Since vectorization is performed on inner loops, this is especially critical for the inner loops. Although it runs counter to modular programming techniques, for efficiency it is best to avoid deeply nested procedure calls inside of computational loops. Try to keep procedure calls to one level of nesting if at all possible. And although we did not mention this earlier, it is much easier for compilers to recognize inlining opportunities when functions and procedures are within the same source file. However, as we've seen, if you must have the functions in separate source files, make sure you use the interprocedural optimization compiler switch, -ipo, provided by the Intel(R) Fortran Compiler and Intel(R) C++ Compiler for Mac OS.

Also, instead of writing your own version of mathematical functions, use vectorized versions of libraries where available. As an example, the Intel Compilers for Mac OS ship with a short vector math library, libsvml. This library has vectorized versions of common math functions normally found in libm. The functions in libsvml include the common transcendental functions sin/cos/tan and asin/acos/atan, as well as exp/pow and ln/log10. In addition, the Intel compilers provide optimized memcpy and memcmp functions, which are also quite prevalent throughout any application. When using the Intel Compilers, this vectorized library will link prior to libm. Thus you will automatically link in vectorized versions of these common functions. Just remember to use the Intel drivers (icc/icpc/ifort) for compiling and linking and do NOT add -lm to the link arguments.

Finally, for more sophisticated mathematical, encryption, image processing and statistical functions, Intel provides two other library products. The Intel(R) Math Kernel Library (Intel(R) MKL) for Mac OS provides BLAS, FFT, and vectorized statistical libraries. These library functions are highly tuned and optimized to take maximum advantage of the Digital Media Boost technology of the Intel Core Duo processor. In addition to using SSE, these libraries are also multi-threaded to take advantage of both cores in the Intel Core Duo processor. Customers performing data compression, encryption, video encoding/decoding and speech processing will want to consider the Intel(R) Integrated Performance Primitives (Intel(R) IPP). Intel IPP routines are also highly tuned to utilize SSE.

Summary

The Streaming SIMD Extensions (SSE) architectural features of the Intel Core Duo processor enable integer and floating point acceleration for applications. SSE is a hybrid of traditional SIMD and vector processing methodologies. The Intel Fortran Compiler and Intel C++ Compiler refer to these techniques as vectorization. With the Intel Fortran Compiler and Intel C++ Compiler for Mac* OS, vectorization is enabled by default at optimization level 1 (-O1) and above. The Intel compilers also feature vectorization reporting via the -vec-report compiler option. Not only will the report list the location of loops vectorized, it will also list the locations of loops that were not vectorized and explain why it did not vectorize those loops. These hints enable the programmer to identify vectorization inhibitors which can often be removed, leading to substantial performance improvements.

Further Reading

A good place to start learning about SSE and advanced optimizations is the Optimizing Applications chapter of the Intel C++ Compiler or Intel Fortran Compiler documentation, which comes with the Intel Compilers for Mac OS. The SSE features of the Intel Core Duo processor are rich and extensive, so much so that a full treatment of this topic requires an entire book. The definitive guide to software vectorization and SSE is The Software Vectorization Handbook, Aart J.C. Bik, Intel Press, ISBN 0-9743649-2-4. If you are a programmer moving code that uses AltiVec instructions from older Apple machines, there are some excellent resources covering AltiVec to SSE migration to be found on Apple's developer website (ADC).


Both authors are members of the Intel Compiler team. Ganesh Rao has been with Intel for over nine years and currently helps optimize applications to take advantage of the latest Intel(R) processors using the Intel(R) compilers.

Ron Wayne Green has been involved in Fortran and high-performance computing applications development and support for over twenty years and contributes to Fortran and technical computing issues.

 

All contents are Copyright 1984-2011 by Xplain Corporation. All rights reserved.