TweetFollow Us on Twitter

Digital Media Boost With the Intel Core Duo Processor

Volume Number: 22 (2006)
Issue Number: 8
Column Tag: Performance Optimization

Digital Media Boost With the Intel Core Duo Processor

Extracting maximum performance from your applications

by Ron Wayne Green and Ganesh Rao

Introduction

This is the second part of a three part series that addresses the most effective techniques to optimize applications for Intel(R) Core(TM) Duo processor-based Apple Macintosh computers. Part one of this series introduced the key aspects of the Intel Core Duo processor and exposed the architectural features for which tuning is most important. Also presented in that first article was a data-driven performance methodology using the software development tools available on an Intel processor-based Apple Macintosh to highlight tuning and optimization opportunities. This article, the second part of this 3-part series, introduces the Intel(R) Digital Media Boost technology of the Intel Core Duo processor, its capabilities, and how a programmer can exploit this computing power. The final part of this three-part series to come in a future MacTech issue will provide readers with the next level of optimization - taking advantage of both execution cores in the Intel Core Duo processor.

In this article, we examine the Intel Digital Media Boost enhancements to the Streaming SIMD Extension (SSE) features of the Intel Core Duo processor. We also describe how to direct the Intel compilers to leverage these features for optimal application performance. Finally, we will examine inhibitors to the use of these advanced hardware features and how to remove some of these inhibitors. Examples will be illustrated with C++ and Fortran code snippets.

Goal: Integer and Floating Point Calculations

Before we dive into the details of SSE and Intel Digital Media Boost, let's understand our goals for this article. Reviewing our high-level diagram of the Intel Core Duo processor, Figure 1, we see that the processor has two cores. Each core is a full-feature, tradition CPU which includes registers, instruction pipeline and execution unit, and advanced integer and floating point arithmetic units. Our goal for this article is to focus on a single core (either one as they are equivalent) and look at the hardware provided to accelerate integer and floating point calculations.



Figure 1: Intel(R) Core(TM) Duo processor architecture

SIMD: A methodology for performing calculations in Parallel

Single-Instruction, Multiple Data (SIMD) is a methodology for performing the same mathematical operation on a data set. Imagine that you have 1,000 elements in two rank-1 arrays, or vectors in mathematical terms, and you wish to add the elements of the arrays. Let's call the operand arrays A and B, and we wish to store them in a third array, C, as shown below:

real, dimension(1000) :: a, b, c
do  i=1,1000
   c(i) = a(i) + b(i)
end do

Or the equivalent loop expressed in Fortran 90 array syntax:

c = a + b

For this example, the "Multiple Data" the term SIMD are the 1,000 items in each array or vector. The "Single Instruction" is the addition operation that we wish to perform on the elements of A and B. With an infinite hardware budget, we could store the A operands in 1,000 registers within the core, store the B operands in another 1,000 registers, feed the registers in parallel into 1,000 addition units which perform the add operation, and finally feed the results of the 1,000 addition units in parallel to 1,000 storage registers - all of this in one instruction cycle. Ah, idealism! Reality is that our transistor budget on our current generation silicon does not allow parallelism on this scale. Also one has to remember that any registers you provide to user processes have to be saved off and later restored during context switching. Our goal is simple: We want to load our operands into register files, use parallel arithmetic units to operate on the operands, and feed the results into registers or to memory.

There is another term we need to understand before proceeding. Vectorization or Vector Processing is a term that has been used in high performance computing for many years. It refers to a technique to load a set of registers, sometimes called a register file, with operands. After the operands are loaded, a single instruction is used to perform a mathematical operation on the operands. This differs from SIMD in that the mathematical operation specified by the single instruction is performed by sequentially streaming operands from the registers through the arithmetic unit and back into registers - usually with one mathematical operation per clock cycle. In SIMD a single instruction operates in parallel on the dataset. In vector processing, a single instruction operates on the operands in a register file in rapid sequence.

Inherent in vector processing is the assumption that vectors are large. Therefore, caching of this data should be avoided. If cache were used as a conduit between memory and the register file, accessing a large vector would quickly fill the cache and it would spill without reuse of any element of the vector. Thus, large vector streaming memory access patterns see the cache as nothing more than useless overhead. For vector processing, direct memory-to-register or cache-bypass techniques and instructions are typically used. These streaming instructions are part of the SSE instruction set.

SSE is a hybrid of a pure SIMD model and a pure vector model. SSE uses vectorization techniques to stream data directly from memory to and from SSE registers (the Streaming component of Streaming SIMD Extensions). These SSE registers act as a register file. However, the SSE registers pack several operands into each 128 bit register and operate on them as a set in a data-parallel SIMD model (the SIMD portion of SSE). For the remainder of this article we will refer to the process of compiling code to take advantage of SSE as vectorization.

We need to stop at this point for an important consideration: These techniques are only efficient when an application has enough operands to make the setup costs worthwhile. Setup costs include the time to load the registers with the elements of A and B from memory and the time to unload the elements of C from registers to memory. Looking at the DO loop above, if there are only 5 iterations of the loop ( operations on just 5 elements in each of A, B, and C) then the setup costs may exceed the speedup benefit of using the SIMD and vectorization techniques. Also, a loop may not be efficient if it contains too many instructions or conditionals that will break down the vectorization within the loop. Loops without enough iterations or with too many expressions that will break down the vectorization are termed inefficient.

Streaming SIMD support in Intel Core Duo Processor

The Intel Core Duo processor supports SIMD and vectorization with dedicated registers, arithmetic hardware, SIMD mathematical instructions to operate on the data in the SSE registers, and streaming (cache bypass) memory load and store instructions. Each core of the Core Duo processor has it's own dedicated SIMD hardware. Figure 2 illustrates the SSE hardware available in each core of the two cores in the Intel Core Duo processor. This hardware, along with the instructions that drive this special-purpose arithmetic resource is referred to as Streaming SIMD Extension, or SSE. SSE was designed and has evolved to accelerate integer and floating-point calculations. And while the intent of SSE was to accelerate common media operations, these same mathematical and data movement operations are applicable to a wide range of applications in technical computing, finance, signal processing, graphics, and gaming to name a few.



Figure 2: SSE registers and supported data types

SSE operands can be integer: from 1 byte through 8 byte integer types both signed and unsigned. Floating point data is supported in 32 or 64 bit IEEE format. As shown in Figure 2, the SSE registers are 128bits wide. Thus, these SSE supported data types are packed within the registers and operated upon in SIMD. Figure Operations on the data can be addition, subtraction, multiplication, division, and some transcendental functions such as sine and cosine.

Enabling Digital Media Boost Vectorization

The Intel(R) Fortran and C++ Compilers for Mac* OS allow the programmer to generate binaries that take full advantage of the Digital Media Boost technology. In fact, the Intel compilers will enable vectorization by default when the compiler is using optimization level 1 and above ( compiler options -O1 through -O3). Let's look at how to enable vectorization with the Intel compiler from the Xcode environment. We assume that the reader has installed the Intel Fortran or C++ Compiler for Mac* OS and has read through the chapter Build Applications with Xcode in the Fortran or C++ Compiler Documentation. One suggestion: it is best to keep the settings for optimization only in the Release configuration for the target(s). Optimization settings can adversely affect the ability to debug an application.

The first step to enable vectorization is to choose an optimization level of 1 or higher (compiler options -O1, -O2, or -O3). Highlight the target for your project, select Get Info from Action (see Figure 3). This brings up the Target Info window. Again, make sure you are working with the Release configuration for the Target.



Figure 3: Bring up Target Info

For the Collection pull-down, you have two choices. You can view all compiler settings by selecting the Intel(R) C++ (or Fortran) Compiler 9.1 collection (Figure 4). This gives you access to the entire set of compiler options for the Intel C++ or Fortran compiler. Or as another choice, you can select the General collection which is under the Intel C++ or Fortran compiler collection (Figure 5). This collection also has the Optimization settings.



Figure 4: All settings from the compiler collection

Choose any optimization other than None (-O0) and vectorization will be performed by the compiler. For the command line, simply use the compiler options -O1, -O2, or -O3 and you are now taking advantage of the SSE features of the Intel Core Duo processor. Or are you? The next logical question is "how do I know that the compiler vectorized my code?".



Figure 5: Optimization settings under General Collection

This brings us to examine how we determine whether or not the compiler is vectorizing individual loops. The Intel compilers provide a vectorization report option that provides two kinds of information: First, the vectorization report will inform you which loops within your code are being vectorized. The end result of a vectorized loop is an instruction stream for that loop that contains SSE instructions. This is essential information to verify that the compiler is indeed vectorizing the loops within the code that you expect it to vectorize. Secondly and what we find critically important, is report information about why the compiler did NOT vectorize a loop and why it did not vectorize a loop. This information assists a programmer by highlighting the barriers that the compiler finds to vectorization.



Figure 6: Enabling the vectorization report

With the Intel compilers, one must enable the vectorization reporting mechanism. It is not enabled by default. Within the Xcode environment, the vectorization report is enabled by selecting one of the vector reports in the setting Vectorizer Diagnostic Report from the Diagnostics collection for the Target (Figure 6). The vectorization report is viewed in the Build Results window. The report follows the compilation for each source file, as shown in Figure 7



Figure 7: Vectorization report

The vectorization report option, -vec-report=<n>, uses the argument <n> to specify the information presented; from no information at -vec-report=0 to very verbose information at -vec-report=5. The arguments to -vec-report are:

    n=0: No diagnostic information

    n=1: (Default) Loops successfully vectorized

    n=2: Loops not vectorized - and the reason why not

    n=3: Adds dependency Information

    n=4: Reports only non-vectorized loops

    n=5: Reports only non-vectorized loops and adds dependency info

Inhibitors to vectorization

The Intel compilers attempt to vectorize loops within the code. However, not all loops can be vectorized. There are too many cases to list in the space of this article. We will examine a few common scenarios where the compiler cannot vectorize a loop.

Outer Loops: When there are nested loops, the vectorization is applied to the innermost loop. Outer loops are never vectorized, so you can expect -vec-report to identify these cases. This can be seen by the output of vec-report=3 in the example below:

$ ifort -O3 -vec-report=2  -o md md.f
 ...
md.f(212) : (col. 7) remark: loop was not vectorized: not inner loop.
md.f(213) : (col. 9) remark: LOOP WAS VECTORIZED.
    ...
    212       do i = 1,np
    213         do j = 1,nd
    214           pos(j,i) = pos(j,i) + vel(j,i)*dt + 0.5*dt*dt*a(j,i)
    215           vel(j,i) = vel(j,i) + 0.5*dt*(f(j,i)*rmass + a(j,i))
    216           a(j,i) = f(j,i)*rmass
    217         enddo
    218       enddo

In this abbreviated example from a molecular dynamics code, we see from the vectorization report that only the inner loop, the do j=1,nd loop, is attempted to be vectorized.

Data Dependencies: In order to be candidates for vectorization, a loop cannot contain dependencies between loop interations. Dependencies occur when a strict ordering of the iterations must be enforced. Consider the following loop:

void scale(float* z) {
 float A; int i;
 A = 42.0; 
 for ( i=0; i<10000; i++ )
     z[i] = A * z[i-1];  }

Which when compiled gives:

$ icc -O3 -vec-report=2 -c depend.c
depend.c(4) : (col. 2) remark: loop was not vectorized: existence of vector dependence.  

Examining this, we see that in order to calculate the value to store in z[i] we need to have already calculated the value for z[i-1]. This forces a strict, sequential ordering to when the calculations must be performed. There are many other interesting cases to consider in dependency analysis and the reader is encourage to pursue this further by researching some of the references at the end of this article.

Function and Procedure calls: Another major inhibitor to vectorization is when the loop contains a function or procedure call. Consider this example:

      1 c   Pi:  Compute pi
      2 c
      3 c   Illustrates how to calculate the definite integral
      4 c   of a function f(x).
      5 c
      6 c   We integrate the function:
      7 c         f(x) = 4/(1+x**2)
      8 c   between the limits x=0 and x=1.
      9 c
     10 c   The result should approximate the value of pi.
     11 c   The method is the n-point rectangle quadrature rule.
     12         program computepi
     13         integer           n, i
     14         double precision  sum, pi, x, h, f
     15         external          f
     16         n = 1000000000
     17         h = 1.0/n
     18         sum = 0.0
     19         do 10 i = 1,n
     20            x = h*(i-0.5)
     21           sum = sum + f(x)
     22 10     continue
     23        pi = h*sum
     24        print *, 'pi is approximately : ', pi
     25        end

Within the do loop above, a function call to f(x) is made. In this example, the function f is in a separate source file. The code for f is as follow:

c   fx.f:  Integration function
   double precision function f(x)
     double precision x
        f =  (4/(1+x*x))
     end

When we attempt to compile these two source file with -vec-report, we get the following:

$ ifort -O3 -vec-report=2 -o pi pi.f fx.f
Pi.f(19) : ( col 12 ) remark: loop was not vectorized: contains unvectorizable statement at line 21

Looking at pi.f we see at line 19 there is a loop that is a candidate for vectorization. At line 21 we see the statement sum = sum + f(x). It is the call to the external function f(x) that is the issue. The external function may or may not contain data dependencies, thus the compiler makes the safe decision to not vectorize the loop

When one sees function or procedure calls within loops as in this example, the next logical step is to attempt to inline the function call. Inlining the function will allow the compiler to complete it's dependency analysis and often times allow vectorization of the loop. With the Intel compilers, options -ip and -ipo perform interprocedural optimizations. One of these optimization is function inlining. -ip is used to inline functions or procedures and perform optimizations that are contained within the same source file. -ipo is an advanced feature of the Intel compilers. With this option, the compilers are able to find inlining and optimization opportunities across source files, as in this example. In this case, fx.f is a separate file containing the function f(x). Compiling with -ipo gives:

$ ifort -O3 -ipo -vec-report=2 -o pi pi.f fx.f
IPO: performing multi-file optimizations
IPO: generating object file /tmp/ipo_ifort0FmkdQ.o
pi.f(19) : (col. 12) remark: LOOP WAS VECTORIZED.

The runtime of the non-vectorized program took 40 seconds on an iMac with a 1.83Ghz Intel Core Duo processor. The vectorized version took 17 seconds. We need to point out that this was a very trivial case. Deeply nested and complex procedure call trees that are called from within a loop will almost certainly never be able to be inlined.

Ill-defined loops: Compilers must be able to identify a loop and be able to determine the number of iterations, or trip count. Here are some example in C and Fortran:

 int count = 1;
    while (count <= 100){
        z[i] = x[i+1];
        count += 1;
 }
     I = 0
100  CONTINUE
     Z(I) = X(I+1)
     I = I + 1
     IF ( I .LT. 100 ) GOTO 100

Branching outside of the loop: whenever there is a conditional branch inside the loop this can disqualify the loop as a candidate for vectorization:

for ( int i=0; i<100 ; i++ ) {
   z[i] = x[i+1];
   if ( z[i] == 0 ) exit(-1);
}

Techniques to Improve Vectorization

We've already seen several techniques that improve vectorization. These include writing clearly defined loops that are easy for the compiler to recognize. Since vectorization is performed on inner loops, it is especially critical for these inner loops. Although it's counter to module programming techniques, for efficiency it is best to avoid deeply nested procedure calls inside of computational loops. Try to keep procedure calls to one level of nesting if at all possible. And although we did not mention this earlier, it is much easier for compilers to recognize inlining opportunities when functions and procedures are within the same source file. However, as we've seen, if you must have the functions in separate source files make sure you use the interprocedural optimization compiler switch, -ipo, provided by the Intel(R) Fortran Compiler and Intel(R) C++ Compiler for Mac OS.

Finally, instead of writing your own version of mathematical functions, where available use vectorized versions of libraries. As an example, the Intel Compilers for Mac OS ship with a short vector math library, libsvml. This library has vectorized versions of common math functions normally found in libm. The functions in libsvml include the common transcendental functions sin/cos/tan, asin/acos/atan as well as exp/pow, and ln/log10. In addition, the Intel compilers provide optimized memcpy, memcmp functions which are also quite prevalent thoughout any application. When using the Intel Compilers, this vectorized library will link prior to libm. Thus you will automatically link in vectorized versions of these common functions. Just remember to use the Intel drivers ( icc/icpc/ifort ) for compiling and linking and do NOT add -lm to the link arguments.

Finally, for more sophisticated mathematical, encryption, image processing and statistical functions, Intel provides two other library products. The Intel(R) Math Kernel Library (Intel(R) MKL) for Mac OS provides BLAS, FFT, and vectorized statistical libraries. These library functions are highly tuned and optimized to take maximum advantage the Digital Media Extensions of the Intel Core Duo processor. In addition to using SSE, these libraries are also multi-threaded to take advantage of both cores in the Intel Core Duo processor. Customers performing data compression, encryption, video encoding/decoding and speech processing will want to consider the Intel(R) Integrated Performance Primitives (Intel(R) IPP). Intel IPP routines are also highly tuned to utilize SSE.

Summary

The Streaming SIMD Extentions (SSE) architectural features of the Intel Core Duo processor enable integer and floating point acceleration for applications. SSE is a hybrid of traditional SIMD and vector processing methodologies. The Intel Fortran Compiler and Intel C++ Compiler refer to these techniques as vectorization. With the Intel Fortran Compiler and Intel C++ Compiler for Mac* OS, vectorization is enabled by default at optimization level 1 (-O1) and above. The Intel compilers also feature vectoriztion reporting via the -vec-report compiler option. Not only will the report list the location of loops vectorized, it will also list the locations of loops that were not vectorized and explain why it did not vectorize those loops. These hints enable the programmer to indentify vectorization inhibitors which can often be removed, leading to substantial performance improvements

Further Reading

A good place to start learning about SSE and advanced optimizations is in the Optimizing Applications chapter of the Intel C++ Compiler or Intel Fortran Compiler documentation which comes with the Intel Compilers for Mac OS. The SSE features of the Intel Core Duo processor are rich and extensive. So much so that a full treatment on this topic requires an entire book. The definitive guide to software vectorization and SSE is The Software Vectorization Handbook, Aart J.C. Bik, Intel Press, ISBN 0-9743649-2-4. If you are a programmer moving code from older Apple machines, using Altivec instructions, there are some excellent resources covering Altivec to SSE migration to be found on Apple's developer website (ADC).


Both authors are members of the Intel Compiler team. Ganesh Rao has been with Intel for over nine years and currently helps optimize applications to take advantage of the latest Intel(R) processors using the Intel(R) compilers.

Ron Wayne Green has been involved in Fortran and high-performance computing applications development and support for over twenty years and contributes to Fortran and technical computing issues.

 

Community Search:
MacTech Search:

Software Updates via MacUpdate

Latest Forum Discussions

See All

Six fantastic ways to spend National Vid...
As if anyone needed an excuse to play games today, I am about to give you one: it is National Video Games Day. A day for us to play games, like we no doubt do every day. Let’s not look a gift horse in the mouth. Instead, feast your eyes on this... | Read more »
Old School RuneScape players turn out in...
The sheer leap in technological advancements in our lifetime has been mind-blowing. We went from Commodore 64s to VR glasses in what feels like a heartbeat, but more importantly, the internet. It can be a dark mess, but it also brought hundreds of... | Read more »
Today's Best Mobile Game Discounts...
Every day, we pick out a curated list of the best mobile discounts on the App Store and post them here. This list won't be comprehensive, but it every game on it is recommended. Feel free to check out the coverage we did on them in the links below... | Read more »
Nintendo and The Pokémon Company's...
Unless you have been living under a rock, you know that Nintendo has been locked in an epic battle with Pocketpair, creator of the obvious Pokémon rip-off Palworld. Nintendo often resorts to legal retaliation at the drop of a hat, but it seems this... | Read more »
Apple exclusive mobile games don’t make...
If you are a gamer on phones, no doubt you have been as distressed as I am on one huge sticking point: exclusivity. For years, Xbox and PlayStation have done battle, and before this was the Sega Genesis and the Nintendo NES. On console, it makes... | Read more »
Regionally exclusive events make no sens...
Last week, over on our sister site AppSpy, I babbled excitedly about the Pokémon GO Safari Days event. You can get nine Eevees with an explorer hat per day. Or, can you? Specifically, you, reader. Do you have the time or funds to possibly fly for... | Read more »
As Jon Bellamy defends his choice to can...
Back in March, Jagex announced the appointment of a new CEO, Jon Bellamy. Mr Bellamy then decided to almost immediately paint a huge target on his back by cancelling the Runescapes Pride event. This led to widespread condemnation about his perceived... | Read more »
Marvel Contest of Champions adds two mor...
When I saw the latest two Marvel Contest of Champions characters, I scoffed. Mr Knight and Silver Samurai, thought I, they are running out of good choices. Then I realised no, I was being far too cynical. This is one of the things that games do best... | Read more »
Grass is green, and water is wet: Pokémo...
It must be a day that ends in Y, because Pokémon Trading Card Game Pocket has kicked off its Zoroark Drop Event. Here you can get a promo version of another card, and look forward to the next Wonder Pick Event and the next Mass Outbreak that will be... | Read more »
Enter the Gungeon review
It took me a minute to get around to reviewing this game for a couple of very good reasons. The first is that Enter the Gungeon's style of roguelike bullet-hell action is teetering on the edge of being straight-up malicious, which made getting... | Read more »

Price Scanner via MacPrices.net

Take $150 off every Apple 11-inch M3 iPad Air
Amazon is offering a $150 discount on 11-inch M3 WiFi iPad Airs right now. Shipping is free: – 11″ 128GB M3 WiFi iPad Air: $449, $150 off – 11″ 256GB M3 WiFi iPad Air: $549, $150 off – 11″ 512GB M3... Read more
Apple iPad minis back on sale for $100 off MS...
Amazon is offering $100 discounts (up to 20% off) on Apple’s newest 2024 WiFi iPad minis, each with free shipping. These are the lowest prices available for new minis among the Apple retailers we... Read more
Apple’s 16-inch M4 Max MacBook Pros are on sa...
Amazon has 16-inch M4 Max MacBook Pros (Silver and Black colors) on sale for up to $410 off Apple’s MSRP right now. Shipping is free. Be sure to select Amazon as the seller, rather than a third-party... Read more
Red Pocket Mobile is offering a $150 rebate o...
Red Pocket Mobile has new Apple iPhone 17’s on sale for $150 off MSRP when you switch and open up a new line of service. Red Pocket Mobile is a nationwide MVNO using all the major wireless carrier... Read more
Switch to Verizon, and get any iPhone 16 for...
With yesterday’s introduction of the new iPhone 17 models, Verizon responded by running “on us” promos across much of the iPhone 16 lineup: iPhone 16 and 16 Plus show as $0/mo for 36 months with bill... Read more
Here is a summary of the new features in Appl...
Apple’s September 2025 event introduced major updates across its most popular product lines, focusing on health, performance, and design breakthroughs. The AirPods Pro 3 now feature best-in-class... Read more
Apple’s Smartphone Lineup Could Use A Touch o...
COMMENTARY – Whatever happened to the old adage, “less is more”? Apple’s smartphone lineup. — which is due for its annual refresh either this month or next (possibly at an Apple Event on September 9... Read more
Take $50 off every 11th-generation A16 WiFi i...
Amazon has Apple’s 11th-generation A16 WiFi iPads in stock on sale for $50 off MSRP right now. Shipping is free: – 11″ 11th-generation 128GB WiFi iPads: $299 $50 off MSRP – 11″ 11th-generation 256GB... Read more
Sunday Sale: 14-inch M4 MacBook Pros for up t...
Don’t pay full price! Amazon has Apple’s 14-inch M4 MacBook Pros (Silver and Black colors) on sale for up to $220 off MSRP right now. Shipping is free. Be sure to select Amazon as the seller, rather... Read more
Mac mini with M4 Pro CPU back on sale for $12...
B&H Photo has Apple’s Mac mini with the M4 Pro CPU back on sale for $1259, $140 off MSRP. B&H offers free 1-2 day shipping to most US addresses: – Mac mini M4 Pro CPU (24GB/512GB): $1259, $... Read more

Jobs Board

All contents are Copyright 1984-2011 by Xplain Corporation. All rights reserved. Theme designed by Icreon.