Digital Media Boost With the Intel Core Duo Processor

Volume Number: 22 (2006)
Issue Number: 8
Column Tag: Performance Optimization

Extracting maximum performance from your applications

by Ron Wayne Green and Ganesh Rao

Introduction

This is the second part of a three-part series that addresses the most effective techniques to optimize applications for Intel(R) Core(TM) Duo processor-based Apple Macintosh computers. Part one of this series introduced the key aspects of the Intel Core Duo processor and exposed the architectural features for which tuning is most important. Also presented in that first article was a data-driven performance methodology using the software development tools available on an Intel processor-based Apple Macintosh to highlight tuning and optimization opportunities. This article introduces the Intel(R) Digital Media Boost technology of the Intel Core Duo processor, its capabilities, and how a programmer can exploit this computing power. The final part of the series, to come in a future MacTech issue, will provide readers with the next level of optimization - taking advantage of both execution cores in the Intel Core Duo processor.

In this article, we examine the Intel Digital Media Boost enhancements to the Streaming SIMD Extension (SSE) features of the Intel Core Duo processor. We also describe how to direct the Intel compilers to leverage these features for optimal application performance. Finally, we will examine inhibitors to the use of these advanced hardware features and how to remove some of these inhibitors. Examples will be illustrated with C++ and Fortran code snippets.

Goal: Integer and Floating Point Calculations

Before we dive into the details of SSE and Intel Digital Media Boost, let's understand our goals for this article. Reviewing our high-level diagram of the Intel Core Duo processor, Figure 1, we see that the processor has two cores. Each core is a full-featured, traditional CPU which includes registers, an instruction pipeline and execution unit, and advanced integer and floating point arithmetic units. Our goal for this article is to focus on a single core (either one, as they are equivalent) and look at the hardware provided to accelerate integer and floating point calculations.

Figure 1: Intel(R) Core(TM) Duo processor architecture

SIMD: A methodology for performing calculations in Parallel

Single-Instruction, Multiple Data (SIMD) is a methodology for performing the same mathematical operation on every element of a data set. Imagine that you have 1,000 elements in each of two rank-1 arrays, or vectors in mathematical terms, and you wish to add the corresponding elements of the arrays. Let's call the operand arrays A and B, and we wish to store the sums in a third array, C, as shown below:

real, dimension(1000) :: a, b, c
do  i=1,1000
   c(i) = a(i) + b(i)
end do

Or the equivalent loop expressed in Fortran 90 array syntax:

c = a + b

For this example, the "Multiple Data" in the term SIMD is the 1,000 items in each array or vector. The "Single Instruction" is the addition operation that we wish to perform on the elements of A and B. With an infinite hardware budget, we could store the A operands in 1,000 registers within the core, store the B operands in another 1,000 registers, feed the registers in parallel into 1,000 addition units which perform the add operation, and finally feed the results of the 1,000 addition units in parallel to 1,000 storage registers - all of this in one instruction cycle. Ah, idealism! Reality is that the transistor budget on current generation silicon does not allow parallelism on this scale. One also has to remember that any registers provided to user processes have to be saved and later restored during context switching. Our goal is simple: We want to load our operands into register files, use parallel arithmetic units to operate on the operands, and feed the results into registers or to memory.

There is another term we need to understand before proceeding. Vectorization or Vector Processing is a term that has been used in high performance computing for many years. It refers to a technique to load a set of registers, sometimes called a register file, with operands. After the operands are loaded, a single instruction is used to perform a mathematical operation on the operands. This differs from SIMD in that the mathematical operation specified by the single instruction is performed by sequentially streaming operands from the registers through the arithmetic unit and back into registers - usually with one mathematical operation per clock cycle. In SIMD a single instruction operates in parallel on the dataset. In vector processing, a single instruction operates on the operands in a register file in rapid sequence.

Inherent in vector processing is the assumption that vectors are large. Therefore, caching of this data should be avoided. If cache were used as a conduit between memory and the register file, accessing a large vector would quickly fill the cache and it would spill without reuse of any element of the vector. Thus, large vector streaming memory access patterns see the cache as nothing more than useless overhead. For vector processing, direct memory-to-register or cache-bypass techniques and instructions are typically used. These streaming instructions are part of the SSE instruction set.

SSE is a hybrid of a pure SIMD model and a pure vector model. SSE uses vectorization techniques to stream data directly between memory and the SSE registers (the Streaming component of Streaming SIMD Extensions). These SSE registers act as a register file. However, the SSE registers pack several operands into each 128-bit register and operate on them as a set in a data-parallel SIMD model (the SIMD portion of SSE). For the remainder of this article we will refer to the process of compiling code to take advantage of SSE as vectorization.
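To make the packed model concrete, here is a minimal hand-written sketch in C of the array addition from the earlier example, using the SSE intrinsics declared in xmmintrin.h. This is only an illustration of the idea: the helper name add_arrays is hypothetical, the article itself relies on the compiler to generate SSE code automatically, and the code a compiler actually emits will differ. When SSE is not available at compile time, the sketch falls back to a plain scalar loop.

```c
#include <stddef.h>
#if defined(__SSE__)
#include <xmmintrin.h>   /* SSE intrinsics: __m128, _mm_add_ps, ... */
#endif

/* c[i] = a[i] + b[i] for n floats.
   Hypothetical helper for illustration; not taken from the article. */
void add_arrays(const float *a, const float *b, float *c, size_t n)
{
    size_t i = 0;
#if defined(__SSE__)
    /* Each 128-bit SSE register holds four 32-bit floats. */
    for (; i + 4 <= n; i += 4) {
        __m128 va = _mm_loadu_ps(a + i);   /* load 4 floats, any alignment */
        __m128 vb = _mm_loadu_ps(b + i);
        _mm_storeu_ps(c + i, _mm_add_ps(va, vb));   /* 4 additions at once */
    }
#endif
    /* Scalar loop: handles the remainder, or everything without SSE. */
    for (; i < n; i++)
        c[i] = a[i] + b[i];
}
```

With n = 1,000 the packed loop runs 250 times instead of 1,000, which is where the potential speedup comes from.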

We need to stop at this point for an important consideration: These techniques are only efficient when an application has enough operands to make the setup costs worthwhile. Setup costs include the time to load the registers with the elements of A and B from memory and the time to unload the elements of C from registers to memory. Looking at the DO loop above, if there are only 5 iterations of the loop (operations on just 5 elements in each of A, B, and C) then the setup costs may exceed the speedup benefit of using the SIMD and vectorization techniques. Also, a loop may not be efficient if it contains too many instructions or conditionals that break down the vectorization within the loop. Loops without enough iterations, or with too many expressions that break down the vectorization, are termed inefficient.

Streaming SIMD support in Intel Core Duo Processor

The Intel Core Duo processor supports SIMD and vectorization with dedicated registers, arithmetic hardware, SIMD mathematical instructions to operate on the data in the SSE registers, and streaming (cache bypass) memory load and store instructions. Each core of the Core Duo processor has its own dedicated SIMD hardware. Figure 2 illustrates the SSE hardware available in each of the two cores in the Intel Core Duo processor. This hardware, along with the instructions that drive this special-purpose arithmetic resource, is referred to as Streaming SIMD Extensions, or SSE. SSE was designed and has evolved to accelerate integer and floating-point calculations. And while the intent of SSE was to accelerate common media operations, these same mathematical and data movement operations are applicable to a wide range of applications in technical computing, finance, signal processing, graphics, and gaming, to name a few.

Figure 2: SSE registers and supported data types

SSE operands can be integers, from 1-byte through 8-byte integer types, both signed and unsigned. Floating point data is supported in 32- or 64-bit IEEE format. As shown in Figure 2, the SSE registers are 128 bits wide. Thus, these SSE-supported data types are packed within the registers and operated upon in SIMD fashion. Operations on the data can be addition, subtraction, multiplication, and division; some transcendental functions such as sine and cosine are available as vectorized library routines built on these instructions.
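The packing arithmetic is simple enough to write down. The short C sketch below (the names SSE_REGISTER_BYTES and sse_lanes are hypothetical, introduced only for illustration) computes how many elements of a given size fit in one 128-bit register.

```c
#include <stddef.h>

/* Width of one SSE register in bytes (128 bits). */
enum { SSE_REGISTER_BYTES = 16 };

/* Number of elements of the given size (in bytes) that fit in one
   SSE register. Hypothetical helper for illustration only;
   element_bytes must be nonzero. */
size_t sse_lanes(size_t element_bytes)
{
    return SSE_REGISTER_BYTES / element_bytes;
}
```

For example, sse_lanes(1) is 16 for 1-byte integers, sse_lanes(4) is 4 for 32-bit floats or integers, and sse_lanes(8) is 2 for 64-bit doubles or integers.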

Enabling Digital Media Boost Vectorization

The Intel(R) Fortran and C++ Compilers for Mac* OS allow the programmer to generate binaries that take full advantage of the Digital Media Boost technology. In fact, the Intel compilers will enable vectorization by default when the compiler is using optimization level 1 and above (compiler options -O1 through -O3). Let's look at how to enable vectorization with the Intel compiler from the Xcode environment. We assume that the reader has installed the Intel Fortran or C++ Compiler for Mac* OS and has read through the chapter Build Applications with Xcode in the Fortran or C++ Compiler Documentation. One suggestion: it is best to keep the settings for optimization only in the Release configuration for the target(s). Optimization settings can adversely affect the ability to debug an application.

The first step to enable vectorization is to choose an optimization level of 1 or higher (compiler options -O1, -O2, or -O3). Highlight the target for your project, select Get Info from Action (see Figure 3). This brings up the Target Info window. Again, make sure you are working with the Release configuration for the Target.

Figure 3: Bring up Target Info

For the Collection pull-down, you have two choices. You can view all compiler settings by selecting the Intel(R) C++ (or Fortran) Compiler 9.1 collection (Figure 4). This gives you access to the entire set of compiler options for the Intel C++ or Fortran compiler. Or as another choice, you can select the General collection which is under the Intel C++ or Fortran compiler collection (Figure 5). This collection also has the Optimization settings.

Figure 4: All settings from the compiler collection

Choose any optimization other than None (-O0) and vectorization will be performed by the compiler. For the command line, simply use the compiler options -O1, -O2, or -O3 and you are now taking advantage of the SSE features of the Intel Core Duo processor. Or are you? The next logical question is "how do I know that the compiler vectorized my code?".

Figure 5: Optimization settings under General Collection

This brings us to the question of how to determine whether or not the compiler is vectorizing individual loops. The Intel compilers provide a vectorization report option that provides two kinds of information. First, the vectorization report will inform you which loops within your code are being vectorized. The end result of a vectorized loop is an instruction stream for that loop that contains SSE instructions. This is essential information to verify that the compiler is indeed vectorizing the loops that you expect it to vectorize. Second, and what we find critically important, the report identifies the loops that the compiler did NOT vectorize and explains why. This information assists a programmer by highlighting the barriers that the compiler finds to vectorization.

Figure 6: Enabling the vectorization report

With the Intel compilers, one must enable the vectorization reporting mechanism. It is not enabled by default. Within the Xcode environment, the vectorization report is enabled by selecting one of the vector reports in the setting Vectorizer Diagnostic Report from the Diagnostics collection for the Target (Figure 6). The vectorization report is viewed in the Build Results window. The report follows the compilation for each source file, as shown in Figure 7.

Figure 7: Vectorization report

The vectorization report option, -vec-report=<n>, uses the argument <n> to specify the information presented; from no information at -vec-report=0 to very verbose information at -vec-report=5. The arguments to -vec-report are:

n=0: No diagnostic information

n=1: (Default) Loops successfully vectorized

n=2: Loops successfully vectorized, plus loops not vectorized and the reason why not

n=3: Adds dependency Information

n=4: Reports only non-vectorized loops

n=5: Reports only non-vectorized loops and adds dependency info

Inhibitors to vectorization

The Intel compilers attempt to vectorize loops within the code. However, not all loops can be vectorized. There are too many cases to list in the space of this article. We will examine a few common scenarios where the compiler cannot vectorize a loop.

Outer Loops: When there are nested loops, the vectorization is applied to the innermost loop. Outer loops are never vectorized, so you can expect -vec-report to identify these cases. This can be seen in the output of -vec-report=2 in the example below:

$ ifort -O3 -vec-report=2  -o md md.f
 ...
md.f(212) : (col. 7) remark: loop was not vectorized: not inner loop.
md.f(213) : (col. 9) remark: LOOP WAS VECTORIZED.
    ...
    212       do i = 1,np
    213         do j = 1,nd
    214           pos(j,i) = pos(j,i) + vel(j,i)*dt + 0.5*dt*dt*a(j,i)
    215           vel(j,i) = vel(j,i) + 0.5*dt*(f(j,i)*rmass + a(j,i))
    216           a(j,i) = f(j,i)*rmass
    217         enddo
    218       enddo

In this abbreviated example from a molecular dynamics code, we see from the vectorization report that the compiler attempts to vectorize only the inner loop, the do j=1,nd loop.

Data Dependencies: To be a candidate for vectorization, a loop cannot contain dependencies between loop iterations. Dependencies occur when a strict ordering of the iterations must be enforced. Consider the following loop:

void scale(float* z) {
 float A; int i;
 A = 42.0;
 for ( i=1; i<10000; i++ )   /* start at 1 so that z[i-1] stays in bounds */
     z[i] = A * z[i-1];  }

Which when compiled gives:

$ icc -O3 -vec-report=2 -c depend.c
depend.c(4) : (col. 2) remark: loop was not vectorized: existence of vector dependence.

Examining this, we see that in order to calculate the value to store in z[i] we need to have already calculated the value for z[i-1]. This forces a strict, sequential ordering on when the calculations must be performed. There are many other interesting cases to consider in dependency analysis, and the reader is encouraged to pursue this further by researching some of the references at the end of this article.
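For contrast, the sketch below places the dependent loop next to a variant whose iterations are independent. The function names and the second array y are hypothetical, introduced only for illustration. The C99 restrict qualifier is the programmer's promise that z and y do not overlap, which is the kind of information a compiler needs before it can treat the iterations as independent. Whether a particular compiler then vectorizes the second loop still depends on its options.

```c
#include <stddef.h>

/* Loop-carried dependence: iteration i reads the value written by
   iteration i-1, so the iterations must run in order. */
void scale_recurrence(float *z, size_t n)
{
    const float A = 42.0f;
    for (size_t i = 1; i < n; i++)
        z[i] = A * z[i - 1];
}

/* No dependence between iterations: each z[i] depends only on y[i].
   restrict promises the compiler that z and y do not overlap. */
void scale_independent(float *restrict z, const float *restrict y, size_t n)
{
    const float A = 42.0f;
    for (size_t i = 0; i < n; i++)
        z[i] = A * y[i];
}
```

In the second function any four consecutive iterations can be computed together, because none of them needs a result produced by another.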

Function and Procedure calls: Another major inhibitor to vectorization is when the loop contains a function or procedure call. Consider this example:

      1 c   Pi:  Compute pi
      2 c
      3 c   Illustrates how to calculate the definite integral
      4 c   of a function f(x).
      5 c
      6 c   We integrate the function:
      7 c         f(x) = 4/(1+x**2)
      8 c   between the limits x=0 and x=1.
      9 c
     10 c   The result should approximate the value of pi.
     11 c   The method is the n-point rectangle quadrature rule.
     12         program computepi
     13         integer           n, i
     14         double precision  sum, pi, x, h, f
     15         external          f
     16         n = 1000000000
     17         h = 1.0d0/n
     18         sum = 0.0d0
     19         do 10 i = 1,n
     20            x = h*(i-0.5d0)
     21            sum = sum + f(x)
     22 10      continue
     23        pi = h*sum
     24        print *, 'pi is approximately : ', pi
     25        end

Within the do loop above, a function call to f(x) is made. In this example, the function f is in a separate source file. The code for f is as follows:

c   fx.f:  Integration function
      double precision function f(x)
      double precision x
      f = 4.0d0/(1.0d0 + x*x)
      end

When we attempt to compile these two source files with -vec-report, we get the following:

$ ifort -O3 -vec-report=2 -o pi pi.f fx.f
pi.f(19) : (col. 12) remark: loop was not vectorized: contains unvectorizable statement at line 21

Looking at pi.f we see at line 19 there is a loop that is a candidate for vectorization. At line 21 we see the statement sum = sum + f(x). It is the call to the external function f(x) that is the issue. The external function may or may not contain data dependencies, thus the compiler makes the safe decision to not vectorize the loop.

When one sees function or procedure calls within loops as in this example, the next logical step is to attempt to inline the function call. Inlining the function will allow the compiler to complete its dependency analysis and will often allow vectorization of the loop. With the Intel compilers, options -ip and -ipo perform interprocedural optimizations. One of these optimizations is function inlining. -ip inlines functions or procedures and performs optimizations within a single source file. -ipo is an advanced feature of the Intel compilers. With this option, the compilers are able to find inlining and optimization opportunities across source files, as in this example. In this case, fx.f is a separate file containing the function f(x). Compiling with -ipo gives:

$ ifort -O3 -ipo -vec-report=2 -o pi pi.f fx.f
IPO: performing multi-file optimizations
IPO: generating object file /tmp/ipo_ifort0FmkdQ.o
pi.f(19) : (col. 12) remark: LOOP WAS VECTORIZED.

The non-vectorized program ran in 40 seconds on an iMac with a 1.83 GHz Intel Core Duo processor. The vectorized version took 17 seconds. We need to point out that this was a very trivial case. Deeply nested and complex procedure call trees that are called from within a loop can almost never be inlined.
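The same idea can be sketched in C by keeping the integrand in the same source file as the loop, so that the compiler can see its body without any cross-file analysis. This is a minimal sketch rather than a translation of the Fortran program above: the name compute_pi is hypothetical, and whether a given compiler vectorizes the summation also depends on its floating-point settings, since vectorizing a sum reorders the additions.

```c
/* Integrand f(x) = 4 / (1 + x^2). Defining it in the same file as the
   loop (and marking it static inline) lets the compiler see its body. */
static inline double f(double x)
{
    return 4.0 / (1.0 + x * x);
}

/* n-point rectangle rule on [0, 1], sampling each interval at its
   midpoint. Hypothetical helper for illustration only. */
double compute_pi(long n)
{
    const double h = 1.0 / (double)n;
    double sum = 0.0;
    for (long i = 1; i <= n; i++) {
        double x = h * ((double)i - 0.5);
        sum += f(x);   /* candidate for inlining */
    }
    return h * sum;
}
```

Calling compute_pi(1000000) returns a value that approximates pi; larger n gives a closer approximation at the cost of a longer run.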

Ill-defined loops: Compilers must be able to identify a loop and be able to determine the number of iterations, or trip count. Here are some examples in C and Fortran of loops written in a form that obscures this:

    int i = 0;
    while (i < 100){
        z[i] = x[i+1];
        i += 1;
    }

     I = 1
100  CONTINUE
     Z(I) = X(I+1)
     I = I + 1
     IF ( I .LE. 100 ) GOTO 100
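A well-defined alternative states the loop bounds and step directly in the loop header. The sketch below is one way to write it in C; the function name shift_copy is hypothetical, and the restrict qualifiers assume that the two arrays do not overlap.

```c
#include <stddef.h>

/* Copy x[1..n] into z[0..n-1] with a loop whose trip count (n) is
   known on entry and whose index advances by a fixed step of 1. */
void shift_copy(float *restrict z, const float *restrict x, size_t n)
{
    for (size_t i = 0; i < n; i++)
        z[i] = x[i + 1];   /* x must hold at least n + 1 elements */
}
```

Here the index, the bounds, and the increment are all visible in one place, which makes the trip count easy for a compiler to determine.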

Branching outside of the loop: Whenever a conditional branch inside the loop can transfer control out of the loop, this can disqualify the loop as a candidate for vectorization:

for ( int i=0; i<100 ; i++ ) {
   z[i] = x[i+1];
   if ( z[i] == 0 ) exit(-1);
}
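One possible restructuring, sketched below, removes the early exit from the copy loop and performs the zero check in a separate pass, leaving the caller to decide what to do with the result. The function name copy_and_check is hypothetical. Note the behavioral difference: the original stops at the first zero, while this version always completes the copy before reporting.

```c
#include <stddef.h>

/* Copy x[1..n] into z[0..n-1] and report whether any copied value is
   zero. The copy loop has no early exit; the check is a separate pass.
   Returns 1 if a zero was found, 0 otherwise.
   Hypothetical helper for illustration only. */
int copy_and_check(float *restrict z, const float *restrict x, size_t n)
{
    for (size_t i = 0; i < n; i++)
        z[i] = x[i + 1];

    int found_zero = 0;
    for (size_t i = 0; i < n; i++)
        found_zero |= (z[i] == 0.0f);   /* no branch out of the loop */

    return found_zero;
}
```

A caller that wants the original behavior can test the return value and call exit(-1) itself, outside of either loop.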

Techniques to Improve Vectorization

We've already seen several techniques that improve vectorization. These include writing clearly defined loops that are easy for the compiler to recognize. Since vectorization is performed on inner loops, this is especially critical for inner loops. Although it runs counter to modular programming techniques, for efficiency it is best to avoid deeply nested procedure calls inside of computational loops. Try to keep procedure calls to one level of nesting if at all possible. And although we did not mention this earlier, it is much easier for compilers to recognize inlining opportunities when functions and procedures are within the same source file. However, as we've seen, if you must have the functions in separate source files, make sure you use the interprocedural optimization compiler switch, -ipo, provided by the Intel(R) Fortran Compiler and Intel(R) C++ Compiler for Mac OS.

Also, instead of writing your own versions of mathematical functions, use vectorized versions of libraries where available. As an example, the Intel Compilers for Mac OS ship with a short vector math library, libsvml. This library has vectorized versions of common math functions normally found in libm. The functions in libsvml include the common transcendental functions sin/cos/tan, asin/acos/atan as well as exp/pow, and ln/log10. In addition, the Intel compilers provide optimized memcpy and memcmp functions, which are also quite prevalent throughout any application. When using the Intel Compilers, this vectorized library will link prior to libm. Thus you will automatically link in vectorized versions of these common functions. Just remember to use the Intel drivers (icc/icpc/ifort) for compiling and linking and do NOT add -lm to the link arguments.

Finally, for more sophisticated mathematical, encryption, image processing and statistical functions, Intel provides two other library products. The Intel(R) Math Kernel Library (Intel(R) MKL) for Mac OS provides BLAS, FFT, and vectorized statistical libraries. These library functions are highly tuned and optimized to take maximum advantage of the Digital Media Boost features of the Intel Core Duo processor. In addition to using SSE, these libraries are also multi-threaded to take advantage of both cores in the Intel Core Duo processor. Customers performing data compression, encryption, video encoding/decoding and speech processing will want to consider the Intel(R) Integrated Performance Primitives (Intel(R) IPP). Intel IPP routines are also highly tuned to utilize SSE.

Summary

The Streaming SIMD Extensions (SSE) architectural features of the Intel Core Duo processor enable integer and floating point acceleration for applications. SSE is a hybrid of traditional SIMD and vector processing methodologies. The Intel Fortran Compiler and Intel C++ Compiler refer to these techniques as vectorization. With the Intel Fortran Compiler and Intel C++ Compiler for Mac* OS, vectorization is enabled by default at optimization level 1 (-O1) and above. The Intel compilers also feature vectorization reporting via the -vec-report compiler option. Not only will the report list the locations of loops that were vectorized, it will also list the locations of loops that were not vectorized and explain why the compiler did not vectorize them. These hints enable the programmer to identify vectorization inhibitors, which can often be removed, leading to substantial performance improvements.