Harnessing the Power of SIMD Programming Using AVX (2024)


In the fast-evolving world of Computer Science, optimizing your code for performance is often the key to staying competitive. One powerful technique that can significantly boost the efficiency of your computations is SIMD programming using AVX (Advanced Vector Extensions). But what exactly is SIMD, why should you care about it, and how can you use it to supercharge your deep learning projects? This comprehensive guide will provide you with all the answers you need.

Not everyone needs to learn SIMD programming. But understanding the inner workings of different computational kernels gives you a better perspective when you write code, even if you write in an interpreted language.

Let’s start by breaking down the term. SIMD stands for Single Instruction, Multiple Data. It’s a type of parallelism used in computing that allows a single instruction to perform the same operation on multiple data elements simultaneously. In simpler terms, SIMD enables you to process multiple pieces of data with a single command, which can lead to significant performance improvements.

We are all aware of the Arithmetic and Logic Units (ALUs) inside our processors that perform all the additions and multiplications. Our initial mental model of how a processor adds a set of numbers is that it takes two numbers at a time and adds them, one pair after another. Although that is how scalar code works, engineers devised SIMD hardware that can add multiple numbers at the same time. Here, the 'Single Instruction' is Add and the 'Multiple Data' is a group of numbers. Later in this blog, we will examine an actual C++ program that does this.

Before we dive deeper into the world of SIMD programming using AVX, it’s important to understand why it matters in the context of deep learning and data science, even if you primarily work with high-level languages like Python.

1. Speed and Efficiency

Deep learning models, particularly neural networks, involve a tremendous amount of mathematical computations. These computations are often performed on large datasets, which can be time-consuming. SIMD allows you to accelerate these calculations by performing operations on multiple data points simultaneously. This means faster training times and quicker model evaluation.

2. Parallelism

Modern processors are designed with parallelism in mind. They contain multiple cores that can execute instructions concurrently, and within each core, SIMD units provide data-level parallelism by processing several elements with a single instruction. Combining the two can result in a significant reduction in the time it takes to complete tasks.

3. Performance Optimization

Deep learning frameworks like TensorFlow and PyTorch are built on lower-level libraries that utilize SIMD instructions to optimize performance. By understanding SIMD programming, you can gain insights into how these frameworks work under the hood. This knowledge empowers you to write more efficient code and customize your deep learning pipelines for maximum speed. Furthermore, there are processor-specific optimizations we can perform if we understand the underlying computational mechanisms well.

Now that we’ve established the importance of SIMD programming, let’s take a closer look at AVX, or Advanced Vector Extensions. AVX is an instruction set architecture extension introduced by Intel for x86 processors and later adopted by AMD. It extends the SIMD capabilities of these processors by introducing wider registers and new instructions for performing vector operations.


Modern processors have multiple vector registers (in simple terms, vector registers are registers that hold several data elements so that the same operation can be applied to all of them at once), and some can issue multiple vector instructions in a single clock cycle. So, can you estimate how many additions an Intel processor with a max clock speed of 3.5 GHz and 12 cores can do?
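As a back-of-the-envelope estimate, assume each core has two 256-bit vector ALUs and sustains one vector addition per ALU per cycle. That is a typical figure for recent Intel cores, but it is microarchitecture-dependent, so treat this as a sketch rather than a spec:

3.5 × 10^9 cycles/second × 12 cores × 2 vector adds/cycle × 8 floats per add ≈ 672 billion single-precision additions per second.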

AVX is a game-changer for SIMD programming because it provides an extensive set of instructions to perform vector processing. Some of these instructions are:

  • Add
  • Multiply
  • Store
  • Load
  • More operations to move values in and around the registers.

Furthermore, Intel has blessed us with an extensive reference for the different AVX commands in the form of its Intrinsics Guide.

AVX and AVX512

Before we look at how we can write code using AVX, let us briefly discuss what AVX and AVX512 mean. AVX is a standard for processing 256-bit-wide data: its registers can hold 256 bits of information and process them in one go. Recall how much a single-precision float takes up in memory. You’re right, it is 32 bits! So we can fit 8 floats, each 32 bits in size, into one of our 256-bit AVX registers.

If you’re connecting the dots, it means that we can process 8 floating point numbers at the same time using our SIMD programming paradigm. And as the name suggests, AVX512 can hold 512 bits of data. However, it is only available on some higher-end and server-grade Intel processors (and, more recently, on AMD’s Ryzen 7000 series). For most of us with personal computers, we will have to experiment with the regular AVX instructions.
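Because support varies across CPUs, it is good practice to check for AVX at runtime before calling AVX code. Here is a minimal sketch using the __builtin_cpu_supports builtin available in GCC and Clang (MSVC users would query __cpuid instead; the program itself is just an illustration):

#include <cstdio>

int main() {
    // __builtin_cpu_supports queries the running CPU's feature flags (GCC/Clang).
    if (__builtin_cpu_supports("avx"))
        std::printf("AVX is supported\n");
    else
        std::printf("AVX is NOT supported\n");

    if (__builtin_cpu_supports("avx512f"))
        std::printf("AVX-512 Foundation is supported\n");
    return 0;
}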

Now that we understand the concepts behind SIMD and AVX, let’s dive into the practical implementation. We’ll walk through some real-world coding examples in C++ to illustrate the power of SIMD programming.

Imagine you have two large arrays of size n and you want to add them together element-wise. Let us examine how we would do this using AVX. First, let us look at the pseudocode:

Loop:
While i < n:
    Load 8 elements from Array A into vector register X
    Load 8 elements from Array B into vector register Y

    Perform vector addition:
        Z = X + Y

    Store the 8 elements from vector register Z into Array C

    Increment i by 8 (to move to the next 8 elements)
End Loop

Imagine we have two arrays A and B. We want to add them at every index and store the results in an array C. Ordinarily, we would iterate from 0 to n and write something simple like:

Loop:
While i < n:
    C[i] = A[i] + B[i]
    Increment i by 1
End Loop
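In plain C++, that scalar baseline looks like this (the function name vectorSumScalar is just an illustrative choice):

// Scalar baseline: one addition per loop iteration.
void vectorSumScalar(float* A, float* B, float* C, int n) {
    for (int i = 0; i < n; i++) {
        C[i] = A[i] + B[i];
    }
}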

However, when writing AVX code, we have to load the data into registers explicitly and then explicitly store the results back to the array. It might seem like extra ceremony, but the payoff is that we perform 8 additions in a single step.

#include <immintrin.h> // Include AVX intrinsics header

void vectorSum(float* A, float* B, float* C, int n) {
    // Round n down to a multiple of 8; the main loop processes 8 floats at a time
    int limit = n - (n % 8);
    for (int i = 0; i < limit; i += 8) {
        // Load 8 elements from Arrays A and B into AVX registers
        __m256 avx_a = _mm256_loadu_ps(&A[i]);
        __m256 avx_b = _mm256_loadu_ps(&B[i]);

        // Perform vector addition
        __m256 avx_result = _mm256_add_ps(avx_a, avx_b);

        // Store the result back into Array C
        _mm256_storeu_ps(&C[i], avx_result);
    }
    // Handle any leftover elements when n is not a multiple of 8
    for (int i = limit; i < n; i++) {
        C[i] = A[i] + B[i];
    }
}
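To see it in action, here is a usage sketch that fills two arrays, calls vectorSum, and prints the output. The array size and values are arbitrary choices for illustration:

#include <cstdio>

int main() {
    const int n = 20; // deliberately not a multiple of 8, to exercise the tail loop
    float A[n], B[n], C[n];
    for (int i = 0; i < n; i++) {
        A[i] = static_cast<float>(i);
        B[i] = 2.0f * i;
    }

    vectorSum(A, B, C, n);

    for (int i = 0; i < n; i++) {
        std::printf("C[%d] = %.1f\n", i, C[i]); // expect 3*i at each index
    }
    return 0;
}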

Trust the process! If you’re with me so far, don’t let the __m_blah_blah names scare you. They are pretty simple once you understand what they mean.

  • First and foremost, since we are using AVX intrinsics, we need to include the immintrin.h header file.
  • __m256 indicates a 256-bit register; it is typically used for handling floats. There are other variants, __m256i and __m256d, for integers and doubles respectively.
  • _mm256_loadu_ps: As you can guess, it is used for loading data. But what does the ps mean? Packed Single-precision, i.e. our float data. Since a register holds 8 single-precision floats, it expects the address of an array location with 8 values ready to be loaded. The u stands for unaligned, meaning the address does not need to be 32-byte aligned (the aligned variant is _mm256_load_ps). And like __m256, it has variants for doubles, integers, and shorter numbers.
  • _mm256_add_ps: Takes two registers, adds the 8 pairs of numbers lane by lane, and returns a register with the result.
  • _mm256_storeu_ps: Just like load, it stores the values in a register back to an array address.

Also, note that we increment the loop index by 8 since we process 8 numbers per iteration; the scalar tail loop picks up any leftover elements when n is not a multiple of 8.

This all might feel overwhelming now, but the Intel Intrinsics guide is your best friend when it comes to navigating what these functions are and what they mean.
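Once addition makes sense, the other operations from the list above follow the same pattern. As a sketch of my own (the function name scaledSum and the operation are illustrative, not taken from the Intrinsics Guide), here is a scaled sum C[i] = s * A[i] + B[i] using the broadcast and multiply intrinsics _mm256_set1_ps and _mm256_mul_ps:

#include <immintrin.h>

void scaledSum(float* A, float* B, float* C, float s, int n) {
    // Broadcast the scalar s into all 8 lanes of a register
    __m256 avx_s = _mm256_set1_ps(s);
    int limit = n - (n % 8);
    for (int i = 0; i < limit; i += 8) {
        __m256 avx_a = _mm256_loadu_ps(&A[i]);
        __m256 avx_b = _mm256_loadu_ps(&B[i]);
        // Multiply, then add: 8 results per iteration
        __m256 avx_result = _mm256_add_ps(_mm256_mul_ps(avx_s, avx_a), avx_b);
        _mm256_storeu_ps(&C[i], avx_result);
    }
    // Scalar tail for leftover elements
    for (int i = limit; i < n; i++) {
        C[i] = s * A[i] + B[i];
    }
}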

Finally, you compile this program just like a normal C++ program. However, you need to add the ‘-mavx’ flag to the usual compile command to let the compiler know that you are using AVX instructions.
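For example, with GCC or Clang (the file name here is just a placeholder for wherever you saved the code):

g++ -O2 -mavx vector_sum.cpp -o vector_sum
./vector_sum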

Implementing SIMD programming using AVX can yield substantial benefits in terms of speed and efficiency. Let’s summarize these advantages:

By processing multiple data elements in parallel, AVX accelerates your code, leading to faster execution times. This speed boost is crucial for applications like real-time image processing, video rendering, and deep learning.

Faster execution not only saves time but also reduces energy consumption. This is particularly important in applications running on battery-powered devices, where energy efficiency is a primary concern.

AVX is widely supported on modern x86 processors, ensuring compatibility with a broad range of hardware. This means your optimized code will run efficiently on a variety of systems.

Many interpreted languages like Python bring computational code up to speed by calling C libraries, which are written using these low-level instructions. Most of us will encounter those libraries if we ever venture into optimizing runtime for our deep learning models. So when that day comes, don’t forget that your journey started here! :)

This was meant to be a high-level introduction to SIMD programming using AVX. There are a lot of advanced concepts that need to be understood before we can actually leverage the full potential of our machines. One key point to note is that existing libraries and languages are written with a high degree of optimization.

Of course, there is a lot of scope for improvement; however, using this low-level programming paradigm the wrong way can actually worsen performance. But that should not stop you from looking into how BLAS libraries or deep learning libraries are written. Finally, computation is one part of the equation and memory is another, but that is for another day!

Happy coding!


FAQs

What is AVX SIMD?

Advanced Vector Extensions (AVX, also known as Gesher New Instructions and then Sandy Bridge New Instructions) are SIMD extensions to the x86 instruction set architecture for microprocessors from Intel and Advanced Micro Devices (AMD).

Does AVX improve performance?

AVX can significantly boost performance for certain workloads, especially those involving parallelizable operations on large datasets. However, its effectiveness depends on the specific nature of the computations.

What are AVX instructions used for?

AVX, or Advanced Vector Extensions, are additions to the x86 instruction set architecture used by Intel and AMD CPUs. Put simply, the additional instructions allow compatible processors to perform more demanding functions when used with compatible software.

Why is AVX-512 important?

Intel® AVX-512 helps you optimize ROI and achieve the advanced workload performance you need without introducing the complexity and cost of discrete accelerators.

What is SIMD good for?

SIMD instructions are widely used to process 3D graphics, although modern graphics cards with embedded SIMD have largely taken over this task from the CPU. Some systems also include permute functions that re-pack elements inside vectors, making them particularly useful for data processing and compression.

Do all CPUs have SIMD?

Nearly all modern processors have the ability to perform multiple independent operations using one instruction. These are called Single Instruction, Multiple Data (SIMD) instructions or vector instructions.

Why is SIMD faster?

SIMD architectures organize data into vectors or arrays, enabling synchronized execution across many elements and therefore higher computational throughput.

Do GPUs have SIMD?

Yes. GPUs have used SIMD units since the early days to implement vector instructions. It is also no coincidence that the first programmable shaders used assembly-like shading languages providing instructions that operate on 4-component vectors.

Why disable AVX?

Some consider heavy AVX code a power virus, and reviewers are known to use heavy AVX loads to simulate a worst-case stress-test scenario that generates maximum temperature and power consumption readings. A cynic might say that Intel disabled AVX-512 to reduce extreme power consumption readings in reviews.

Do Xeon processors have AVX?

Yes. Recent Intel Xeon Scalable server processors support AVX-512, and the Emerald Rapids generation features improvements around its AVX-512 support, such as allowing the processors to reach higher frequencies during AVX-512 workloads.

Does Windows use AVX?

Yes. AVX includes extensions to both the instruction and register sets, and Microsoft has developed API enhancements, such as the XState functions, that enable applications to access and manipulate extended processor feature information and state, including Intel AVX.

Which CPUs have AVX?

Intel has supported AVX since the 2nd-generation Core ("Sandy Bridge") processors released in 2011, and AMD since its Bulldozer family, also from 2011. If your CPU predates these, the only way to get AVX support is to upgrade to a newer generation.

Why is Intel killing AVX-512?

On its hybrid Alder Lake chips, Intel said AVX-512 wasn't enabled due to the inclusion of two different core architectures: the E cores didn't support it, even though the P cores did. Motherboard manufacturers used this bit of knowledge to cheekily allow users to enable AVX-512 after disabling the E cores.

Is AVX-512 dead?

Among recent consumer Intel parts, only the 11900K lists AVX-512 as an available instruction set extension. So it's reasonable to say that AVX-512 on consumer Intel lines is dead for now, whereas AMD has just introduced it in the Ryzen 7000 series.

How many AVX registers are there?

Assuming a CPU with AVX at all (note that Pentium and Celeron parts historically lacked it): 32-bit mode always has 8 architectural YMM registers, though 32-bit mode is mostly obsolete for high-performance computing. 64-bit mode has 16 YMM registers, or 32 with AVX512VL if you include using EVEX-encoded 256-bit versions of instructions.

Should I turn AVX off?

If your workload uses AVX, turning off those instructions would reduce temperatures, but it would also kill your AVX performance. That is about as much of a solution as turning your computer off is a solution to reducing CPU temperatures.

What are SIMD, SSE, and AVX in computer architecture?

SIMD describes computers with multiple processing elements that perform the same operation on multiple data points simultaneously. SIMD is the concept; SSE and AVX are implementations of the concept. All SIMD instruction sets are just that: sets of instructions that the CPU can execute on multiple data points.

What is SIMD in a GPU?

SIMD means a single instruction operates on multiple data elements, as in array processors and vector processors. (By contrast, MISD means multiple instructions operate on a single data element; its closest real form is the systolic array or streaming processor.) GPUs are built around wide SIMD execution units.

What does AVX do in a CPU?

AVX is short for Advanced Vector Extensions, and it is a set of CPU opcodes (the commands a processor understands) used to speed up all kinds of calculations that need vector mathematics. That is the simple explanation; the details go deep into floating-point arithmetic and the precision of operations.
