Introduction to Arm Vectorization (Lab 5)

Consider an example where you have a large amount of data and the same operation is applied over and over again in an iterative fashion:

    [ x1, x2, ..., xn ] + [ y1, y2, ..., yn ] = [ x1 + y1, x2 + y2, ..., xn + yn ]

It is important to note that all the additions here are independent, so you may immediately think that this snippet is a great candidate for parallel execution (maybe multi-threading?). The good news is you can parallelize it even if you are confined to a single CPU-core.

Vectorization provides a way to execute a single instruction on multiple data (SIMD) within a single core. It is typically accomplished with the help of special-purpose registers (vector registers of larger capacity: 128, 256, 512 bits, etc.) that can be split into segments of various sizes.

Vectorization is a special case of parallelization that can be used when operations (such as add, multiply, etc.) need to be performed on sets of values (e.g. the elements of an array). A single CPU instruction doesn't have to add only two numbers; it can perform the operation on a larger set of numbers simultaneously, resulting in obvious performance benefits.

The code snippet below creates 2 arrays with randomly generated numbers, populates a third array element-by-element with the sum of the first two, and reduces it to a single value.

#include "stdio.h"
#include "stdlib.h"
#include <time.h>

#define ASIZE 1000

int main() {
        int ar1[ASIZE] __attribute__((__aligned__(16)));
        int ar2[ASIZE] __attribute__((__aligned__(16)));

        int sum[ASIZE] __attribute__((__aligned__(16)));
        int reduced = 0;

        // First, populate the arrays with random numbers
        for (int i = 0; i < ASIZE; i++) {
                ar1[i] = ((float)rand())/ RAND_MAX * (2 * ASIZE) - ASIZE;
                ar2[i] = ((float)rand())/ RAND_MAX * (2 * ASIZE) - ASIZE;
        // Add 1st & 2nd array to sum array
        for (int i = 0; i < ASIZE; i++) {
                sum[i] = ar1[i] + ar2[i];
        // Reduce sum array to a single value
        for (int i = 0; i < ASIZE; i++) {
                reduced += sum[i];
        printf("Total sum is : %d \n", reduced);

Note the unusual definition of the arrays:

int ar1[ASIZE] __attribute__((__aligned__(16))) 

The attribute tells the compiler to align the arrays on a 16-byte boundary in memory. In theory, this should result in more efficient machine code (making use of vector registers). It is worth mentioning that when it comes to SIMD, compilers tend to be quite conservative, meaning that in many cases you need to be specific and provide some clues as to where and how to vectorise.
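
For example, one way to provide such clues (this is my own sketch, not part of the lab code; the function name add_arrays is made up) is to use restrict-qualified pointers together with GCC's __builtin_assume_aligned, so the compiler knows the arrays do not overlap and are 16-byte aligned:

// Sketch: hinting GCC that the arrays do not alias and are 16-byte aligned.
void add_arrays(int * restrict dst, const int * restrict a,
                const int * restrict b, int n) {
        int * restrict d        = __builtin_assume_aligned(dst, 16);
        const int * restrict pa = __builtin_assume_aligned(a, 16);
        const int * restrict pb = __builtin_assume_aligned(b, 16);

        // With the aliasing and alignment guarantees above, this loop
        // is a straightforward candidate for the auto-vectorizer.
        for (int i = 0; i < n; i++)
                d[i] = pa[i] + pb[i];
}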

Let’s compile the ‘C’ source and analyse the produced assembly code to see if it worked as expected:

gcc -O3 -g main.c
objdump --source  a.out | less
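
As a side note, GCC can also report which loops it managed to vectorize; assuming a reasonably recent GCC, adding -fopt-info-vec prints a short vectorization report during compilation:

gcc -O3 -g -fopt-info-vec main.c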

The disassembly looks a little bit intimidating, so it makes more sense to focus on the most important chunk: the loop containing the vector additions. I annotated the assembly code so it is easier to navigate:

// Add 1st & 2nd array to sum array

// for (int i = 0; i < ASIZE; i++) {
//     sum[i] = ar1[i] + ar2[i];
  3ce06aa0        ldr     q0, [x21,x0]     // load into q0 from memory
                                           // (address = array base
                                           //  offset by loop index x0)
                                           // q - quadword (128 bits wide)
  3ce06a81        ldr     q1, [x20,x0]     // load into q1 from memory (base + offset)
  4ea18400        add     v0.4s, v0.4s, v1.4s   // SIMD addition <-- !!
  3ca06820        str     q0, [x1,x0]           // store result back into memory
  91004000        add     x0, x0, #0x10         // increment index by 16
  f13e801f        cmp     x0, #0xfa0            // compare index to 4000
  54ffff41        b.ne    400628 <main+0xa8>    // continue looping if not equal
  4f000400        movi    v0.4s, #0x0           // zero v0 - accumulator for the reduction
  aa0103e0        mov     x0, x1                // x0 = start of the sum array
  d285e801        mov     x1, #0x2f40
  8b0103a1        add     x1, x29, x1           // x1 = end of the sum array
// }

// Reduce sum array to single value

// for (int i = 0; i < ASIZE; i++) {
//     reduced += sum[i];
  3cc10401        ldr     q1, [x0],#16           // load into q1 (128-bit register)
  4ea18400        add     v0.4s, v0.4s, v1.4s    // SIMD addition <-- !!
  eb01001f        cmp     x0, x1                  
  54ffffa1        b.ne    400654 <main+0xd4>     // continue looping if end not reached
// }

// printf("Total sum is : %d \n", reduced);
  4eb1b800        addv    s0, v0.4s              // sum the four 32-bit lanes of v0 into s0
  90000000        adrp    x0, 400000 <_init-0x4d8>
  9121c000        add     x0, x0, #0x870         // x0 = address of the format string
  0e043c01        mov     w1, v0.s[0]            // move the scalar result into w1 for printf
  97ffffbf        bl      400570 <printf@plt>
// }

Even with annotations, it is probably still unclear what exactly is going on here. We do seem to get our array addition; the loop index also gets incremented by 16 (?) and continues doing so until it reaches 4000.

I will first take a look at the ADD instruction and then go over the remaining details. Instead of general-purpose registers, the assembly instruction is using the weird-looking v0.4s register. The official Arm developer reference can give us some clues as to what is going on here:

ADD (vector)

**Syntax:** ADD Vd.T, Vn.T, Vm.T

- **Vd** - the name of the SIMD and FP destination register, in the range 0 to 31.
- **T** - an arrangement specifier; can be one of 8B, 16B, 4H, 8H, 2S, 4S or 2D.
- **Vn** - the name of the first SIMD and FP source register, in the range 0 to 31.
- **Vm** - the name of the second SIMD and FP source register, in the range 0 to 31.
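
To make the arrangement specifier concrete, here are a few examples (my own, not taken from the disassembly above) of how the same 128-bit register can be split into lanes:

add     v0.16b, v1.16b, v2.16b    // sixteen 8-bit lanes
add     v0.8h,  v1.8h,  v2.8h     // eight 16-bit lanes
add     v0.4s,  v1.4s,  v2.4s     // four 32-bit lanes  <-- our case
add     v0.2d,  v1.2d,  v2.2d     // two 64-bit lanes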

Now, let's decipher our ADD operation. This line

add     v0.4s, v0.4s, v1.4s  

means: add register v0.4s to v1.4s and save the result back into v0.4s. The Arm reference tells us that v0 and v1 stand for particular registers (special vector registers numbered 0 to 31); however, it might be unclear from the definition what the .4s part actually means. This page can give us a little bit more information on the subject:

The instruction performs a parallel addition of four lanes of 32-bit elements packed into vectors stored in 128-bit registers. In other words, it is adding 4 numbers from one array to 4 numbers from another array simultaneously, as demonstrated by the picture below:

Note that adding 4 numbers in parallel means that every time we go through the loop body, we grab 128 bits of information (4 integers) from each array. That is why the loop index is incremented by 16 bytes (128 bits) on each iteration.
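
To make this even more concrete, here is a minimal sketch (my own illustration, not the author's code or the compiler's output) of the same addition and reduction written by hand with NEON intrinsics from arm_neon.h, assuming the element count is a multiple of 4:

#include <arm_neon.h>
#include <stdint.h>

// Hand-vectorized version of the addition loop: each iteration loads
// 4 ints (128 bits) from each array, adds them lane-wise and stores
// 4 results back (mirrors the ldr/add/str sequence in the disassembly).
static void add_arrays_neon(const int32_t *a, const int32_t *b,
                            int32_t *out, int n) {
        for (int i = 0; i < n; i += 4) {
                int32x4_t va = vld1q_s32(&a[i]);        // ldr q0, [x21,x0]
                int32x4_t vb = vld1q_s32(&b[i]);        // ldr q1, [x20,x0]
                vst1q_s32(&out[i], vaddq_s32(va, vb));  // add v0.4s / str q0
        }
}

// Hand-vectorized reduction: accumulate 4 lanes at a time, then fold
// the 4 lanes into a single value (mirrors addv s0, v0.4s).
static int32_t reduce_neon(const int32_t *v, int n) {
        int32x4_t acc = vdupq_n_s32(0);                 // movi v0.4s, #0x0
        for (int i = 0; i < n; i += 4)
                acc = vaddq_s32(acc, vld1q_s32(&v[i]));
        return vaddvq_s32(acc);                         // addv s0, v0.4s
}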

Final Words

I hope this post shed some light on SIMD and introduced some basics. Here is a brief video I found interesting; the author demonstrates how SIMD can improve the performance of a picture-brightness manipulation algorithm. One last thing: when it comes to vectorization, it is important to keep in mind that not all algorithms can be vectorized, and compilers are usually reluctant to add SIMD-related optimisations. I had to modify my code several times in order to get vectorization added by the compiler.
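
As a small illustration of that last point (my own example, not from the lab): a loop with a loop-carried dependency cannot be processed 4 lanes at a time, because each iteration needs the result of the previous one, so the auto-vectorizer will normally leave it alone:

// ar1[i] needs the value of ar1[i - 1] computed in the previous
// iteration, so the iterations cannot run in independent lanes.
for (int i = 1; i < ASIZE; i++)
        ar1[i] = ar1[i - 1] + ar2[i];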

Written on November 20, 2017