Inline Assembly and Music

For this lab (lab7), I will be revisiting the problem of scaling digital audio. In one of the previous posts, I attempted to improve on scaling operation performance using 2 enhanced algorithms: look-up table & fixed-integer arithmetic.

This post will be dedicated to code optimisation as well, although the method we will be exploring is going to be a bit more low-level. We will look at inline-assembly and how it can be used to explicitly vectorize the audio scaling code.

Comparison of 2 Methods

The task was ‘dumbed-down’ for us by the professor and, unlike previous semester students, we get the code package right away. My goal is to analyse the C source, compare the performance to the previous solution and answer some questions inserted here and there in the code.

First, I will re-run the original fixed-point solution since I ‘moved’ to a different machine. Fixed-point integer math turned out to be the best algorithm (so far) in terms of performance so I will use it as a reference; Note that the code was tested with 500000000 array length. This time, instead of using time command, I decided to benchmark the code with perf toolbox (just for the sake of diversity).

This perf command perf stat -r 10 -d ./p3 will run the binary solution 10 times and show total time elapsed as well as some other statistics. Since the output is guaranteed to be misaligned by the Jekyll theme I am using, I decided to put it into gist here.

Anyways, the most important bit of information is: 32.031639811 seconds time elapsed for 10 runs. This line L1-dcache-load-misses:u # 5.89% of all L1-dcache hits can be of some interest to us as well since lots of L1 cache misses can lead to drastic decrease in performance (meaning CPU can’t find data in its cache, you can get some overview of it here). In this particular case, the program did pretty well and we can forget about cache lines altogether.

The next step is to compare it to the vectorized code. The inline-assembly solution can be compiled with the make command make vol_simd, I also changed the default array size to 500000000.

Running with the same command : perf stat -r 10 -d ./vol_simd produces this output which shows that the vectorized code couldn’t really beat the fixed-point integer solution: 31.818169200 seconds time elapsed vs 32 seconds we got in the previous test run.

Conclusion The performance of the vectorized solution seems to be on par with the fixed-point integer algorithm (which means it is better than look-up table and naive audio scaling). Maybe if I single out the scaling code itself (rather than bench-mark the whole program), I can get more accurate results. It would be also interesting to see how much performance can change with different array lengths.

Code Analysis

Here’s the code for the vectorized sound scaling. The asm section attempts to utilise vector registers and scale 8 elements of the sound sample array simultaneously.

int main() {

        int16_t*                in;                // input array
        int16_t*                limit;                // end of input array
        int16_t*                out;                // output array

        // these variables will be used in our assembler code, so we're going
        // to hand-allocate which register they are placed in
        // Q: what is an alternate approach?
        // A: You can leave it to  the compiler to pick up registers and use % sign to
        // reference particular value
        register int16_t*        in_cursor         asm("r20");        // input cursor
        register int16_t*        out_cursor        asm("r21");        // output cursor
        register int16_t        vol_int                asm("r22");        // volume as int16_t

        int                        x;                // array interator
        int                        ttl;                // array total

        in=(int16_t*) calloc(SAMPLES, sizeof(int16_t));
        out=(int16_t*) calloc(SAMPLES, sizeof(int16_t));

        printf("Generating sample data.\n");
        for (x = 0; x < SAMPLES; x++) {
                in[x] = (rand()%65536)-32768;

// --------------------------------------------------------------------

        in_cursor = in;
        out_cursor = out;
        limit = in + SAMPLES ;

        // set vol_int to fixed-point representation of 0.5 
        // Q: should we use 32767 or 32768 in next line? why?
        // A: To prevent integer overflow, the int16_t range is 
        //  -32,768 .. 32,767 and 32768 is beyond the max value
        vol_int = (int16_t) (0.5 * 32767.0);

        printf("Scaling samples.\n");

        // Q: what does it mean to "duplicate" values here?
        // A: 'Spread' values of w22 (vol_int) into each lane 
        // of vector register v.1.8h (8 lanes in total, 
        // each of size 16bits) 
        // These copies of scalar value will be used in SIMD operation with 
        // the audio samples 
        __asm__ ("dup v1.8h,w22"); // duplicate vol_int into v1.8h
        while ( in_cursor < limit ) {
                __asm__ (
                        "ldr q0, [x20],#16                \n\t"
                        // load eight samples into q0 (v0.8h)
                        // from in_cursor, and post-increment
                        // in_cursor by 16 bytes
                        "sqdmulh v0.8h, v0.8h, v1.8h        \n\t"
                        // multiply each lane in v0 by v1*2
                        // saturate results
                        // store upper 16 bits of results into v0
                        "str q0, [x21],#16                \n\t"
                        // store eight samples to out_cursor
                        // post-increment out_cursor by 16 bytes

                        // Q: what happens if we remove the following
                        // lines? Why?
                        // A: The program will compile but running 
                        // it will result in 'Seg Fault'
                        // Reason : the asm section needs to know which values it is
                        // supposed to use and where it can put its output 
                        : "=r"(in_cursor)
                        : "r"(limit),"r"(in_cursor),"r"(out_cursor)

// --------------------------------------------------------------------

        printf("Summing samples.\n");
        for (x = 0; x < SAMPLES; x++) {

        // Q: are the results usable? are they correct?
        // A: Seems usable and correct to me :) 
        printf("Result: %d\n", ttl);

        return 0;


Assembly in Open Source Software MLT

In this part I will be looking at open source package called MLT. The goal is to find how much assembly is used and why it’s present there in the first place. MLT is an open source multimedia framework designed for television broadcasting; it also includes a multi-unit video playout server with realtime effects.

The project source can be found in this github repo which tells us immediately that the assembly is about 0.2% of the total code base (It’s unclear if it accounts for inline assembly sprinkled in C files or if it only looks at .S files, I am inclined to believe the numbers).

I found few assembly instructions written for ARM, most of the asm sections are dedicated to x86_64 SIMD operations (either SSE or its predecessor MMX).

One source file called cpu_accel.c seems to be specifically designed to determine if either MMX or SSE are supported (and what levels (e.g. SSE3,4) are present, the level of support can be inferred from CPUID). Apart from that, It seems like the logic almost ignores Arch (…and there’s some PowerPc stuff (??)).

Here’s something pretty interesting: SIMD x86_64 implementation of Sum of Absolute Differences (source file sad_sse.h ). SAD is a measure of the similarity between image blocks and it is typically used in object recognition and video compression (MLT uses it for motion estimation). Here’s a simplified example: given grey-scale 2x2 image, find the part of the original image (2x3) that is most similar to the search template (note that the example uses values from 0 to 100 rather than standard 0 to 255).

The search image can either be placed at position 1 or position 2.

If we position search image at 1 and subtract its values from the original image, we get: |10-0|=10, |60-50|=10, |90-80|=10, |90-80|=10 and the total value is 40.

The same can be applied to the second position: |10- 50|=40, |60-40|=20, |90-80|=10, |90-40|=50 and the total value is 120.

Window that has the lowest sum of absolute differences (the darkest area) is the one that is most similar to the search image. 40 is less than 120, so the search image is more likely to come from the position 1 rather than position 2. Indeed, if we look at search template values, they seem to be slightly brightened original pixels at the leftmost position.

Here’s the code snippet from MLT with vectorized SAD implementation :

// Sum two 8x1 pixel blocks
#define SAD_SSE_SUM_8(OFFSET) \
                        "movq " #OFFSET "(%0),%%mm0                \n\t"\
                        "movq " #OFFSET "(%1),%%mm1                \n\t"\
                        "psadbw %%mm1,%%mm0                        \n\t"\
                        "paddw %%mm0,%%mm6                        \n\t"\

The psadbw instruction was specifically designed for calculating the sum of absolute differences. The instruction computes the absolute differences of the packed unsigned byte integers and then sums it to produce an unsigned word integer result. I think the SIMD SAD implementation does have its place and the compiler may not be smart enough to optimize the code down to psadbw instruction.

Written on December 28, 2017