Spo - Stage 2

Since my last post dedicated to the final project, I decided to focus on a different algorithm: CRC. I found the CRC code to be more interesting, and I think it offers more options for performance enhancement. I showed in my previous project post that the Boost murmur hash function gets vectorized and that the optimizations done by the compiler are really effective.

For CRC performance improvements, I came up with the following plan:

  1. try out compiler options
  2. verify that the advanced options are ‘safe’ to use
  3. explore other optimization methods (e.g. code additions)

Compiler Options

In this section, I will utilize GCC compiler options and see how platform-specific and optimization flags affect the overall performance of the CRC algorithm in the Boost library.

I will use this text file as test data and re-shuffle it during each iteration so that a new CRC value is produced. The checksum generation and re-shuffling will be done 3000 times for every run.

No Optimization

I will start with the no-optimization option: g++ -O0 -I boost_1_65_1 crc_benchmark.cpp -o crc_benchmark.

The benchmarking code (link) is run 5 times with the perf command; the average time per CRC call is about 2.671 ms:

[obelavina@bbetty boost]$ make test_benchmark 
g++ -O0 -I boost_1_65_1 crc_benchmark.cpp -o crc_benchmark
perf stat -r 5 -o crc_benchmark.O0.perf -d ./crc_benchmark gutenberg.dat
Started calculating CRCs 
Finished calculating CRCs 
-| Average time taken is : 2.671000 ms
-| Total time spent on CRC : 8014.320000 ms
Started calculating CRCs 
Finished calculating CRCs 
-| Average time taken is : 2.671000 ms
-| Total time spent on CRC : 8014.451000 ms
Started calculating CRCs 
Finished calculating CRCs 
-| Average time taken is : 2.671000 ms
-| Total time spent on CRC : 8014.452000 ms
Started calculating CRCs 
Finished calculating CRCs 
-| Average time taken is : 2.671000 ms
-| Total time spent on CRC : 8014.457000 ms
Started calculating CRCs 
Finished calculating CRCs 
-| Average time taken is : 2.673000 ms
-| Total time spent on CRC : 8020.216000 ms

Flag -O2

The next compiler option increases compilation time as well as the performance of the generated code. Using -O2 resulted in much faster execution (about 6 times as fast as the unoptimized run):

[obelavina@bbetty boost]$ make test_benchmark 
g++ -O2 -I boost_1_65_1 crc_benchmark.cpp -o crc_benchmark
perf stat -r 5 -o crc_benchmark.O2.perf -d ./crc_benchmark gutenberg.dat
Started calculating CRCs 
Finished calculating CRCs 
-| Average time taken is : 0.438000 ms
-| Total time spent on CRC : 1315.584000 ms
Started calculating CRCs 
Finished calculating CRCs 
-| Average time taken is : 0.439000 ms
-| Total time spent on CRC : 1317.722000 ms
Started calculating CRCs 
Finished calculating CRCs 
-| Average time taken is : 0.438000 ms
-| Total time spent on CRC : 1314.597000 ms
Started calculating CRCs 
Finished calculating CRCs 
-| Average time taken is : 0.438000 ms
-| Total time spent on CRC : 1314.608000 ms
Started calculating CRCs 
Finished calculating CRCs 
-| Average time taken is : 0.438000 ms
-| Total time spent on CRC : 1314.579000 ms

Flag -O3

Applying the ‘highest’ optimization level (-O3) resulted in pretty much the same performance as -O2 (0.438 ms average per CRC function call):

[obelavina@bbetty boost]$ make test_benchmark
g++ -O3 -I boost_1_65_1 crc_benchmark.cpp -o crc_benchmark
perf stat -r 5 -o crc_benchmark.O3.perf -d ./crc_benchmark gutenberg.dat
Started calculating CRCs 
Finished calculating CRCs 
-| Average time taken is : 0.438000 ms
-| Total time spent on CRC : 1314.738000 ms
Started calculating CRCs 
Finished calculating CRCs 
-| Average time taken is : 0.438000 ms
-| Total time spent on CRC : 1314.663000 ms
Started calculating CRCs 
Finished calculating CRCs 
-| Average time taken is : 0.438000 ms
-| Total time spent on CRC : 1314.617000 ms
Started calculating CRCs 
Finished calculating CRCs 
-| Average time taken is : 0.438000 ms
-| Total time spent on CRC : 1314.977000 ms
Started calculating CRCs 
Finished calculating CRCs 
-| Average time taken is : 0.438000 ms
-| Total time spent on CRC : 1314.612000 ms

Auto Code Vectorization

The -fopt-info-vec-all GCC option produces a report on the auto-vectorization performed by the compiler. Unfortunately, the code was not vectorized at all. The CRC algorithm is rather tough to parallelize with SIMD: it is based on a division operation (or rather, XOR division) that takes a large amount of data and divides it by a special value (the polynomial); the quotient ends up thrown away, but the remainder is used as the data checksum.
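To make the XOR-division idea concrete, here is a minimal bit-at-a-time CRC-32 sketch of my own (reflected form, polynomial 0xEDB88320); it is not the Boost implementation, but it performs the same division:

```cpp
#include <cstddef>
#include <cstdint>

// One XOR-division step per input bit: the quotient is never kept,
// only the running remainder, which becomes the checksum at the end.
uint32_t crc32_bitwise(const unsigned char* data, std::size_t len) {
    uint32_t rem = 0xFFFFFFFFu;                   // initial remainder
    for (std::size_t i = 0; i < len; ++i) {
        rem ^= data[i];                           // bring in the next byte
        for (int bit = 0; bit < 8; ++bit)
            rem = (rem & 1u) ? (rem >> 1) ^ 0xEDB88320u
                             : (rem >> 1);
    }
    return rem ^ 0xFFFFFFFFu;                     // final XOR
}
```

The inner loop is exactly the dependency chain that defeats auto-vectorization: every step consumes the remainder produced by the previous one.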

The Boost code below does exactly that and, as you can see, the remainder is a loop-carried dependency: the next iteration uses the value produced by the previous one.

void BOOST_CRC_OPTIMAL_NAME::process_block
( void const *  bytes_begin,  void const *  bytes_end) {
    // Recompute the CRC for each byte passed
    for ( unsigned char const * p
     = static_cast<unsigned char const *>(bytes_begin) ; p < bytes_end ; ++p )
    {
        // Compare the new byte with the remainder's higher bits to
        // get the new bits, shift out the remainder's current higher
        // bits, and update the remainder with the polynominal division
        // of the new bits.
        unsigned char const  byte_index = helper_type::index( rem_, *p );
        rem_ = helper_type::shift( rem_ );
        rem_ ^= crc_table_type::table_[ byte_index ];
    }
}

Platform Specific Flags

I also tried some platform-specific flags, such as enabling the crc32 and crypto extensions and setting the CPU type, but I did not get any significant improvements. The disassembled code revealed that crc32 instructions were not emitted, and enabling the crypto features did not introduce any changes either.

Optimization & Safety

Using some advanced optimization flags (e.g. -O3) may not guarantee the same results (i.e. they may not be safe). I think it is a good idea to verify that the binaries produced by -O2 and -O3 output the same checksums, which only requires some minor changes to the benchmarking code.

Since the string shuffle function produces the same output for every program run, I decided to save the checksums generated by the ‘safe’ -O0 version in a .txt file:

std::ofstream of("generated_checksums.txt");
for (int i=0; i<NUM_CRC; i++) {
  of << std::to_string(checksums[i]) << std::endl;
}

so that the produced checksums can be checked by the other builds (e.g. -O2 and -O3):

// o0checksums holds the reference values read back from generated_checksums.txt
for (int i = 0; i < NUM_CRC; i++) {
  if (std::to_string(checksums[i]) != o0checksums[i]) {
    std::cout << std::to_string(checksums[i]) << " =/= " << o0checksums[i] << std::endl;
  }
}

I tested both the -O2 and -O3 binaries and did not get any mismatched checksums, so the advanced optimization flags can be used without any concerns!
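Putting the two snippets together, the whole check can be sketched end-to-end as follows (write_reference, read_reference and all_match are hypothetical helper names of mine, not from the benchmark code):

```cpp
#include <cstddef>
#include <fstream>
#include <string>
#include <vector>

// Save the -O0 reference checksums, one per line.
void write_reference(const char* path, const std::vector<unsigned long>& sums) {
    std::ofstream of(path);
    for (unsigned long s : sums)
        of << s << "\n";
}

// Read the reference file back as strings, one per line.
std::vector<std::string> read_reference(const char* path) {
    std::vector<std::string> lines;
    std::ifstream in(path);
    for (std::string line; std::getline(in, line); )
        lines.push_back(line);
    return lines;
}

// Compare freshly computed checksums against the stored reference.
bool all_match(const std::vector<unsigned long>& sums,
               const std::vector<std::string>& ref) {
    if (sums.size() != ref.size()) return false;
    for (std::size_t i = 0; i < sums.size(); ++i)
        if (std::to_string(sums[i]) != ref[i]) return false;
    return true;
}
```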

Code Experiments

I found it hard to work with Boost since the library is full of boilerplate code and macro workarounds, so extracting individual functions turned out to be quite a task. For some code additions and experiments, I decided to focus on a much simpler example that uses nearly identical logic; I think those ideas can be transferred to the library itself. The code I am talking about can be found here.

Overview of Benchmarking

The benchmarking (link source) for this part is the same as the one used for Boost. The program is supplied with a large text file (gutenberg.dat), which gets reshuffled and reduced to a single checksum during each iteration (3000 iterations in total). Note that the random string shuffle takes quite some time, so I had to restrict the performance-measuring code to the CRC-relevant sections only.
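That restriction can be sketched like this: start the clock after the shuffle and stop it right after the checksum call (crc_compute below is a stand-in of mine, not the benchmark's actual function):

```cpp
#include <chrono>
#include <cstddef>
#include <cstdint>

// Placeholder for the real checksum routine.
uint32_t crc_compute(const unsigned char* p, std::size_t n) {
    uint32_t acc = 0;
    for (std::size_t i = 0; i < n; ++i)
        acc = (acc << 1) ^ p[i];
    return acc;
}

// Accumulate only the time spent inside the CRC call itself.
double time_crc_ms(const unsigned char* data, std::size_t n, int iterations) {
    double total_ms = 0.0;
    for (int i = 0; i < iterations; ++i) {
        // (the expensive string re-shuffle would run here, untimed)
        auto t0 = std::chrono::steady_clock::now();
        volatile uint32_t sink = crc_compute(data, n);  // keep the call alive
        (void)sink;
        auto t1 = std::chrono::steady_clock::now();
        total_ms += std::chrono::duration<double, std::milli>(t1 - t0).count();
    }
    return total_ms;
}
```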

Reflection

I decided to focus on the bit-reflection function, which is called by the main CRC algorithm:

crc
crcFast(unsigned char const message[], int nBytes)
{
    crc remainder = INITIAL_REMAINDER;
    unsigned char  data;
    int            byte;
     
    /*
     * Divide the message by the polynomial, a byte at a time.
     */
    for (byte = 0; byte < nBytes; ++byte)
    {
        data = REFLECT_DATA(message[byte]) ^ (remainder >> (WIDTH - 8));
        remainder = crcTable[data] ^ (remainder << 8);
    }

    /*
     * The final remainder is the CRC.
     */
    return (REFLECT_REMAINDER(remainder) ^ FINAL_XOR_VALUE);

} 

Boost CRC performs bit reflection too; some CRC protocols (e.g. CRC-32) require the bits to be in reverse order. From this article on CRC:

The basic idea is to reverse the bit ordering of each byte within the message and/or the final remainder. The reason this is sometimes done is that a good number of the hardware CRC implementations operate on the “reflected” bit ordering of bytes that is common with some UARTs.

The function below does exactly that: it mirrors the bits and returns the ‘reflected’ result:

static unsigned long reflect(unsigned long data, unsigned char nBits)
{
    unsigned long  reflection = 0x00000000;
    unsigned char  bit;

    /*
     * Reflect the data about the center bit.
     */
    for (bit = 0; bit < nBits; ++bit)
    {
        /*
         * If the LSB bit is set, set the reflection of it.
         */
        if (data & 0x01)
        {
            reflection |= (1 << ((nBits - 1) - bit));
        }

        data = (data >> 1);
    }

    return (reflection);
}   /* reflect() */

I decided to try to improve the code by substituting the for loop and the loop body with rbit and right-shift inline instructions. The ARM documentation page tells us that RBIT reverses the bit order.

static uint32_t
reflect_alt(uint32_t data, unsigned char nBits)
{
        uint32_t  reflection = 0x00000000;
        uint32_t* r_cursor = &reflection;

        asm (
          "rbit %[data], %[data]         \n\t"
          "asr  %[data], %[data], 56     \n\t"
          "strh %w[data], [%[r_cursor]]  \n\t"
        :"=r"(r_cursor):[r_cursor]"r"(r_cursor), [data]"r"(data));

        return (reflection);
}

I think the code can be improved even further by removing the store instruction and combining the reflection and remainder calculations.
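For comparison, on machines without rbit the per-byte reflection can also be made branchless in portable C++ using the classic multiply/mask/modulus bit trick; this is my own addition, not part of the benchmark:

```cpp
#include <cstdint>

// Reverse the bits of one byte with a 64-bit multiply, mask and modulus.
// The multiply spreads the byte into several shifted copies, the mask
// selects one bit from each copy, and % 1023 folds them into reversed order.
static uint8_t reflect8(uint8_t b) {
    return static_cast<uint8_t>(((b * 0x0202020202ULL) & 0x010884422010ULL) % 1023);
}
```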

Running with -O0

Compiling with g++:

g++ main.cpp crc.c -o crc -O0

This resulted in much better performance for the modified function (9469.645000 ms vs. 29260.509000 ms for the original), and all the checksums seem to be equal:

[obelavina@bbetty bgroup]$ ./crc gutenberg.dat 
The check value for the CRC-32 standard is 0xCBF43926
The crcFast() of "123456789" is 0xCBF43926
The crcFastAlt() of "123456789" is 0xCBF43926
Started calculating CRCs 
Finished calculating CRCs 
Time taken on Alternative CRC: 
-| Average time taken is : 3.156000 ms
-| Total time spent on CRC : 9469.645000 ms
Time taken on Original CRC: 
-| Average time taken is : 9.753000 ms
-| Total time spent on CRC : 29260.509000 ms

Running with -O3

I ran into some issues when I compiled the code with the -O3 flag: the checksums of the modified function weren’t equal to the checksums produced by the original code:

...
...
Crc #2988 : 1942431320 =/= 2228724315 
Crc #2989 : 1942431320 =/= 1217714242 
Crc #2990 : 1942431320 =/= 1207245759 
Crc #2991 : 1942431320 =/= 4109246178 
Crc #2992 : 1942431320 =/= 1070734304 
Crc #2993 : 1942431320 =/= 3957145584 
Crc #2994 : 1942431320 =/= 1131823201 
Crc #2995 : 1942431320 =/= 932842861 
Crc #2996 : 1942431320 =/= 965216922 
Crc #2997 : 1942431320 =/= 2602600163 
Crc #2998 : 1942431320 =/= 1923685221 
Crc #2999 : 1942431320 =/= 2793478618 
Some CRC(s) do not match! 

Adding the volatile keyword to the asm section did not make any difference, but adding ‘memory’ to the clobber list solved the problem:

 asm volatile(
  "rbit %[data], %[data]         \n\t"
  "asr  %[data], %[data], 56     \n\t"
  "strh %w[data], [%[r_cursor]]  \n\t"         
:"=r"(r_cursor):[r_cursor]"r"(r_cursor), [data]"r"(data):"memory");

Here’s the produced result (working as expected). As you can see, reversing with the inline instructions is roughly twice as fast as the original code.

[obelavina@bbetty bgroup]$ ./crc gutenberg.dat 
The check value for the CRC-32 standard is 0xCBF43926
The crcFast() of "123456789" is 0xCBF43926
The crcFastAlt() of "123456789" is 0xCBF43926
Started calculating CRCs 
Finished calculating CRCs 
Time taken on Alternative CRC: 
-| Average time taken is : 0.509000 ms
-| Total time spent on CRC : 1528.186000 ms
Time taken on Original CRC: 
-| Average time taken is : 1.169000 ms
-| Total time spent on CRC : 3508.637000 ms

I think a better way to improve the reflection function on an ARM machine would be to utilize endianness. Still, I found it quite interesting to try out some inline assembly additions, and the resulting binary seems to be quite effective too.
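A rough sketch of that endianness idea: rev (or GCC’s __builtin_bswap32) reverses the byte order, so a full 32-bit reflection only needs the bits within each byte reversed as well. This is my own illustration, not something I benchmarked:

```cpp
#include <cstdint>

// Reverse the bits within one byte by swapping nibbles, pairs, then bits.
static uint8_t reverse_byte(uint8_t b) {
    b = static_cast<uint8_t>((b & 0xF0) >> 4 | (b & 0x0F) << 4);
    b = static_cast<uint8_t>((b & 0xCC) >> 2 | (b & 0x33) << 2);
    b = static_cast<uint8_t>((b & 0xAA) >> 1 | (b & 0x55) << 1);
    return b;
}

// Full 32-bit reflection = byte swap (endianness reversal) + per-byte
// bit reversal. On ARM the byte swap maps to a single rev instruction.
static uint32_t reflect32(uint32_t x) {
    uint32_t swapped = __builtin_bswap32(x);
    uint32_t out = 0;
    for (int i = 0; i < 4; ++i)
        out |= static_cast<uint32_t>(reverse_byte(
                   static_cast<uint8_t>(swapped >> (8 * i)))) << (8 * i);
    return out;
}
```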

Failed Experiments (Hall of Shame)

Initially, I planned to utilize the vector registers as a look-up table, since there are 31 of them and they are barely used. Fast CRC algorithms store precomputed remainder values in a look-up table to speed up the calculations. You can fit roughly half of the CRC-32 table in the vector register space, and almost all of the precomputed values for a 16-bit CRC (e.g. the CRC-CCITT standard).
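For reference, such a table is generated by running the division once for every possible input byte; the sketch below builds the 256-entry CRC-CCITT table (polynomial 0x1021), whose 512 bytes are what would almost fit into the vector register space:

```cpp
#include <cstdint>

// 256 entries x 2 bytes = 512 bytes for a 16-bit CRC look-up table.
static uint16_t crc_ccitt_table[256];

void build_ccitt_table() {
    for (int dividend = 0; dividend < 256; ++dividend) {
        uint16_t rem = static_cast<uint16_t>(dividend << 8);
        for (int bit = 0; bit < 8; ++bit)        // divide one byte's worth of bits
            rem = (rem & 0x8000) ? static_cast<uint16_t>((rem << 1) ^ 0x1021)
                                 : static_cast<uint16_t>(rem << 1);
        crc_ccitt_table[dividend] = rem;
    }
}
```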

I decided to play around with the ARM intrinsics and see if the idea could be implemented with something more straightforward than inline assembly. I soon realized that those built-in ARM functions work in quite bizarre ways; e.g. I naively expected the code snippet below to populate the entire vector register space, but it was simply reloading the same v0 over and over again:

uint32x4_t vec_table[31];

void populateVecTbl() {
    __builtin_prefetch(&crcTable[124]);
    for (int dividend = 0; dividend < 124; dividend += 4) {
        int tbl_i = dividend / 4;
        vec_table[tbl_i] = vld1q_u32(&crcTable[dividend]);
    }
}

I also tried to use inline assembly, but I ran into another issue: I had two asm sections (one loading the table into the vector registers and the second performing the lookup operation). Unfortunately, the vector registers ended up being overwritten even though there were no function calls between the two sections. I eventually dropped the idea of keeping the look-up table in vector registers, since I figured the loop-carried dependency reduces the efficiency of a SIMD lookup instruction (you cannot look up several indices at the same time).

Final Words

I think the journey was quite bumpy, but I really enjoyed working with the source code. I ran into some really ambiguous issues (e.g. the weird behaviour of the intrinsics) and hit several dead ends, so I am glad I got a chance to improve my debugging skills: GDB is my best friend now, and I no longer think it is the most boring debugging tool out there. Overall, the compiler flags turned out to be the most reliable way to get a performance boost, but trying out several code-level optimizations showed that there is some room for source-code improvement as well.

Written on January 8, 2018