Daily Archives: December 11, 2017

Parallelising AES-CTR with OpenMP

In order for AES (or any block cipher) to be particularly useful you generally need to use it in a “mode of operation” which allows it to work with much larger data than a single 128 bit block. Some modes of operation use the underlying block cipher and create a stream cipher. Both AES-OFB and AES-CTR are stream ciphers that use the outputs of the block encryption to produce 16 bytes of stream output at a time. They operate very similarly and when single threaded are the same speed.

AES-OFB keeps on applying the AES encryption on the same block to generate a new block each time. Whereas AES-CTR apples AES encryption on a “counter” block. Cryptanalysis has shown both methods to be equally secure. AES-CTR has two huge advantages over AES-OFB which is why in my opinion it should be used in preference always.

The first advantage is that you can jump to any point in the output stream without having to generate preceding bytes. This would be useful if you were appending to a file for example.

The second advantage is that because each block is calculated independently of other blocks, the entire process can be parallelised. If 1 MB of data is to be encrypted with AES-CTR then on a quad core processor the task can be split into encrypting 256 kB, each performed in a seperate thread. It is not possible to parallelise AES-OFB as each block requires the previous block to have been calculated first.

Parallelising an algorithm such as AES-CTR is a bit different from regular multithreading. If the AES-CTR library had to be responsible for the threading it puts in quite a burden of management, and also the overhead of bringing up threads and closing them each time its called could be more than the improvements in speed. Alternatively providing a set of interfaces to allow the caller to provide the multithread environment could be quite cumbersome and tricky to use. Fortunately there is a standard that is perfect for this job, OpenMP.

OpenMP is a standard that is implemented in many C compilers (including, surprisingly, Microsoft Visual Studio). This allows functions to be marked-up using special pragmas that will allow them to be parallelised when built with OpenMP support, and also run correctly in a single thread without OpenMP.

The following mark up will cause the for loop to be parallelised with OpenMP. The for loop will run on as many threads as there are processing cores (by default, this can be changed). Each thread will run a smaller subset of the range i – numIterations. Without OpenMP this will just run as normal over the entire range.

#ifdef _OPENMP
    #pragma omp parallel for
#endif
for( i=0; i<numIterations; i++ )
{
    ...
}

There are several extra markups that can be added that control how the threads share data etc. The #ifdef is not technically needed as the #pragma will be ignored by compilers that don’t understand it. However some compilers will warn about unknown pragmas so it can be quieter to #ifdef it.

My AES-CTR implementation without OpenMP gave the following results on a quad core MacBookPro running Linux.

AES-CTR (128) 232.33 Mbps
AES-CTR (192) 203.29 Mbps
AES-CTR (256) 177.69 Mbps
RC4 368.89 Mbps

As expected RC4 is the fastest as it is much simpler.

I reworked my AES-CTR implementation to work with OpenMP and the performance increase is considerable.

AES-CTR (128) 730.00 Mbps
AES-CTR (192) 518.44 Mbps
AES-CTR (256) 489.51 Mbps
RC4 363.46 Mbps

This is over 3 times as fast. Interestingly it is not 4 times as fast, despite now running 100% CPU usage over 4 cores instead of just 1. I assume the reason is because the 4 processors still need to access the same memory controller and that becomes the bottleneck. The memory controller can’t serve all four cores simultaneously at the same speed it can service just one.

As a side note, when I first wrote this using my original AES implementation, it was so slow that even with the paralysation it was still out performed by RC4! This motivated me to change the AES block cipher implementation to a faster one.

The nice thing about OpenMP is that it can just be enabled at build time without a lot of fuss (-fopenmp for gcc and /openmp for MSVC). If it is not enabled the code works fine in a single thread.

There is one disappointing thing about Microsoft’s implementation. It requires the use of a DLL which is not a system DLL. Visual Studio 2017 dynamically links OpenMP executables to¬†VCOMP140.DLL. There is no option to statically link it. Also Apple’s version of clang that comes with Xcode does not support OpenMP.

Public Domain C source code for AES-CTR with optional OpenMP support

This is free and unencumbered software released into the public domain.

 

Advertisements

A faster AES implementation

When I added AES to CryptLib I found a nice compact public domain implementation to adapt. I liked it because it was quite a bit smaller in source code than the majority of implementations I had seen, and the final binary size of the executables were small. However when I wrote a speed test for it I discovered it had incredibly poor speed. The trade-off for the compactness was speed. In some cases this is desirable as it may need to be in a very small processor and doesn’t have to run particularly fast. However I wanted a faster one for CryptLib so I found a different implementation that is almost 5 times as fast.

Public domain C source code for AES

(Alternative links: CryptLib_Aes.c and CryptLib_Aes.h)

This is free and unencumbered software released into the public domain.