Monthly Archives: December 2017

Parallelising AES-CTR with OpenMP

In order for AES (or any block cipher) to be particularly useful you generally need to use it in a “mode of operation”, which allows it to work with data much larger than a single 128-bit block. Some modes of operation use the underlying block cipher to create a stream cipher. Both AES-OFB and AES-CTR are stream ciphers that use the output of the block encryption to produce 16 bytes of stream output at a time. They operate very similarly and, when single threaded, run at the same speed.

AES-OFB keeps applying AES encryption to the same block to generate a new block each time, whereas AES-CTR applies AES encryption to a “counter” block. Cryptanalysis has shown both methods to be equally secure. AES-CTR has two huge advantages over AES-OFB, which is why, in my opinion, it should always be used in preference.

The first advantage is that you can jump to any point in the output stream without having to generate the preceding bytes. This is useful if, for example, you are appending to an encrypted file.

The second advantage is that because each block is calculated independently of the other blocks, the entire process can be parallelised. If 1 MB of data is to be encrypted with AES-CTR then on a quad core processor the task can be split into encrypting four 256 kB chunks, each performed in a separate thread. It is not possible to parallelise AES-OFB, as each block requires the previous block to have been calculated first.
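To make that independence concrete, here is a minimal sketch of CTR mode in C. The function and variable names are my own for illustration, and BlockEncrypt is a toy stand-in for real AES (so the output is in no way secure); the point is that keystream block n is derived purely from counter value n, so any block can be produced on its own, in any order, on any thread.

```c
#include <stdint.h>
#include <stddef.h>

/* Toy stand-in for AES block encryption -- NOT secure, illustration only.
   A real implementation would call an AES library here. */
static void BlockEncrypt( const uint8_t key[16], const uint8_t in[16],
                          uint8_t out[16] )
{
    for( int i = 0; i < 16; i++ )
    {
        out[i] = (uint8_t)( in[i] ^ key[i] ^ (i * 37 + 1) );
    }
}

/* Encrypts (or decrypts -- CTR is symmetric) size bytes of data in place,
   starting at block number startBlock. Each 16-byte keystream block depends
   only on its own counter value, which is what makes both random access and
   parallelisation possible. */
static void CtrCrypt( const uint8_t key[16], uint64_t startBlock,
                      uint8_t* data, size_t size )
{
    size_t numBlocks = (size + 15) / 16;

    for( size_t b = 0; b < numBlocks; b++ )
    {
        uint8_t  counter[16] = {0};
        uint8_t  keyStream[16];
        uint64_t blockNum = startBlock + b;

        /* Place the 64-bit block number big-endian in the last 8 bytes of
           the counter block (the first 8 would normally hold a nonce). */
        for( int i = 0; i < 8; i++ )
        {
            counter[15 - i] = (uint8_t)( blockNum >> (8 * i) );
        }

        BlockEncrypt( key, counter, keyStream );

        size_t blockLen = size - b * 16;
        if( blockLen > 16 ) { blockLen = 16; }
        for( size_t i = 0; i < blockLen; i++ )
        {
            data[b * 16 + i] ^= keyStream[i];
        }
    }
}
```

Encrypting just bytes 16–31 of a message is then simply CtrCrypt( key, 1, buffer + 16, 16 ); there is no need to produce block 0 first.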

Parallelising an algorithm such as AES-CTR is a bit different from regular multithreading. If the AES-CTR library had to be responsible for the threading it would take on quite a management burden, and the overhead of creating and destroying threads on each call could outweigh the improvement in speed. Alternatively, providing a set of interfaces to let the caller supply the multithreaded environment could be quite cumbersome and tricky to use. Fortunately there is a standard that is perfect for this job: OpenMP.

OpenMP is a standard that is implemented in many C compilers (including, surprisingly, Microsoft Visual Studio). It allows functions to be marked up using special pragmas so that they are parallelised when built with OpenMP support, and still run correctly in a single thread without OpenMP.

The following mark-up will cause the for loop to be parallelised with OpenMP. The loop will run on as many threads as there are processing cores (by default; this can be changed). Each thread runs a smaller subset of the range 0 to numIterations. Without OpenMP the loop just runs as normal over the entire range.

#ifdef _OPENMP
    #pragma omp parallel for
#endif
for( i=0; i<numIterations; i++ )

There are several extra clauses that can be added to control how the threads share data and so on. The #ifdef is not technically needed, as the #pragma will be ignored by compilers that don’t understand it. However some compilers warn about unknown pragmas, so it can be quieter to #ifdef it.
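As a fuller sketch, here is a self-contained function using that mark-up (hypothetical names, not taken from my library). Each 16-byte block’s work depends only on its own index, the same property AES-CTR has, so the iterations can safely be handed to different threads. It compiles and runs correctly with or without OpenMP enabled.

```c
#include <stddef.h>

/* XOR each 16-byte block of data with a keystream derived only from the
   block index. Because no iteration touches another iteration's data, the
   loop body is safe to run on several threads at once. In real AES-CTR the
   body would be: encrypt counter block i, XOR it into the data. */
static void XorBlocks( unsigned char* data, size_t numBlocks )
{
    long long i;    /* older OpenMP implementations require a signed loop counter */

#ifdef _OPENMP
    #pragma omp parallel for
#endif
    for( i = 0; i < (long long)numBlocks; i++ )
    {
        for( int j = 0; j < 16; j++ )
        {
            data[(size_t)i * 16 + j] ^= (unsigned char)( i * 16 + j );
        }
    }
}
```

Applying XorBlocks twice restores the original data, whether the code was built with -fopenmp or not.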

My AES-CTR implementation without OpenMP gave the following results on a quad-core MacBook Pro running Linux.

AES-CTR (128) 232.33 Mbps
AES-CTR (192) 203.29 Mbps
AES-CTR (256) 177.69 Mbps
RC4 368.89 Mbps

As expected RC4 is the fastest as it is much simpler.

I reworked my AES-CTR implementation to work with OpenMP and the performance increase is considerable.

AES-CTR (128) 730.00 Mbps
AES-CTR (192) 518.44 Mbps
AES-CTR (256) 489.51 Mbps
RC4 363.46 Mbps

This is over 3 times as fast. Interestingly it is not 4 times as fast, despite now running at 100% CPU usage on 4 cores instead of just 1. I assume this is because all four cores still share the same memory controller, and that becomes the bottleneck: the memory controller can’t serve all four cores simultaneously at the same speed it can serve just one.

As a side note, when I first wrote this using my original AES implementation, it was so slow that even with the parallelisation it was still outperformed by RC4! This motivated me to change the AES block cipher implementation to a faster one.

The nice thing about OpenMP is that it can just be enabled at build time without a lot of fuss (-fopenmp for gcc and /openmp for MSVC). If it is not enabled the code works fine in a single thread.

There is one disappointing thing about Microsoft’s implementation. It requires the use of a DLL which is not a system DLL. Visual Studio 2017 dynamically links OpenMP executables to VCOMP140.DLL. There is no option to statically link it. Also Apple’s version of clang that comes with Xcode does not support OpenMP.

Public Domain C source code for AES-CTR with optional OpenMP support

This is free and unencumbered software released into the public domain.



A faster AES implementation

When I added AES to CryptLib I found a nice compact public domain implementation to adapt. I liked it because its source code was quite a bit smaller than the majority of implementations I had seen, and the final binary size of the executables was small. However when I wrote a speed test for it I discovered it was incredibly slow: the trade-off for the compactness was speed. In some cases this is desirable, as the code may need to run on a very small processor where speed doesn’t particularly matter. However I wanted a faster one for CryptLib, so I found a different implementation that is almost 5 times as fast.

Public domain C source code for AES

(Alternative links: CryptLib_Aes.c and CryptLib_Aes.h)

This is free and unencumbered software released into the public domain.


Setting up a virtual Big Endian environment

I mentioned previously that it is a good idea to be able to actually test that your code works on a Big-Endian architecture. I recently updated CryptLib and made it compatible with Big-Endian architectures. The way I tested it was to use QEMU to emulate a Big-Endian MIPS processor running Linux. QEMU creates virtual machines, but unlike VMware (which virtualises the computer while running instructions natively on the hardware) it emulates the CPU. This is great because we can run an operating system built for a different chip. It is, however, extremely slow.

There is pretty much a Linux distribution for every CPU ever made, so you should be able to install a version of Linux for any chosen chip emulation in QEMU. I chose MIPS because I know it is Big-Endian and it is also the chip used in a bunch of OpenWrt routers, so I figured it would be well supported.

Installing the OS from an installation .iso takes a long time; however, I was lucky enough to find that someone had made a QEMU image of a Debian MIPS installation available on the Internet.

From that site download the two files used below: the kernel image (vmlinux-3.2.0-4-4kc-malta) and the disk image (debian_wheezy_mips_standard.qcow2).

Install QEMU if you haven’t already got it, then run the following command:

qemu-system-mips -M malta -kernel vmlinux-3.2.0-4-4kc-malta -hda debian_wheezy_mips_standard.qcow2 -append "root=/dev/sda1 console=tty0" -nographic -redir tcp:2222::22 -m 256

This tells QEMU to boot up the image with no graphics terminal, and instead output the console to stdout. This means the output will just appear in your terminal rather than coming up as a separate window.

QEMU will by default give this a NAT network interface. The -redir flag says to redirect the host port 2222 to the virtual machine’s internal 22 (SSH) port. This makes it easy to SSH into the guest from another computer on your network by SSHing to port 2222 of your host machine.

When you run this you will see Linux slowly boot up. After a while you will get a login prompt. This image has been set up with the following user accounts:

user (password user) as a standard user account.
root (password root) as root account.

It is a fairly minimal install, but it does have SSH enabled, which for my purposes was sufficient, as that is enough to copy files onto it (using scp) and then execute them.
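With the -redir flag from the command above, copying a binary onto the VM and running it from the host looks something like this (port 2222 and the user account come from the setup above; “mytest” is just a placeholder file name):

```shell
# Copy a cross-compiled test binary onto the VM (note: scp uses -P, ssh uses -p)
scp -P 2222 mytest user@localhost:

# Log in and run it on the emulated MIPS machine
ssh -p 2222 user@localhost ./mytest
```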

You can add to this image by installing gcc etc., however actually using the VM to build is extremely slow. I did attempt it, but it is not worth the effort. For example, installing CMake required building it from source. That took 24 hours!

Instead of trying to build on the VM, the best approach is to build on your actual computer using a cross-compiler.

If you are using Linux then this is easy: simply run

sudo apt-get install gcc-mips-linux-gnu g++-mips-linux-gnu

to install a MIPS cross compiler.

If you use CMake then it is very easy to use a toolchain file that is set up to use the MIPS compiler. Here is my toolchain file, which I call mips.cmake:

# toolchain for compiling Linux for MIPS

set( CMAKE_C_COMPILER mips-linux-gnu-gcc )
set( CMAKE_CXX_COMPILER mips-linux-gnu-g++ )

# Compiler settings
set( CMAKE_C_FLAGS "-Wall -Werror -Wno-deprecated-declarations" CACHE STRING "" )

if( NOT CMAKE_BUILD_TYPE )
    message( STATUS "No build type selected, default to Debug" )
    set( CMAKE_BUILD_TYPE Debug )
endif()

# Set install location

Most of this is just other settings I use. The important part for cross compiling is the block at the top where it sets up the C and C++ compilers.

The following cmake command generates a build system for a project and then builds it:

cmake -H. -Bbuild/mips -DCMAKE_TOOLCHAIN_FILE=mips.cmake
cmake --build build/mips --target install

This will generate the build system in the directory build/mips.
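Before copying anything over it is worth checking that the cross-compile actually happened; the standard `file` tool reports the target architecture (“myprogram” is a placeholder for whatever your project builds):

```shell
file build/mips/myprogram
# A Big-Endian MIPS binary reports something like:
#   ELF 32-bit MSB executable, MIPS
# where MSB ("most significant byte first") confirms it is Big Endian.
```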

Once your program is built you can scp it over to the MIPS VM and run it. I used this to test CryptLib. The MD5 and SHA1 algorithms were originally broken on Big-Endian and failed their test vectors, until I fixed them up.


Why care about Big Endian?

Since the demise of PowerPC and SPARC, there are not a lot of Big Endian machines around for most people to access. These days, apart from some large mainframes, the only Big Endian machines left tend to be network routers using MIPS (and even then a lot of MIPS devices run in Little Endian mode).

Why should you care about Big Endian? Probably for the most part you don’t ever need to. It seems likely that Little Endian has won. However you may not wish to limit yourself. You never know when your project needs to be ported to a different system that you didn’t anticipate.

When I started my programming job 15 years ago the only thing my company cared about was Windows 32 bit. Windows 2000 was the OS of choice. Mac and Linux were irrelevantly small and mobile platforms did not exist. Even 64 bit x64 had not shown up yet. Coding was easy, and fully centred around Windows APIs. Unfortunately this led to some poor assumptions and bad programming practices. A few years later we needed to support the new x64 (back when it was called AMD64). That took about 6 months of updating our code base. There were various places where assumptions had been made that a void* (PVOID in Windows.h terms) could be happily exchanged with a DWORD or ULONG (Windows terms for 32 bit unsigned ints).

After we had cleared the 64 bit hurdle we were set for many years. Cross-platform compatibility to us meant Windows with x86 or x64. Then along came the mobile platforms running on ARM chips. The big players were iOS and Android, both of which have a Unix ancestry. Before these were relevant platforms for us to code on we had started new projects and had anticipated that we would need to be cross-platform in the future. We wrote a large compatibility library rather than directly using Windows.h and its APIs. We did, however, get this quite wrong. Our compatibility library was all based around the way things work in Windows, and when we finally came to start porting it to other platforms we discovered that again we had made assumptions that didn’t hold. Now, in 2017, we build for Windows, MacOS, Linux, iOS, and Android with x86, x64, Arm, and Arm64 architectures. Build time has certainly increased! We still have to live with some of the early poor assumptions we made (such as assuming that using Windows wide chars (UTF-16) would future-proof us!) but overall we now have a pretty good handle on working across multiple platforms and architectures.

My point, after this long ramble, is that you never know where your project will go and what kind of things will be important in the future. Maybe in 10 years’ time the popular new computers will all be Big Endian. So you might just want to make sure your code works on one now, to save yourself headaches down the line.

Some people say you should never worry about Endianness and always write in an Endian-neutral manner. This may be good advice, but it is incredibly easy to slip up. It just takes writing a number to disc straight out of memory to make the code no longer Endian-neutral. So without actually being able to test it, how will you know for sure?

There are two ways of dealing with Endianness that I can see:

  1. Never copy memory bytes directly into words, e.g. never do the following: memcpy( &my32BitValue, myByteArray, sizeof(uint32_t) );. Instead always deal a byte at a time and assemble your 32-bit value with bit shifts and ORs.
  2. Just wrap everything in htons- and ntohs-style wrappers before using it.
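For approach 1, the helpers look something like this (the names are my own for illustration). Because the shifts define the byte order explicitly, the same bytes are read and written on any host, Big or Little Endian:

```c
#include <stdint.h>

/* Assemble a 32-bit value from 4 bytes stored big-endian. The shifts and
   ORs pick the bytes in a defined order, so this is Endian-neutral. */
static uint32_t Read32BigEndian( const uint8_t bytes[4] )
{
    return  ((uint32_t)bytes[0] << 24)
          | ((uint32_t)bytes[1] << 16)
          | ((uint32_t)bytes[2] << 8)
          |  (uint32_t)bytes[3];
}

/* Write a 32-bit value out as 4 big-endian bytes, one byte at a time. */
static void Write32BigEndian( uint32_t value, uint8_t bytes[4] )
{
    bytes[0] = (uint8_t)( value >> 24 );
    bytes[1] = (uint8_t)( value >> 16 );
    bytes[2] = (uint8_t)( value >> 8 );
    bytes[3] = (uint8_t)( value );
}
```

Write32BigEndian( 0x01020304, buf ) always produces the bytes 01 02 03 04 regardless of the host architecture, and Read32BigEndian reverses it, which is exactly what writing straight out of memory does not guarantee.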

You will still want to test that it actually worked though. This is the rather tricky part as you most likely won’t have access to a Big Endian machine. However it is possible to emulate one using QEMU. In the next article I will give some concise instructions to get you up and going with an emulated MIPS Linux setup.


CryptLib 2.0.0

After a four year gap, I have updated CryptLib at GitHub. I have now released version 2.0.0.


  • Added AES and AES-CTR modules. AES-CTR conforms to the same counter mode used with AES in OpenSSL.
  • All algorithms now work on Big-Endian architectures.
  • Now uses CMake for building rather than makefiles and Visual Studio projects. CMake will generate whatever build system is required.
  • Input function parameters are now marked const.
  • File names have been changed to have the prefix CryptLib_ rather than Lib.
  • Various formatting changes to the files.

This is free and unencumbered software released into the public domain.