AES-CBC

The previous two cipher modes of AES I wrote into WjCryptLib were AES-CTR and AES-OFB. Both of these turn AES into a stream cipher, and in both cases only the AES block encrypt function is used. So today I add AES-CBC (Cipher Block Chaining) mode to the library. I don’t particularly like CBC as a mode personally; however, it is one of the most common modes in use, so I wanted to include it in the library.

Cipher Block Chaining mode works by XORing the previous cipher block onto the plaintext before performing the block encrypt. An IV is used as the first “previous cipher block”. A change in a byte of plaintext will cause all the following cipher text to be different. A disadvantage of the mode is that it has to work with a whole number of blocks (16 bytes each in the case of AES). This limitation is usually overcome by padding the last block and keeping a count of the actual data. There is also a fancier technique called cipher text stealing which reduces the limitation to only requiring a minimum of a single block. I have not included this technique in my implementation.

CBC is not a stream cipher mode, in that it does not generate a parallel stream of bytes that is then applied (usually with XOR) onto the input stream. Instead CBC uses the block encrypt and decrypt functions directly on the input data.
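To make the chaining concrete, here is a minimal sketch of the encrypt side. It assumes a generic single-block encrypt function; BlockEncrypt and CbcEncrypt are illustrative names, not the actual WjCryptLib API.

#include <stdint.h>
#include <string.h>

#define BLOCK_SIZE 16

/* Illustrative stand-in for the AES single-block encrypt provided by the library. */
void BlockEncrypt( void const* key, uint8_t const in[BLOCK_SIZE], uint8_t out[BLOCK_SIZE] );

/* Encrypt numBlocks whole blocks in CBC mode. The data must be a multiple of 16 bytes. */
void CbcEncrypt( void const* key, uint8_t const iv[BLOCK_SIZE],
                 uint8_t const* plaintext, uint8_t* ciphertext, size_t numBlocks )
{
    uint8_t chain[BLOCK_SIZE];
    uint8_t block[BLOCK_SIZE];
    size_t  b;
    size_t  i;

    memcpy( chain, iv, BLOCK_SIZE );                    /* the IV is the first "previous cipher block" */
    for( b=0; b<numBlocks; b++ )
    {
        for( i=0; i<BLOCK_SIZE; i++ )                   /* XOR the previous cipher block onto the plaintext */
        {
            block[i] = plaintext[b*BLOCK_SIZE + i] ^ chain[i];
        }
        BlockEncrypt( key, block, &ciphertext[b*BLOCK_SIZE] );
        memcpy( chain, &ciphertext[b*BLOCK_SIZE], BLOCK_SIZE );   /* chains into the next block */
    }
}

Decryption runs the chain in the other direction: block decrypt the cipher block first, then XOR the previous cipher block back off, which is why CBC needs the block decrypt function as well as the block encrypt.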

I have released WjCryptLib 2.3.0 which contains AES-CBC.

The relevant source files needed are:

This is free and unencumbered software released into the public domain.

 


Pushing old data off a disc

I often like to clean out deleted data from discs. Especially ones that are going to be recycled and used by other people. The problem I find is that secure wiping programs are just too slow. I don’t need that level of protection, I just want to quickly write over every block on the disc.

Recently I wanted to wipe a large (1 TB) drive. It had been formatted and I just wanted to fill up all the blocks on it with a huge file and then delete the file. This way I would be fairly confident that every block on the disc had been overwritten. The fastest way is to copy /dev/zero into a file on the disc. However I never feel confident that writing zeros actually overwrites anything. It would be very easy for the underlying device to simply mark the block as all zero rather than physically writing it. I believe the old Zip drive discs did something like this.

Instead of using /dev/zero the obvious solution is to use /dev/random or /dev/urandom. However these are far slower due to generating cryptographically secure random numbers, which is overkill for what I was trying to do. I just wanted to ensure that something was written. In the end I opted for making some big files with /dev/random and then appending the file over and over until the disc was full.

It seemed unlikely that the device would be able to detect repeated blocks being written but it still niggled at me. Also it was not a particularly convenient method. I just wanted to run something and leave it until the disc was full. So I wrote a small tool called PushFill. This will keep writing data to a file until it runs out of space.

This uses RC4 to create a random 10 MB block of data which it writes to the file. It then writes the same 10 MB another 255 times, each time with every byte incremented by 1. After 256 writes (2.5 GB) it starts again with another 10 MB block from the RC4 stream. This way the RC4 generator is only used for a small percentage of the time and therefore does not slow down the writing. The step of incrementing each byte in the block by 1 is barely noticeable.
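The core loop looks roughly like this. It is a sketch of the scheme just described, not the actual PushFill source; GenerateRc4Stream is a hypothetical stand-in for whatever RC4 keystream call is used.

#include <stdint.h>
#include <stdio.h>

#define CHUNK_SIZE  (10 * 1024 * 1024)          /* 10 MB chunk */

/* Hypothetical stand-in: fills buffer with size bytes of RC4 keystream. */
void GenerateRc4Stream( uint8_t* buffer, size_t size );

int main( int argc, char** argv )
{
    static uint8_t chunk[CHUNK_SIZE];
    FILE*   file;
    int     pass;
    size_t  i;

    if( argc < 2 ) { fprintf( stderr, "Usage: PushFill <filename>\n" ); return 1; }
    file = fopen( argv[1], "ab" );               /* create, or append to, the specified file */
    if( NULL == file ) { perror( "fopen" ); return 1; }

    for( ;; )
    {
        GenerateRc4Stream( chunk, CHUNK_SIZE );  /* a fresh 10 MB of keystream */
        for( pass=0; pass<256; pass++ )
        {
            if( fwrite( chunk, 1, CHUNK_SIZE, file ) != CHUNK_SIZE )
            {
                fclose( file );                  /* disc full (or write error): stop */
                return 0;
            }
            for( i=0; i<CHUNK_SIZE; i++ )        /* increment every byte for the next pass */
            {
                chunk[i] += 1;
            }
        }
    }
}

The real tool also tracks and prints the write rates shown further down; that bookkeeping is omitted here.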

The advantage of this method is that it is very fast, while still making every single block written different. Therefore the underlying system cannot do any smart cheats such as noticing repeated blocks (think of how Dropbox works, where each unique block is hashed and only physically stored once ever). Additionally the output of RC4 prevents any disc compression from being able to use fewer physical blocks to store the data.

The syntax is simple:

PushFill <filename>

This will create or append to the specified filename. It will keep on writing until the disc is full, or the program is aborted (Ctrl-C).

Every two seconds the program will display how much it wrote in that time along with its rate. It will also display the total amount written so far and the average rate.

A sample output:

Block:    1.9 GB  Rate:  948.1 MBps  |  Total:    1.9 GB  AvgRate:  948.1 MBps
Block:    2.1 GB  Rate:    1.1 GBps  |  Total:    4.0 GB  AvgRate: 1017.2 MBps
Block:    2.3 GB  Rate:    1.1 GBps  |  Total:    6.3 GB  AvgRate:    1.0 GBps

 

Some systems will cache writes, so the first few seconds will show a much higher rate than it is actually writing to the disc.

Compiled binaries for the program are available here. The package contains binaries for Windows x64, MacOS x64, Linux x64, and Linux ARM (e.g. a Raspberry Pi).

Full source code available on GitHub here.

This is free and unencumbered software released into the public domain.

 

AES-OFB

I previously added AES-CTR to my library WjCryptLib. AES-CTR is by far the best way to use AES as a stream cipher. However it was not the first mode of operation devised for using a block cipher as a stream cipher. Output Feedback Mode (OFB) was one of the original modes of operation specified for early block ciphers like DES. The way OFB works is to start with an IV the size of a block and repeatedly encrypt it. Each encryption produces another block’s worth of keystream bytes.
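As a minimal sketch (BlockEncrypt again stands in for a generic single-block encrypt and is not the actual WjCryptLib API), the whole mode is just a feedback register that gets re-encrypted for every 16 bytes:

#include <stdint.h>
#include <string.h>

#define BLOCK_SIZE 16

/* Illustrative stand-in for the AES single-block encrypt. */
void BlockEncrypt( void const* key, uint8_t const in[BLOCK_SIZE], uint8_t out[BLOCK_SIZE] );

/* OFB mode: repeatedly encrypt the IV; each encryption yields 16 more keystream bytes
   which are XORed onto the data. The same function performs encryption and decryption. */
void OfbXor( void const* key, uint8_t const iv[BLOCK_SIZE], uint8_t* data, size_t numBlocks )
{
    uint8_t feedback[BLOCK_SIZE];
    uint8_t keystream[BLOCK_SIZE];
    size_t  b;
    size_t  i;

    memcpy( feedback, iv, BLOCK_SIZE );
    for( b=0; b<numBlocks; b++ )
    {
        BlockEncrypt( key, feedback, keystream );       /* the next 16 bytes of keystream */
        memcpy( feedback, keystream, BLOCK_SIZE );      /* feed the output back in for the next block */
        for( i=0; i<BLOCK_SIZE; i++ )
        {
            data[b*BLOCK_SIZE + i] ^= keystream[i];
        }
    }
}

Because each feedback value depends on the previous one, keystream block b cannot be produced until block b-1 has been, which is the root of the limitations discussed next.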

When running as a single thread AES-OFB is exactly the same speed as AES-CTR. However it cannot be parallelised, nor can the keystream be jumped to an arbitrary location. So in pretty much every situation AES-CTR is a better choice than AES-OFB. However, if you are required to use AES-OFB by a pre-specified protocol then you may still need it. I have added AES-OFB to WjCryptLib.

Public domain C source code for AES-OFB:

These depend on the AES block cipher files:

Parallelising AES-CTR with OpenMP

In order for AES (or any block cipher) to be particularly useful you generally need to use it in a “mode of operation” which allows it to work with data much larger than a single 128-bit block. Some modes of operation use the underlying block cipher to create a stream cipher. Both AES-OFB and AES-CTR are stream ciphers that use the output of the block encryption to produce 16 bytes of stream output at a time. They operate very similarly and when single-threaded are the same speed.

AES-OFB keeps on applying the AES encryption to the same block to generate a new block each time, whereas AES-CTR applies AES encryption to a “counter” block. Cryptanalysis has shown both methods to be equally secure. AES-CTR has two huge advantages over AES-OFB, which is why in my opinion it should always be used in preference.

The first advantage is that you can jump to any point in the output stream without having to generate the preceding bytes. This would be useful if you were appending to a file, for example.

The second advantage is that because each block is calculated independently of other blocks, the entire process can be parallelised. If 1 MB of data is to be encrypted with AES-CTR then on a quad-core processor the task can be split into four 256 kB chunks, each encrypted in a separate thread. It is not possible to parallelise AES-OFB as each block requires the previous block to have been calculated first.
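Here is a sketch of why that works (BlockEncrypt is again an illustrative stand-in, and the exact counter layout varies between implementations; this sketch treats the low 8 bytes of the IV as a big-endian block counter). Each 16-byte keystream block depends only on the key, the IV, and the block index, so any block can be produced on its own, in any order, on any thread.

#include <stdint.h>
#include <string.h>

#define BLOCK_SIZE 16

/* Illustrative stand-in for the AES single-block encrypt. */
void BlockEncrypt( void const* key, uint8_t const in[BLOCK_SIZE], uint8_t out[BLOCK_SIZE] );

/* Produce the CTR keystream block for an arbitrary block index.
   Nothing here depends on any other block, so calls are order-independent. */
void CtrKeystreamBlock( void const* key, uint8_t const iv[BLOCK_SIZE],
                        uint64_t blockIndex, uint8_t keystream[BLOCK_SIZE] )
{
    uint8_t  counter[BLOCK_SIZE];
    uint64_t ctr = 0;
    int      i;

    memcpy( counter, iv, BLOCK_SIZE );
    for( i=0; i<8; i++ )                          /* read the low 8 bytes as a big-endian counter */
    {
        ctr = (ctr << 8) | counter[8 + i];
    }
    ctr += blockIndex;                            /* jump straight to the requested block */
    for( i=0; i<8; i++ )                          /* write the counter back */
    {
        counter[BLOCK_SIZE - 1 - i] = (uint8_t)( ctr >> (8 * i) );
    }
    BlockEncrypt( key, counter, keystream );
}

Keystream block b is then XORed onto the b-th 16 bytes of data, and because no block depends on the one before it, a parallel for loop can hand different ranges of b to different cores.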

Parallelising an algorithm such as AES-CTR is a bit different from regular multithreading. If the AES-CTR library had to be responsible for the threading it would take on quite a management burden, and the overhead of bringing up threads and closing them each time it is called could outweigh the improvement in speed. Alternatively, providing a set of interfaces to allow the caller to provide the multithreaded environment could be quite cumbersome and tricky to use. Fortunately there is a standard that is perfect for this job: OpenMP.

OpenMP is a standard that is implemented in many C compilers (including, surprisingly, Microsoft Visual Studio). It allows code to be marked up with special pragmas so that it is parallelised when built with OpenMP support, yet still runs correctly in a single thread without OpenMP.

The following mark-up will cause the for loop to be parallelised with OpenMP. The for loop will run on as many threads as there are processing cores (by default; this can be changed). Each thread will run a smaller subset of the range 0 to numIterations. Without OpenMP this will just run as normal over the entire range.

#ifdef _OPENMP
    #pragma omp parallel for
#endif
for( i=0; i<numIterations; i++ )
{
    ...
}

There are several extra markups that can be added to control how the threads share data and how the work is scheduled. The #ifdef is not technically needed, as the #pragma will be ignored by compilers that don’t understand it. However some compilers will warn about unknown pragmas, so it can be quieter to #ifdef it.
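For example (a generic illustration, not code from WjCryptLib), clauses on the pragma spell out which variables are shared between threads, which are private to each thread, and how iterations are divided up:

#include <stdint.h>

/* Illustrative only: XOR a pre-generated keystream onto a buffer, block by block. */
void XorKeystream( uint8_t* buffer, uint8_t const* keystream, int numBlocks )
{
    int i;
    int j;

#ifdef _OPENMP
    /* default(none) forces every variable to be listed explicitly. buffer, keystream
       and numBlocks are shared by all threads; j gets a private copy per thread; the
       loop counter i is implicitly private; schedule(static) splits the range evenly. */
    #pragma omp parallel for default(none) shared(buffer, keystream, numBlocks) private(j) schedule(static)
#endif
    for( i=0; i<numBlocks; i++ )
    {
        for( j=0; j<16; j++ )
        {
            buffer[i*16 + j] ^= keystream[i*16 + j];
        }
    }
}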

My AES-CTR implementation without OpenMP gave the following results on a quad-core MacBook Pro running Linux.

AES-CTR (128) 232.33 Mbps
AES-CTR (192) 203.29 Mbps
AES-CTR (256) 177.69 Mbps
RC4 368.89 Mbps

As expected RC4 is the fastest as it is much simpler.

I reworked my AES-CTR implementation to work with OpenMP and the performance increase is considerable.

AES-CTR (128) 730.00 Mbps
AES-CTR (192) 518.44 Mbps
AES-CTR (256) 489.51 Mbps
RC4 363.46 Mbps

This is over 3 times as fast. Interestingly it is not 4 times as fast, despite now running at 100% CPU usage over 4 cores instead of just 1. I assume the reason is that the 4 cores still need to access the same memory controller and that becomes the bottleneck. The memory controller can’t serve all four cores simultaneously at the same speed it can serve just one.

As a side note, when I first wrote this using my original AES implementation, it was so slow that even with the parallelisation it was still outperformed by RC4! This motivated me to change the AES block cipher implementation to a faster one.

The nice thing about OpenMP is that it can just be enabled at build time without a lot of fuss (-fopenmp for gcc and /openmp for MSVC). If it is not enabled the code works fine in a single thread.

There is one disappointing thing about Microsoft’s implementation. It requires the use of a DLL which is not a system DLL. Visual Studio 2017 dynamically links OpenMP executables to VCOMP140.DLL. There is no option to statically link it. Also Apple’s version of clang that comes with Xcode does not support OpenMP.

Public Domain C source code for AES-CTR with optional OpenMP support

This is free and unencumbered software released into the public domain.

 

A faster AES implementation

When I added AES to CryptLib I found a nice compact public domain implementation to adapt. I liked it because it was quite a bit smaller in source code than the majority of implementations I had seen, and the final binary size of the executables was small. However when I wrote a speed test for it I discovered it was incredibly slow; the trade-off for the compactness was speed. In some cases this is desirable, as the code may need to run on a very small processor and doesn’t have to run particularly fast. However I wanted a faster one for CryptLib, so I found a different implementation that is almost 5 times as fast.

Public domain C source code for AES

(Alternative links: CryptLib_Aes.c and CryptLib_Aes.h)

This is free and unencumbered software released into the public domain.

 

Setting up a virtual Big Endian environment

I mentioned previously that it is a good idea to be able to actually test that your code works on a Big-Endian architecture. I recently updated CryptLib and made it compatible with Big-Endian. The way I tested it was to use QEMU to emulate a Big-Endian MIPS processor running Linux. QEMU creates virtual machines, but unlike VMware (which virtualises the computer but leaves instructions running on the hardware) it emulates the CPU. This is great because we can now run an operating system for a different chip. It is however extremely slow.

There is pretty much a Linux distribution for every CPU ever made, so you should be able to install a version of Linux for any chosen chip emulated in QEMU. I chose to use MIPS because I know it is big-endian and it is also the chip used in a bunch of OpenWrt routers, so I figured it would be well supported.

Installing the OS from an installation .iso takes a long time, however I was lucky enough to find that someone had made a QEMU image of a Debian MIPS installation available on the Internet: https://people.debian.org/~aurel32/qemu/mips/

From that site download the two files: the kernel (vmlinux-3.2.0-4-4kc-malta) and the disc image (debian_wheezy_mips_standard.qcow2).

Install QEMU if you haven’t already got it. Then run the following command

qemu-system-mips -M malta -kernel vmlinux-3.2.0-4-4kc-malta -hda debian_wheezy_mips_standard.qcow2 -append "root=/dev/sda1 console=tty0" -nographic -redir tcp:2222::22 -m 256

This tells QEMU to boot up the image with no graphics terminal, and instead output the console to stdout. This means the output will just appear in your terminal rather than coming up as a separate window.

QEMU will by default give this a NAT network interface. The -redir flag says to redirect the host port 2222 to the virtual machine’s internal 22 (SSH) port. This makes it easy to SSH into the guest from another computer on your network by SSHing to port 2222 of your host machine.

When you run this you will see Linux slowly boot up. After a while you will get a login prompt. This image has been set up with the following user accounts:

user (password user) as a standard user account.
root (password root) as root account.

It is a fairly minimal install, but it does have SSH enabled, which for my purposes was sufficient, as that is enough to copy files onto it (using scp) and then execute them.

You can add to this image by installing gcc etc.; however, actually using the VM to build is extremely slow. I did attempt it, but it is not worth the effort. For example, installing cmake required building it from source, which took 24 hours!

Instead of trying to build on the VM, the best approach is to build on your actual computer using a cross-compiler.

So if you are using Linux it is easy; simply run

sudo apt-get install gcc-mips-linux-gnu g++-mips-linux-gnu

to install a MIPS cross compiler.

If you use CMake then it is very easy to use a toolchain file that is set up to use the MIPS compiler. Here is my toolchain file; I call it mips.cmake:

# toolchain for compiling Linux for MIPS

set(CMAKE_C_COMPILER mips-linux-gnu-gcc)
set(CMAKE_CXX_COMPILER mips-linux-gnu-g++)
set(CMAKE_SYSTEM_NAME Linux)
set(CMAKE_SYSTEM_PROCESSOR mips)

# Compiler settings
set( CMAKE_C_FLAGS "-Wall -Werror -Wno-deprecated-declarations" CACHE STRING "" )
set( CMAKE_C_FLAGS_DEBUG "-D_DEBUG -g" CACHE STRING "" )
set( CMAKE_C_FLAGS_RELEASE "-DNDEBUG -O3" CACHE STRING "" )
 
set( CMAKE_CXX_FLAGS ${CMAKE_C_FLAGS} CACHE STRING "" )
set( CMAKE_CXX_FLAGS_DEBUG ${CMAKE_C_FLAGS_DEBUG} CACHE STRING "" )
set( CMAKE_CXX_FLAGS_RELEASE ${CMAKE_C_FLAGS_RELEASE} CACHE STRING "" )

if (NOT CMAKE_BUILD_TYPE)
 message(STATUS "No build type selected, default to Debug")
 set( CMAKE_BUILD_TYPE "Debug" CACHE STRING "" )
endif()

# Set install location
set( CMAKE_INSTALL_PREFIX "${CMAKE_SOURCE_DIR}/bin/${CMAKE_BUILD_TYPE}/LinuxMips" CACHE STRING "")

Most of this is just the other settings I use. The important part for cross-compiling is the first block, where it sets up the compilers.

The following cmake command generates a build system for a project and then builds it:

cmake -H. -Bbuild/mips -DCMAKE_TOOLCHAIN_FILE=mips.cmake
cmake --build build/mips --target install

This will generate the build system in the directory build/mips.

Once your program is built you can scp it over to the MIPS VM and run it. I used this to test CryptLib. The MD5 and SHA1 algorithms were originally broken on Big-Endian and the test vectors failed until I fixed them up.

 

Why care about Big Endian?

Since the demise of PowerPC and SPARC, there are not a lot of Big Endian machines around for most people to access. These days, apart from some large mainframes, the only Big Endian machines left tend to be network routers using MIPS (and even then a lot of MIPS devices are run in Little Endian mode).

Why should you care about Big Endian? Probably for the most part you don’t ever need to. It seems likely that Little Endian has won. However you may not wish to limit yourself. You never know when your project needs to be ported to a different system that you didn’t anticipate.

When I started my programming job 15 years ago the only thing my company cared about was 32-bit Windows. Windows 2000 was the OS of choice. Mac and Linux were irrelevantly small and mobile platforms did not exist. Even 64-bit x64 had not shown up yet. Coding was easy, and fully centred around Windows APIs. Unfortunately this led to some poor assumptions and bad programming practices. A few years later we needed to support the new x64 (back when it was called AMD64). That took about 6 months of updating our code base. There were various places where assumptions had been made that a void* (PVOID in Windows.h terms) could be happily exchanged with a DWORD or ULONG (Windows terms for 32-bit unsigned ints).

After we had cleared the 64-bit hurdle we were then set for many years. Cross-platform compatibility to us meant Windows with x86 or x64. Then along came the mobile platforms running on ARM chips. The big players were iOS and Android, both of which have a Unix ancestry. Before these were relevant platforms for us to code on we had started new projects and had anticipated that we would need to be cross-platform in the future. We wrote a large compatibility library rather than directly using Windows.h and its APIs. We did however get this quite wrong. Our compatibility library was all based around the way things work in Windows, and when we finally came to start porting it to other platforms we discovered that again we had made assumptions that didn’t hold. Now, in 2017, we build for Windows, MacOS, Linux, iOS, and Android with x86, x64, Arm, and Arm64 architectures. Build time has certainly increased! We still have to live with some of the early poor assumptions we made (such as assuming that using Windows wide chars (UTF-16) would future-proof us!) but overall we now have a pretty good handle on working across multiple platforms and architectures.

My point after this long ramble is that you never know where your project will go and what kind of things will be important in the future. Maybe in 10 years’ time the popular new computers will all be Big Endian. So you might just want to make sure your code works on one now, to save yourself headaches down the line.

Some people say you should never worry about Endianness and always write in an Endian-neutral manner. This may be good advice, but it is incredibly easy to slip up. It only takes writing a number to disc straight out of memory to make the code no longer Endian-neutral. So without actually being able to test it, how will you know for sure?

There are two ways of dealing with Endianness that I can see:

  1. Never copy memory bytes into words, e.g. never do the following: memcpy( &my32BitValue, myByteArray, sizeof(uint32_t) ). Instead always deal with the data a byte at a time and assemble your 32-bit value with bit shifts and ORs (see the sketch after this list).
  2. Just wrap everything in htons and ntohs style wrappers before using them.
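
A minimal sketch of the first approach (the function names here are just for illustration): assembling the value from individual bytes gives the same result whatever the CPU’s byte order, whereas the memcpy version silently depends on it.

#include <stdint.h>
#include <string.h>

/* Endian-neutral: read a 32-bit little-endian value from a byte buffer,
   regardless of the byte order of the machine running the code. */
uint32_t ReadUint32Le( uint8_t const* bytes )
{
    return  (uint32_t)bytes[0]
         | ((uint32_t)bytes[1] << 8)
         | ((uint32_t)bytes[2] << 16)
         | ((uint32_t)bytes[3] << 24);
}

/* Endian-dependent: the value you get depends on the CPU the code runs on. */
uint32_t ReadUint32Native( uint8_t const* bytes )
{
    uint32_t value;
    memcpy( &value, bytes, sizeof(uint32_t) );
    return value;
}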

You will still want to test that it actually worked though. This is the rather tricky part as you most likely won’t have access to a Big Endian machine. However it is possible to emulate one using QEMU. In the next article I will give some concise instructions to get you up and going with an emulated MIPS Linux setup.