Author: wyncliffe

Pushing old data off a disc

I often like to clean out deleted data from discs, especially ones that are going to be recycled and used by other people. The problem I find is that secure wiping programs are just too slow. I don’t need that level of protection, I just want to quickly write over every block on the disc.

Recently I was wiping a large (1TB) drive. It had been formatted and I just wanted to fill up all the blocks on it with a huge file and then delete the file. This way I would be fairly confident that every block on the disc had been overwritten. The fastest way is to copy /dev/zero onto a file on the disc. However I never feel confident that writing zeros actually overwrites anything. It would be very easy for the underlying device to simply mark a block as all zero rather than physically writing it. I believe the old Iomega Zip discs did something like this.

Instead of using /dev/zero the obvious solution is to use /dev/random or /dev/urandom. However these are far slower as they generate cryptographically secure random numbers. This is overkill for what I was trying to do. I just wanted to ensure that something was written. In the end I opted for making a big file with /dev/random and then appending it over and over until the disc was full.

It seemed unlikely that the device would be able to detect repeated blocks being written but it still niggled at me. Also it was not a particularly convenient method. I just wanted to run something and leave it until the disc was full. So I wrote a small tool called PushFill. This will keep writing data to a file until it runs out of space.

This uses RC4 to create a random 10 MB block of data which it writes to the file. It then writes the same 10 MB another 255 times, each time with every byte incremented by 1. After 256 writes (2.5 GB) it starts again with another 10 MB block from the RC4 stream. This way the RC4 generator is only used for a small percentage of the time and therefore does not slow down the writing. The cost of incrementing each byte in the block by 1 is barely noticeable.

The advantage of this method is that it is very fast, while still making every single block written different. Therefore the underlying system cannot do any smart cheats such as noticing repeated blocks (think of how Dropbox works, where each unique block is hashed and only physically stored once ever). Additionally the output of RC4 prevents any disc compression from using fewer physical blocks to store the data.

The syntax is simple:

PushFill <filename>

This will create or append to the specified filename. It will keep on writing until the disc is full, or the program is aborted (Ctrl-C).

Every two seconds the program will display how much it wrote in that time along with its rate. It will also display the total amount written so far and the average rate.

A sample output:

Block:    1.9 GB  Rate:  948.1 MBps  |  Total:    1.9 GB  AvgRate:  948.1 MBps
Block:    2.1 GB  Rate:    1.1 GBps  |  Total:    4.0 GB  AvgRate: 1017.2 MBps
Block:    2.3 GB  Rate:    1.1 GBps  |  Total:    6.3 GB  AvgRate:    1.0 GBps


Some systems will cache writes, so the first few seconds will show a much higher rate than it is actually writing to the disc.

Compiled binaries for the program are available here. The package contains binaries for Windows x64, MacOS x64, Linux x64, and Linux Arm (e.g. a Raspberry Pi).

Full source code available on GitHub here.

This is free and unencumbered software released into the public domain.




I previously added AES-CTR to my library WjCryptLib. AES-CTR is by far the best way to use AES as a stream cipher. However it was not the first mode of operation devised for turning a block cipher into a stream cipher. Output Feedback Mode (OFB) was one of the original modes of operation specified for early block ciphers such as DES. The way OFB works is to start with an IV the size of a block and repeatedly encrypt it. Each encryption produces another block’s worth of stream bytes.

When running as a single thread AES-OFB is exactly the same speed as AES-CTR. However it cannot be parallelised, nor can the stream be synced to an arbitrary location. So in pretty much every situation AES-CTR is a better choice than AES-OFB. However if a pre-specified protocol requires AES-OFB then you will need an implementation, so I have added AES-OFB to WjCryptLib.

Public domain C source code for AES-OFB:

These depend on the AES block cipher files:

Parallelising AES-CTR with OpenMP

In order for AES (or any block cipher) to be particularly useful you generally need to use it in a “mode of operation” which allows it to work with much larger data than a single 128 bit block. Some modes of operation use the underlying block cipher to create a stream cipher. Both AES-OFB and AES-CTR are stream ciphers that use the outputs of the block encryption to produce 16 bytes of stream output at a time. They operate very similarly and when single threaded are the same speed.

AES-OFB keeps on applying the AES encryption to the same block to generate a new block each time, whereas AES-CTR applies AES encryption to a “counter” block. Cryptanalysis has shown both methods to be equally secure. AES-CTR has two huge advantages over AES-OFB, which is why in my opinion it should always be preferred.

The first advantage is that you can jump to any point in the output stream without having to generate preceding bytes. This would be useful if you were appending to a file for example.

The second advantage is that because each block is calculated independently of other blocks, the entire process can be parallelised. If 1 MB of data is to be encrypted with AES-CTR then on a quad core processor the task can be split into four chunks of 256 kB, each encrypted in a separate thread. It is not possible to parallelise AES-OFB as each block requires the previous block to have been calculated first.

Parallelising an algorithm such as AES-CTR is a bit different from regular multithreading. If the AES-CTR library had to be responsible for the threading it would take on quite a burden of management, and the overhead of bringing up threads and closing them each time it is called could outweigh the improvements in speed. Alternatively, providing a set of interfaces to allow the caller to provide the multithreaded environment could be quite cumbersome and tricky to use. Fortunately there is a standard that is perfect for this job: OpenMP.

OpenMP is a standard that is implemented in many C compilers (including, surprisingly, Microsoft Visual Studio). This allows functions to be marked-up using special pragmas that will allow them to be parallelised when built with OpenMP support, and also run correctly in a single thread without OpenMP.

The following mark-up will cause the for loop to be parallelised with OpenMP. The for loop will run on as many threads as there are processing cores (by default; this can be changed). Each thread will run a smaller subset of the range 0 to numIterations. Without OpenMP this will just run as normal over the entire range.

#ifdef _OPENMP
    #pragma omp parallel for
#endif
for( i=0; i<numIterations; i++ )

There are several extra markups that can be added that control how the threads share data etc. The #ifdef is not technically needed as the #pragma will be ignored by compilers that don’t understand it. However some compilers will warn about unknown pragmas so it can be quieter to #ifdef it.

My AES-CTR implementation without OpenMP gave the following results on a quad core MacBookPro running Linux.

AES-CTR (128) 232.33 Mbps
AES-CTR (192) 203.29 Mbps
AES-CTR (256) 177.69 Mbps
RC4 368.89 Mbps

As expected RC4 is the fastest as it is much simpler.

I reworked my AES-CTR implementation to work with OpenMP and the performance increase is considerable.

AES-CTR (128) 730.00 Mbps
AES-CTR (192) 518.44 Mbps
AES-CTR (256) 489.51 Mbps
RC4 363.46 Mbps

This is over 3 times as fast. Interestingly it is not 4 times as fast, despite now running at 100% CPU usage over 4 cores instead of just 1. I assume the reason is that the 4 processors still need to access the same memory controller and that becomes the bottleneck. The memory controller can’t serve all four cores simultaneously at the same speed it can serve just one.

As a side note, when I first wrote this using my original AES implementation, it was so slow that even with the parallelisation it was still outperformed by RC4! This motivated me to change the AES block cipher implementation to a faster one.

The nice thing about OpenMP is that it can just be enabled at build time without a lot of fuss (-fopenmp for gcc and /openmp for MSVC). If it is not enabled the code works fine in a single thread.

There is one disappointing thing about Microsoft’s implementation. It requires the use of a DLL which is not a system DLL. Visual Studio 2017 dynamically links OpenMP executables to VCOMP140.DLL. There is no option to statically link it. Also Apple’s version of clang that comes with Xcode does not support OpenMP.

Public Domain C source code for AES-CTR with optional OpenMP support

This is free and unencumbered software released into the public domain.


A faster AES implementation

When I added AES to CryptLib I found a nice compact public domain implementation to adapt. I liked it because it was quite a bit smaller in source code than the majority of implementations I had seen, and the final binary size of the executables was small. However when I wrote a speed test for it I discovered it had incredibly poor speed. The trade-off for the compactness was speed. In some cases this is desirable, as the code may need to run on a very small processor and may not have to run particularly fast. However I wanted a faster one for CryptLib so I found a different implementation that is almost 5 times as fast.

Public domain C source code for AES

(Alternative links: CryptLib_Aes.c and CryptLib_Aes.h)

This is free and unencumbered software released into the public domain.


Setting up a virtual Big Endian environment

I mentioned previously that it is a good idea to be able to actually test your code works on a Big-Endian architecture. I recently updated CryptLib and made it compatible with Big-Endian. The way I tested it was to use QEMU to emulate a Big-Endian MIPS processor running Linux. QEMU creates virtual machines, but unlike VMware (which virtualises the computer but leaves instructions running on the hardware) it emulates the CPU. This is great because we can now run an operating system for a different chip. It is however extremely slow.

There is pretty much a Linux distribution for every CPU ever made, so you should be able to install a version of Linux for any chosen chip emulation in QEMU. I chose to use MIPS because I know it is big-endian and is also the chip used in a bunch of OpenWrt routers, so I figured it would be supported.

Installing the OS from an installation .iso takes a long time, however I was lucky enough to find that someone had made a QEMU image of a Debian MIPS installation available on the Internet:

From that site download the two files:

Install QEMU if you haven’t already got it. Then run the following command

qemu-system-mips -M malta -kernel vmlinux-3.2.0-4-4kc-malta -hda debian_wheezy_mips_standard.qcow2 -append "root=/dev/sda1 console=tty0" -nographic -redir tcp:2222::22 -m 256

This tells QEMU to boot up the image with no graphics terminal, and instead output the console to stdout. This means the output will just appear in your terminal rather than coming up as a separate window.

QEMU will by default give this a NAT network interface. The -redir flag says to redirect the host port 2222 to the virtual machine’s internal 22 (SSH) port. This makes it easy to SSH into the guest from another computer on your network by SSHing to port 2222 of your host machine.

When you run this you will see Linux slowly boot up. After a while you will get a login prompt. This image has been set up with the following user accounts:

user (password user) as a standard user account.
root (password root) as root account.

It is a fairly minimal install, but it does have SSH enabled, which for my purposes was sufficient, as that is enough to copy files onto it (using scp) and then to execute them.

You can add to this image by installing gcc etc. However actually using the VM to build is extremely slow. I did attempt to do it, but it is not worth the effort. For example, installing cmake required building it from source. That took 24 hours!

Instead of trying to build on the VM, the best approach is to build on your actual computer using a cross-compiler.

If you are using Linux then it is easy: simply run

sudo apt-get install gcc-mips-linux-gnu g++-mips-linux-gnu

to install a MIPS cross compiler.

If you use CMake then it is very easy to use a toolchain file that is set up to use the MIPS compiler. Here is my toolchain file. I call it mips.cmake

# toolchain for compiling Linux for MIPS

set( CMAKE_C_COMPILER mips-linux-gnu-gcc )
set( CMAKE_CXX_COMPILER mips-linux-gnu-g++ )

# Compiler settings
set( CMAKE_C_FLAGS "-Wall -Werror -Wno-deprecated-declarations" CACHE STRING "" )

Most of this is just my usual settings. The important part for cross compiling is the block where it sets up the C and C++ compilers.

The following cmake command generates a build system for a project and then builds it:

cmake -H. -Bbuild/mips -DCMAKE_TOOLCHAIN_FILE=mips.cmake
cmake --build build/mips --target install

This will generate the build system in the directory build/mips.

Once your program is built you can scp it over to the MIPS VM and run it. I used this to test CryptLib. The MD5 and SHA1 algorithms were originally broken on Big-Endian and failed their test vectors until I fixed them up.


Why care about Big Endian?

Since the demise of PowerPC and SPARC, there are not a lot of Big Endian machines around for most people to access. These days, apart from some large mainframes, the only Big Endian machines left tend to be network routers using MIPS (and even then a lot of MIPS devices are run in Little Endian mode).

Why should you care about Big Endian? Probably for the most part you don’t ever need to. It seems likely that Little Endian has won. However you may not wish to limit yourself. You never know when your project needs to be ported to a different system that you didn’t anticipate.

When I started my programming job 15 years ago the only thing my company cared about was Windows 32 bit. Windows 2000 was the OS of choice. Mac and Linux were irrelevantly small and mobile platforms did not exist. Even 64 bit x64 had not shown up yet. Coding was easy, and fully centred around Windows APIs. Unfortunately this led to some poor assumptions and bad programming practices. A few years later we needed to support the new x64 (back when it was called AMD64). That took about 6 months of updating our code base. There were various places where assumptions had been made that a void* (PVOID in Windows.h terms) could be happily exchanged with a DWORD or ULONG (Windows terms for 32 bit unsigned ints).

After we had cleared the 64 bit hurdle we were then set for many years. Cross platform compatibility to us meant Windows with x86 or x64. Then along came the mobile platforms running on ARM chips. The big players were iOS and Android, both of which have a Unix ancestry. Before these were relevant platforms for us to code on we had started new projects and had anticipated that we would need to be cross platform in the future. We wrote a large compatibility library rather than directly using Windows.h and its APIs. We did however get this quite wrong. Our compatibility library was all based around the way things work in Windows, and when we finally came to start porting it to other platforms we discovered that again we had made assumptions that didn’t hold. Now, in 2017, we build for Windows, MacOS, Linux, iOS, and Android with x86, x64, Arm, and Arm64 architectures. Build time has certainly increased! We still have to live with some of the early poor assumptions we made (such as assuming that using Windows wide chars (UTF-16) would future-proof us!) but overall we now have a pretty good handle on working across multiple platforms and architectures.

My point, after this long ramble, is that you never know where your project will go and what kind of things will be important in the future. Maybe in 10 years’ time the new popular computers will all be Big Endian. So you might just want to make sure your code works on them to save yourself headaches down the line.

Some people say you should never worry about Endianness and always write in an Endian-neutral manner. This may be good advice, but it is incredibly easy to slip up. It just takes writing a number to disc straight out of memory to make the code no longer Endian-neutral. So without actually being able to test it, how will you know for sure?

There are two ways of dealing with Endianness that I can see:

  1. Never copy memory bytes directly into words. eg never do the following: memcpy( &my32BitValue, myByteArray, sizeof(uint32_t) ). Instead always deal with one byte at a time and assemble your 32 bit value with bit shifts and ORs.
  2. Just wrap everything in htons and ntohs style wrappers before using them.

You will still want to test that it actually worked though. This is the rather tricky part as you most likely won’t have access to a Big Endian machine. However it is possible to emulate one using QEMU. In the next article I will give some concise instructions to get you up and going with an emulated MIPS Linux setup.


CryptLib 2.0.0

After a four year gap, I have updated CryptLib at GitHub. I have now released version 2.0.0.


  • Added AES and AES-CTR modules. AES-CTR conforms to the same counter mode used with AES in OpenSSL.
  • All algorithms now work on Big-Endian architectures.
  • Now uses CMake for building rather than make files and Visual Studio projects. CMake will generate whatever system is required.
  • Input function parameters are now marked const.
  • File names have been changed to have the prefix CryptLib_ rather than Lib.
  • Various formatting changes to the files.

This is free and unencumbered software released into the public domain.


When I was writing this blog four years ago I wanted to write cross-platform C code that was simple to build. At the time I settled on a dual approach of using a make file for non-windows and a set of Visual Studio 2010 projects and solution files for Windows building. I’ve never liked the dual approach but at the time I did not have anything better.

These days I use CMake as my build system. To be technically correct CMake is not a build system, it is a builder of build systems. This is what makes it great. Rather than being locked into a particular system, you write all your build files and dependencies in a language separate from the build system. CMake then will generate the build files for your chosen build system.

Something that CMake is particularly good at is “out-of-source” builds, something that Visual Studio is awful at by default. The standard method of Visual Studio is to scatter all the build artefacts throughout your source tree. This is particularly annoying for source control (you end up with tons of rules in .gitignore to avoid committing binaries and temporary files), and it also makes it hard to get a clean build as you never know for sure what leftovers may be lurking around affecting the build. With CMake you can just delete your build tree at any time and generate a new one. You can also have many build trees in parallel. For example you may wish to build using both Visual Studio and MinGW.

By default CMake will generate files with the most obvious builder it can find on your system. So if you are running Windows and have Visual Studio it will use that. If you are on Linux it will probably choose gcc and make. However you can tell it to generate a specific build system, and there is a large list of supported ones. In particular for Linux and MacOS I typically like it to build using Ninja which is lightning fast compared to make. Occasionally I get it to generate Xcode projects for MacOS.

Something else I particularly like about CMake is the concept of toolchain files that can be specified on the command line using the (not especially elegant) syntax of:

-DCMAKE_TOOLCHAIN_FILE=<toolchain-file>

CMake will run this file first before attempting to locate the compilers for your system. With this you can set up cross-compilers or other custom settings without cluttering up your in-source CMake files. Also you don’t want to put hard-coded paths and settings into a project that you are planning to share.

So for example I like to keep my CMakeLists.txt files as simple and clean as possible and put all my custom stuff in my dev environment outside of the source trees. So if someone builds one of my projects it should always work on their system as it will just use the defaults. For example if they are using Visual Studio it will build using fairly standard Visual Studio settings. I don’t actually like Visual Studio default builds; in particular I don’t like their dependence on Visual Studio’s own C runtimes that you then have to distribute. So I use a toolchain file that builds using WinDDK 7.1 which links to MSVCRT.DLL. I also have other toolchain files for building on all the other platforms I want, including MinGW, Linux, MacOS, and even a Big-Endian MIPS Linux system!

I have made a sample “Hello World” project that can be built with CMake. A zip file containing the source is here

Assuming you have CMake installed and in the path then simply open a terminal at the root of the sample directory and run the following

cmake -H. -Bbuild -DCMAKE_INSTALL_PREFIX=bin
cmake --build build --target install

This will create the build system and build the binaries and place them in the subdirectory bin

The first command generates the build system for your platform. On Windows this will probably be Visual Studio, on Linux this will probably be gcc and make. The parameter -H. says “the source starts here”. -Bbuild says “put the build system into the subfolder build”. The -DCMAKE_INSTALL_PREFIX=bin says “when building with --target install, put the files in the subfolder bin”.

The second command says “build using the build system in subdirectory build and place the executables in the install location specified before (bin)”

It is highly recommended to always perform an “out-of-source” build. You can however build “in-source” with the following simpler commands:

cmake .
cmake --build .

This will however place the build files all over the source directory and is generally considered messy. However if you just want to get the binaries built with the least fuss, and are not planning to actually use the source tree for coding, then this is the quickest way. In this example we did not specify an install location. The binaries will be built in the source tree and not copied to an install location.

I have since converted all my previous projects to using CMake. I will soon be publishing a newer version of CryptLib that uses CMake (and also has the new AES and AES-CTR cipher in it).


AES as a stream cipher in counter mode (AES-CTR)

In the previous article I introduced the AES block cipher. This, however, is not too much use on its own. AES encrypts a single 128 bit (16 byte) block. In almost every situation there will be a requirement to encrypt a much larger amount of data. This is solved by using AES in a “mode of operation”.

Several modes exist for using the AES block cipher on larger data sets, several of which have been specified by NIST as standards. The first and most obvious one is known as AES-ECB (Electronic Code Book). This should really not exist as a standard, as it is a terrible idea to use in almost any situation. Basically it just divides the larger data set into 16-byte chunks and encrypts each block. The problem with this is that any time the same block of input appears, exactly the same output will be produced. For example if you had a file with large runs of all zeros in it, then you would see a repeating pattern of 16 bytes recurring in the output. So never use ECB, there are plenty of good modes to choose from.

Today I am going to talk about AES-CTR (AES in “counter” mode). This is actually my favourite mode. It turns AES into a stream cipher which is very easy to work with, and it can be used on any size of data. It is also more convenient to use than RC4 because you can jump to any offset within the output stream and start from there. With RC4 you would have to “spin” through all the previous stream to get to a specific point.

AES-CTR is a very simple mode. Basically you have a counter which you store in a 128 bit block and you then encrypt that block with AES; that is your first 16 bytes of stream output. You then increment the counter and encrypt it to produce the second block, and so on. To add another level of convenience and speed, the counter block is also used as an “initialisation vector” (IV). This allows someone to create an entirely new stream without having to reinitialise the AES key (a somewhat expensive procedure).

NIST don’t actually specify how you should combine an IV and the counter, but suggest a few ways. The method I like best is to split the block into two halves. The first half is the 64 bit IV. The second half is the 64 bit block counter (stored in big-endian form). This is the method that openssl uses, and the method I chose to use. The advantage of this method over XORing the IV and counter together is that it is guaranteed that you will never “overlap” streams by having particular chosen IVs and counters. Remember that this is a stream cipher, and just like RC4, if you ever reuse the stream you have removed all security. However with AES-CTR it is only necessary for the combination of BOTH key and IV to be unique. Therefore the same key can be reused with unique IVs. Similarly you may not want to bother with the IV at all and only use unique keys. Personally if I am in a situation where I will be generating new keys for each use, I will generate a key 64 bits larger than AES needs and use the last 64 bits as the IV.

Public Domain source code for AES-CTR

These depend on the AES block cipher files:

I have made a simple program called AesCtrOutput that takes a key and an IV in hex form and produces the specified number of bytes of stream output in hex on stdout.


The syntax for this program is pretty simple:

 AesCtrOutput <Key> <IV> <NumBytes>
 <Key> - 128, 192, or 256 bit written as hex
 <IV> - 64 bit written as hex
 <NumBytes> - Number of bytes of stream to output


>AesCtrOutput.exe 0102030405060708a1a2a3a4a5a6a7a8 b1b2b3b4b5b6b7b8 48

This result can be confirmed as being the same as openssl by using the following command:

> openssl enc -aes-128-ctr -K 0102030405060708a1a2a3a4a5a6a7a8 -iv b1b2b3b4b5b6b7b8 -in zero.bin -out output.bin

If zero.bin is a file containing 48 zero bytes then output.bin will be created as 48 bytes with the same bytes as in the AesCtrOutput.exe example.

I have made a .zip file containing all the source for AES and AESCTR along with those two sample programs and a unit test set. The root of the folder contains a CMakeLists.txt file for building with CMake.

This is free and unencumbered software released into the public domain.


AES Block Cipher

One of the things I meant to do when I started the WaterJuice blog 4 years ago was to build up library of Cryptographic functions. I got as far as putting in some of the common hashes of MD5, SHA1, and SHA2. And I also had RC4 encryption in there. But then got no further. I always meant to put in block ciphers. So today finally I add AES.

AES is a block cipher that operates on 128 bit blocks. A block cipher is a keyed permutation on a block: every possible value of a 128 bit block is mapped with a 1-1 correspondence to another block for any given key. AES takes 3 sizes of keys: 128 bit, 192 bit, and 256 bit.

AES was announced by NIST at the end of 2001, after a five year process of selecting a new cipher to replace DES, which was woefully inadequate by this time. AES is a subset of the Rijndael cipher, which was one of several algorithms that competed to become the new standard. NIST did not make any changes to Rijndael except to restrict the possible set of key sizes and block sizes. Rijndael works on block sizes and key sizes between 128 and 256 bits that are multiples of 32 bits. AES formalises a single block size and 3 key sizes.

In this article I will introduce public domain C code for encrypting and decrypting a block using the AES cipher. This is a basic building block for many other operations. In general AES on its own is not terribly useful, as the chances are you’ll want to encrypt something far larger than a single block; I will talk about modes of operation in a subsequent article. For now I will just present the block cipher primitive.

Before proceeding I’d like to answer the question “why use AES?”. The best answer is that it is a safe bet. As the saying went, “no one got fired for choosing IBM”; the same can be said about choosing AES. Because it is the standard, it is also the most scrutinised algorithm, and it is still standing strong 16 years later.

Public domain C source code for AES

The following is a sample program that uses that code to encrypt or decrypt a single block specified by hex on a command line.


This has a simple syntax of:

AesBlock <HexKey> <HexBlock> [-D]

HexKey must be 128, 192, or 256 bits long, and HexBlock must be 128 bits long. If the -D flag is specified then decryption is performed.


waterjuice$ AesBlock.exe 2b7e151628aed2a6abf7158809cf4f3c 6bc1bee22e409f96e93d7e117393172a