Month: June 2013

Coins

I recently decided to collect coins. In particular I want to collect one of every type of circulated $1 Australian coin. According to the Royal Australian Mint, there are 17 commemorative $1 coin designs in circulation, and 15 different years for the standard "kangaroos" design, so a complete set of one of each would make 32 coins. I later discovered there is an error on the RAM website: there is in fact no 1992 standard $1 coin in circulation. I am not counting all the special non-circulated coins, as dozens of those designs are made each year; I wanted ones that could be collected just from regular change.

Getting the first few coins is easy, as there is a high chance that each coin you get is one you haven't already got. Getting that last coin will of course be the hardest, as it is far more likely that any subsequent coin will be one of the ones already acquired. I was curious about how many coins you would expect to go through to complete the set. Unfortunately it has been too many years since I studied maths and I couldn't come up with the appropriate equations. Instead I decided to work it out by brute force, using the computer as a handy calculator.

For my initial trial I assumed that there was an equal distribution of coins of each design, and that there were 32 designs. I wrote a program that just randomly picked numbers between 0 and 31 and placed them in "buckets". Each time it added a coin to a previously empty bucket it printed out the number of coins it had selected up until that point.

I ran this program a few times and it gave some good results, but of course these were individual random trials. I wanted the statistics of the number of coins required to complete the set, so I adapted the program to run multiple trials. Half a million trials in fact. It calculated the mean and standard deviation of the number of coins required to fill n buckets. The random number generator I used was an RC4 stream, which produces good statistical randomness. The results it printed out are:

1 after 1.0 coins  (sd: 0.0)
2 after 2.0 coins  (sd: 0.2)
3 after 3.1 coins  (sd: 0.3)
4 after 4.2 coins  (sd: 0.5)
5 after 5.3 coins  (sd: 0.6)
6 after 6.5 coins  (sd: 0.8)
7 after 7.8 coins  (sd: 0.9)
8 after 9.0 coins  (sd: 1.1)
9 after 10.4 coins  (sd: 1.3)
10 after 11.8 coins  (sd: 1.5)
11 after 13.2 coins  (sd: 1.7)
12 after 14.7 coins  (sd: 1.9)
13 after 16.3 coins  (sd: 2.2)
14 after 18.0 coins  (sd: 2.4)
15 after 19.8 coins  (sd: 2.7)
16 after 21.7 coins  (sd: 3.0)
17 after 23.7 coins  (sd: 3.3)
18 after 25.8 coins  (sd: 3.6)
19 after 28.1 coins  (sd: 4.0)
20 after 30.6 coins  (sd: 4.4)
21 after 33.2 coins  (sd: 4.9)
22 after 36.1 coins  (sd: 5.5)
23 after 39.3 coins  (sd: 6.1)
24 after 42.9 coins  (sd: 6.8)
25 after 46.9 coins  (sd: 7.6)
26 after 51.5 coins  (sd: 8.6)
27 after 56.8 coins  (sd: 9.9)
28 after 63.2 coins  (sd: 11.5)
29 after 71.2 coins  (sd: 13.7)
30 after 81.9 coins  (sd: 17.0)
31 after 97.9 coins  (sd: 23.0)
32 after 129.9 coins  (sd: 39.0)

This table shows that if you have an even distribution of 32 types of coins, then the mean number of coins required to collect a complete set is about 130, so there is roughly a 50% chance of getting a complete set from 130 coins. Using the standard deviation we can make some more predictions. Using the old 68-95-99.7 rule from statistics, there is roughly a 95% chance that a complete set will be found between 52 and 208 coins. Or, another way of looking at it, if you collect 208 coins there is only about a 2.5% chance of not getting a complete set.
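
As a cross-check (not part of the original program), this is the classic "coupon collector" problem, which does have a closed form: with n equally likely designs the mean number of coins is n times the sum of 1/k for k from 1 to n, and the variance is n^2 times the sum of 1/k^2, minus the mean. A minimal C sketch of that calculation, for comparison with the simulated table:

// Closed-form check of the simulation above (coupon collector problem).
// For n = 32 this prints a mean of about 129.9 coins and a standard
// deviation of about 39.0, matching the simulated results.

#include <math.h>
#include <stdio.h>

int main( void )
{
    const int   n = 32;
    double      mean = 0.0;
    double      sumInverseSquares = 0.0;
    double      variance;
    int         k;

    for( k=1; k<=n; k++ )
    {
        mean += (double)n / (double)k;
        sumInverseSquares += 1.0 / ( (double)k * (double)k );
    }

    variance = ( (double)n * (double)n * sumInverseSquares ) - mean;

    printf( "mean: %.1f coins  (sd: %.1f)\n", mean, sqrt(variance) );

    return 0;
}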

I got a bag of 20 $1 coins from the bank; it had 15 unique coins in it, which was bang on for the stats shown in the table. However the coins are not uniformly distributed. The first $1 coin was minted in 1984 and 186.3 million were produced, which represents slightly over 1 in 5 of all $1 coins put into circulation. I also learnt that there are in fact only 31 designs to collect (because of the missing 1992 coin). The number of coins produced with each design varies greatly; the rarest is a 2003 coin of which only 4.1 million were made. I'm going to update the program to use the proportions of coins in circulation for a more accurate estimation, as sketched below. Although this reminded me of an interesting thing about distribution of random numbers which I want to talk about in the next article.
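
I haven't written the weighted version yet, but one standard way to do the pick would be to map a uniform random value onto cumulative proportions. A minimal sketch of that idea (the function name and the requirement that the proportions sum to 1 are my own, and the real mintage figures would still need to be filled in):

// Hypothetical sketch of a weighted coin pick for the updated program.
// Proportions[] holds each design's share of circulating coins and is
// assumed to sum to 1.0; RandomValue is a uniform 32 bit random number
// (e.g. from the RC4 stream used in CoinProb below).

#include <stdint.h>

#define COIN_TYPES  31

static
int
    PickWeightedCoin
    (
        uint32_t        RandomValue,
        double const    Proportions [COIN_TYPES]
    )
{
    double  point = (double)RandomValue / 4294967296.0;    // scale into [0,1)
    double  cumulative = 0.0;
    int     i;

    for( i=0; i<COIN_TYPES; i++ )
    {
        cumulative += Proportions[i];
        if( point < cumulative )
        {
            return i;
        }
    }

    return COIN_TYPES - 1;      // guard against floating point rounding
}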

As of now I have 24 of the 31 coins, so 7 more to go. It also reminds me of the proof of work concept. Just as requiring more leading zero bits increases the number of hash attempts needed to find one by chance, the same applies with the coins: getting the first few coins is easy, but getting the 31st one will take the most work. Like a proof of work scheme, it is easy for someone to verify that you have a complete set compared with the work required to acquire the set in the first place.

I haven't made the modifications to use the correct proportions yet, but here is the source code of the current version. I'm not going to bother putting up binaries for this, as it basically prints an almost static table, but in case anyone is interested here it is. Note it links with LibMd5 and LibRc4 from previous projects.

//////////////////////////////////////////////////////////////////////////////////
//  CoinProb
//
//  Program to attempt to show odds of getting a complete set of 32 unique coins.
//  This is free and unencumbered software released into the public domain
//  June 2013 waterjuice.org
//////////////////////////////////////////////////////////////////////////////////

//////////////////////////////////////////////////////////////////////////////////
//  IMPORTS
//////////////////////////////////////////////////////////////////////////////////

#include <stdio.h>
#include <stdlib.h>
#include <stdint.h>
#include <sys/timeb.h>
#include <math.h>
#include <memory.h>
#include "LibRc4.h"
#include "LibMd5.h"

//////////////////////////////////////////////////////////////////////////////////
//  CONSTANTS
//////////////////////////////////////////////////////////////////////////////////

#define COIN_TYPES      32
#define NUM_ITERATIONS  500000

//////////////////////////////////////////////////////////////////////////////////
//  TYPES
//////////////////////////////////////////////////////////////////////////////////
typedef struct
{
    uint64_t        ReachedAfterThisManyCoins [COIN_TYPES];
} StatSet;

//////////////////////////////////////////////////////////////////////////////////
//  FUNCTIONS
//////////////////////////////////////////////////////////////////////////////////

//////////////////////////////////////////////////////////////////////////////////
//  SeedStreamWithRandom
//
//  Initialises an RC4 stream with a random key
//////////////////////////////////////////////////////////////////////////////////
static
void
    SeedStreamWithRandom
    (
        Rc4Context*         Context,
        uint8_t             ThreadNum
    )
{
    struct
    {
        uint8_t             threadNum;
        struct timeb        time;
        void*               address;
    } seedValues;
    MD5_HASH                hash;
    Md5Context              md5Context;

    seedValues.threadNum = ThreadNum;
    ftime( &seedValues.time );
    seedValues.address = Context;

    Md5Initialise( &md5Context );
    Md5Update( &md5Context, &seedValues, sizeof(seedValues) );
    Md5Finalise( &md5Context, &hash );

    Rc4Initialise( Context, &hash, sizeof(hash), 0 );
}

//////////////////////////////////////////////////////////////////////////////////
//  AcquireEntireSet
//
//  Gets an entire set of coins.
//////////////////////////////////////////////////////////////////////////////////
static
void
    AcquireEntireSet
    (
        Rc4Context*      RandContext,
        StatSet*         Statistics
    )
{
    int         totalCoins = 0;
    int         uniqueCoins = 0;
    uint32_t    value;
    int         sets [COIN_TYPES] = {0};

    memset( Statistics, 0, sizeof(StatSet) );

    while( uniqueCoins < COIN_TYPES )
    {
        // Get next random coin
        Rc4Output( RandContext, &value, sizeof(value) );
        value %= COIN_TYPES;

        totalCoins += 1;
        sets[value] += 1;
        if( 1 == sets[value] )
        {
            // First time we have seen this coin
            Statistics->ReachedAfterThisManyCoins[uniqueCoins] = totalCoins;
            uniqueCoins += 1;
        }
    }
}

//////////////////////////////////////////////////////////////////////////////////
//  main
//
//  Entry point
//////////////////////////////////////////////////////////////////////////////////
int
    main
    (
        void
    )
{
    Rc4Context  randContext;
    int         iteration;
    StatSet     stats;
    StatSet     statsTotals = {{0}};
    StatSet     statsSquares = {{0}};
    int         i;

    SeedStreamWithRandom( &randContext, 1 );

    for( iteration=0; iteration<NUM_ITERATIONS; iteration++ )
    {
        AcquireEntireSet( &randContext, &stats );
        for( i=0; i<COIN_TYPES; i++ )
        {
            statsTotals.ReachedAfterThisManyCoins[i] 
                += stats.ReachedAfterThisManyCoins[i];
            statsSquares.ReachedAfterThisManyCoins[i] 
                += ( stats.ReachedAfterThisManyCoins[i] 
                    * stats.ReachedAfterThisManyCoins[i] );
        }
    }

    // Display results
    printf( "%u Iterations\n", NUM_ITERATIONS );

    for( i=0; i<COIN_TYPES; i++ ) 
    {
        double  mean;
        double  sigma;
        double  sd;

        mean = (double)statsTotals.ReachedAfterThisManyCoins[i] 
            / (double)NUM_ITERATIONS;
        sigma = ( (double)statsSquares.ReachedAfterThisManyCoins[i] 
            / (double)NUM_ITERATIONS ) - ( mean * mean );
        sd = sqrt( sigma );

        printf( "%u after %.1f coins  (sd: %.1f)\n", i+1, mean, sd );
    }

    return 0;
}

This is one of the things I love most about programming. While most of the time I am writing programs for specific things at work, it's nice to be able to write a program to solve some problem as a hobby. After all, the computer is a fantastic calculator, and being able to control it to do what you wish is a very handy skill.

Personal GUIDs

NOTE: Page updated August 2019. The original version of the software PersonalGuid contained a bug and did not include the DateOfBirth in the data. It therefore never produced the correct GUIDs. A new version (1.0.1) has been released which fixes the issue (The linked project zip file has been updated).

I've always liked GUIDs, and I have always liked the idea of having "personal GUIDs". While everyone on the planet could be assigned a GUID, it would require some central authority to ensure each person only has one GUID rather than just generating a whole bunch. I like the idea of a GUID that is specific to someone based on their details, not just freshly generated.

Type 5 GUIDs are perfect for this use. A Type 5 GUID is based on a SHA1 hash of an arbitrary namespace ID followed by whatever value makes sense for that namespace. So I shall create my own standard for assigning a GUID to everyone.

First I make an ID string that is flexible enough to include everyone, but unique for each person. It should also be readily repeatable, producing exactly the same string each time for the same person.

My ID format is simple: it's a UTF-8 string in uppercase with semicolons separating the fields. There are 6 fields. The format is:

Surname;GivenNames;Sex;CountryOfBirth;PlaceOfBirth;DateOfBirth

All values should be uppercase. DateOfBirth is of the format YYYYMMDD in 8 decimal digits. Sex is either ‘M’, ‘F’, or ‘X’. GivenNames may have multiple names, separated by a single space between names. There must be no padding within the fields.

(Note: The original version of this used “Gender” rather than “Sex” and only accepted ‘M’ or ‘F’. This has been updated to include ‘X’ – August 2019)

An example ID string is:

DOE;JOHN;M;AUSTRALIA;SYDNEY;19700101
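
As a quick illustration (this is not the PersonalGuid source, just a sketch of the format rules), building such an ID string in C only needs an snprintf and an uppercase pass:

// Sketch of constructing an ID string in the format described above.
// The input values are the hypothetical example person used on this page.

#include <ctype.h>
#include <stdio.h>
#include <string.h>

int main( void )
{
    char    idString [256];
    size_t  i;

    // Surname;GivenNames;Sex;CountryOfBirth;PlaceOfBirth;DateOfBirth
    snprintf( idString, sizeof(idString), "%s;%s;%s;%s;%s;%s",
        "Doe", "John", "M", "Australia", "Sydney", "19700101" );

    // The specification requires the entire string to be uppercase.
    for( i=0; i<strlen(idString); i++ )
    {
        idString[i] = (char)toupper( (unsigned char)idString[i] );
    }

    printf( "ID: %s\n", idString );     // ID: DOE;JOHN;M;AUSTRALIA;SYDNEY;19700101
    return 0;
}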

Next I need a Namespace ID for this namespace I have created. A namespace ID is a GUID itself. I am assigning the GUID {5b390b3f-9a62-508a-b235-6e6e8d270720}. I could have generated any new GUID, however I decided to make the Namespace ID itself a Type 5 GUID: it is the type 5 GUID of the URL of this article (http://waterjuice.org/2013/06/personal-guids/).

So the Type 5 GUID created from that Namespace ID and the example ID string above is {18448b15-02cc-58f2-857d-c7a2c07728d2}. This is a GUID that is specific to that ID and recreatable at any time, as long as the name, birth date, and place don't change!

I have modified the program I wrote to create the Type 5 GUIDs for URLs into creating personal GUIDs using this format.

PersonalGuid.zip – C Source Code and Binaries for Windows, OSX, and Linux. This is free and unencumbered software released into the public domain.

Example usage:

>PersonalGuid.exe Doe John m Australia Sydney 19700101
ID: DOE;JOHN;M;AUSTRALIA;SYDNEY;19700101
{897469e0-c10b-5829-bd15-86309b1523cf}

So download and find your own personal GUID. Perhaps one day this will become the basis of an RFC defining this as a standard :-)

GitHub

I’ve put the CryptLib up on GitHub, as I thought that might be a useful way for people to find and get source code.

https://github.com/WaterJuice/CryptLib

This contains the C source code for MD5, SHA1, SHA256, SHA512, and RC4, along with the projects Md5String, Sha1String, Sha256String, Sha512String, and Rc4Output.

The current version (1.0.0) is now on GitHub, the zip can also be downloaded here.

CryptLib_1.0.0.zip

This is free and unencumbered software released into the public domain.

stdbool.h

The built-in types of C are not useful enough on their own as they vary between compilers. For example a long is 32 bits on Windows, but 64 bits on OSX. This makes using them in function prototypes problematic if word size is important. A lot of libraries define their own types, but again I don't want to have that dependency. Fortunately there are some standard .h files that come with most C compilers now. C99 defines several standard headers; one is stdint.h. This file defines data types such as uint32_t, which specifies an exact size. Microsoft's C compilers are not C99 compliant, however they do include a subset, and fortunately stdint.h is part of it, which means I can use those types without any problems on Windows. Both Linux and OSX (and probably most other gcc targets) have stdint.h as well.

While stdint.h provides most of the base types that I want to use, there is one obvious one missing. C itself does not define a boolean type, but C99 defines stdbool.h, which provides the type bool and the values true and false. These are very basic types that I wish were built directly into the compiler. Unfortunately MS C does not come with stdbool.h, even though it's only a couple of lines in length. I've seen arguments on the Internet about whether a boolean type is required or useful. Some people think not, but I personally like them. Although the type is nothing special, in fact probably just an int or a char underneath, its usage is different from a regular integer. Having an explicit type makes it clearer how the value is used. For example, if I have a function that returns a result that can only have two states (success or not success), then having the return type be bool is far clearer than returning an int. If it's an int I'm going to wonder what the possible return values are, maybe it returns different error codes. With a bool return type I can see that it simply returns a true/false state.

Because MSVC doesn't contain stdbool.h I initially resisted using it with these projects. But I have decided it's just too useful to ignore, so I have extended my build environment with my own version of stdbool.h which is only used on Windows builds. Most of the core library cryptographic functions won't use bool anyway, but some of the other projects will. This does mean there is one slight dependency when working in Windows, and that's the stdbool.h file I've added. It will be very easy for anyone to remove that dependency though: simply replace bool with int, false with 0, true with 1, and remove #include <stdbool.h>. This could even be done with a very simple script.

The contents of my stdbool.h file are as follows:

#ifndef _STDBOOL_H_
#define _STDBOOL_H_

#ifndef __cplusplus
#ifndef __bool_true_false_are_defined

typedef int     bool;
#define true    1
#define false   0

#define __bool_true_false_are_defined

#endif //__bool_true_false_are_defined
#endif //__cplusplus

#endif //_STDBOOL_H_

The __bool_true_false_are_defined definition is part of the C99 standard, so I check it and don't redefine anything if it's already set. Additionally I check that the compiler is not C++, because C++ natively has bool, false, and true.

Building non-Windows platforms from Visual Studio

To build on Windows I use the makefile. I have a batch file that calls make for the platform I want to build. For Windows this simply calls make after setting up the environment. In Visual Studio I have set up an external tool on the menu for building. This tool uses the command output window and therefore parses the errors, so I can hit a button on the toolbar, build using make, and navigate straight from any errors into the source. This isn't surprising as it's using the MS compiler, so the error format is going to be correct. I like to be able to build on all platforms just as easily.

I do most of the development on Windows, and use VS to edit the files even when working on OSX or Linux, as I like its editor. I can't compile the OSX or Linux builds directly from Windows though, as those need to be compiled on OSX or Linux respectively. My setup is an OSX computer with a Windows VM and a Linux VM (using VMware). I keep all the files I'm working on on OSX. The Linux and Windows VMs share those same folders using VMware's Shared Folders feature (I could also use network shares). I also have SSH set up from the Windows VM to OSX and Linux. To set off a build on Linux or OSX I use SSH to run make on the appropriate platform. They are all sharing the same files; the directories are similar but have different bases.

For example on Windows the files are visible from Z:\. This maps to "~" (the home directory) on OSX, and on Linux to /mnt/hgfs/osxhome. My Build.bat file takes a command line parameter to determine which build to do: for example "Windows" to do the Windows build, "OSX" to do the OSX build. If OSX is selected it takes the directory name and modifies it to the one used by OSX, by removing the drive letter and replacing backslashes with forward slashes. For example Z:\Code\WaterJuice\makefile becomes ~/Code/WaterJuice/makefile. On Linux it would become /mnt/hgfs/osxhome/Code/WaterJuice/makefile.

So from Visual Studio I can press a button and call Build.bat with Windows, OSX, or Linux. When it's OSX or Linux it changes the base path and then calls ssh to run make. The output is displayed in the Visual Studio command output window. This is almost perfect: I do not have to switch VMs, I stay within Visual Studio and can build on any platform. However the output of GCC is slightly different from Visual Studio's, so Visual Studio is unable to detect the errors directly. I have a small filter I wrote in Perl that modifies the output to be Visual Studio compatible.

First the file names need to be modified. gcc will obviously output the filenames that it sees. In order for VS to open the correct file they need to be converted to the names that Windows sees. So for example when gcc outputs the file name ./lib/Md5.c (it uses relative paths), the Perl script converts this to Z:\Code\WaterJuice\lib\Md5.c. The other thing it does is change the error and warning lines so that Visual Studio recognises them.

Visual Studio uses a line format such as:

lib\LibMd5.c(76): error C2065: 'variable' : undeclared identifier

Whereas GCC would display the same error as:

lib/LibMd5.c:76: error: ‘variable’ undeclared (first use in this function)

So a simple regex is used to convert the :nn: part to (nn). Now, with the corrected path, Visual Studio will bring up the file at the right line number when I press F4.

Here is the Perl script I use. I call it GccToVss.pl

$| = 1;
( $actPath ) = @ARGV;

while ( <STDIN> )
{
    # Replace UTF-8 quote marks with plain ascii
    s/\xe2\x80\x98/\'/g;
    s/\xe2\x80\x99/\'/g;

    # Convert error format and paths from gcc to vss
    s/([\w.\/]+):(\d+):(.*)/$actPath$1($2) : $3/;
    s/\//\\/g;

    print $_;
}

It takes a single parameter which is the actual Windows directory where the root of the project is (the directory containing the makefile).

The following is Build.bat, the batch file that sets off the builds. It also caters for building in Cygwin and an x86 Windows XP build.

@echo off
setlocal

set RELPATH=%~p0
set RELPATH=%RELPATH:\=/%
set ACTPATH=%~d0%~p0

set OSXBASE=~
SET OSXHOST=osxdev@10.0.1.10

set LINUXBASE=/mnt/hgfs/osxdev
SET LINUXHOST=linuxdev@10.0.1.14

SET PROJECT=%2 %3 %4 %5 %6 %7 %8 %9

%~d0
cd %~p0

if /I "%1" == "Windows" goto Windows
if /I "%1" == "WindowsXP" goto WindowsXP
if /I "%1" == "Cygwin" goto Cygwin
if /I "%1" == "OSX" goto OSX
if /I "%1" == "Linux" goto Linux
if /I "%1" == "All" goto All

echo Syntax:
echo    Build ^<target^>
echo      ^<target^> options:
echo        Windows   - Windows x64 Vista Build
echo        WindowsXP - Windows x86 XP Build
echo        Cygwin    - Cygwin Build
echo        OSX       - Remote OSX Build
echo        Linux     - Remote Linux Build
echo        All       - Builds ALL the above
goto :eof

:All
set ALL=1

:Windows
call "c:\Program Files (x86)\Microsoft Visual Studio 10.0\VC\vcvarsall.bat" amd64
set CL=/link /LIBPATH:C:\WinDdk\7600.16385.1\lib\wlh\amd64 /LIBPATH:C:\WinDdk\7600.16385.1\lib\crt\amd64 /subsystem:console,6.0
echo.
echo ::::::::::::: Building for Windows x64 Vista :::::::::::::::::
make
echo ---- Build complete for Windows x64 Vista
if "%ALL%" == "" goto :eof
echo.

:WindowsXP
call "c:\Program Files (x86)\Microsoft Visual Studio 10.0\VC\vcvarsall.bat" x86
set CL=/link /LIBPATH:C:\WinDdk\7600.16385.1\lib\wxp\i386 /LIBPATH:C:\WinDdk\7600.16385.1\lib\crt\i386 /subsystem:console,5.1
echo ::::::::::::: Building for Windows x86 XP :::::::::::::::::
make
echo ---- Build complete for Windows x86 XP
if "%ALL%" == "" goto :eof
echo.

:Cygwin
echo ::::::::::::: Building for Cygwin :::::::::::::::::
make 2>&1 | tools\GccToVss.pl %ACTPATH%
echo ---- Build complete for Cygwin
if "%ALL%" == "" goto :eof
echo.

:OSX
echo ::::::::::::: Building for OSX :::::::::::::::::
ssh %OSXHOST% "cd %OSXBASE%%RELPATH%; make %PROJECT%" 2>&1 | tools\GccToVss.pl %ACTPATH%
echo ---- Build complete for OSX
if "%ALL%" == "" goto :eof
echo.

:Linux
echo ::::::::::::: Building for Linux :::::::::::::::::
ssh %LINUXHOST% "cd %LINUXBASE%%RELPATH%; make %PROJECT%" 2>&1 | tools\GccToVss.pl %ACTPATH%
echo ---- Build complete for Linux
if "%ALL%" == "" goto :eof

The batch file contains a few hardcoded paths. They should be easy enough to modify to suit your environment.

This is as close as I can get to building directly from Windows itself. There is the possibility of using cross platform compilers on Cygwin to produce OSX and Linux binaries, but they are very complicated to set up. Plus I prefer to build on the native platform itself using the compilers provided.

Hopefully this article helps other people build Linux and OSX binaries from within Visual Studio.

Adding Linux and command line building

Originally I wrote the projects here for Windows and OSX only. That was because I wanted to use the Visual Studio and Xcode IDEs and produce project files for them. Linux doesn't really have an equivalent IDE; the standard way to build is from the command line using make. I didn't like missing out on Linux, as it is one of the main 3 desktop OSes (Windows, OSX, and Linux). So I created a makefile that would work for Linux. It also worked for OSX, as that has make too.

I also decided I didn't really like Xcode's IDE for building. It was rather tedious, and I was only using it to build; I use Visual Studio to edit the files. So I decided to scrap the Xcode project files. I also didn't like the way they appear as directories when viewed from Windows.

Now that I had a convenient makefile for building on OSX and Linux, I decided I wanted to include Windows in that. Visual Studio comes with nmake, which is not very compatible with the GNU make that Linux and OSX have. So I decided to ignore it and use make from Cygwin. Additionally this allowed me to easily compile using the Windows DDK libraries rather than the SDK ones that come with Visual Studio. I like to do this for the following reason: msvcrt.dll.

Back with Visual Studio 6, when you compiled against the DLL version of the C Run-time library (CRT) it linked against msvcrt.dll. This is a DLL that has been shipped with almost every version of Windows, and still is in Windows 8. However after Visual Studio 6, Microsoft decided that you shouldn't do this, and instead should link to the specific versions of the CRT that come with each Visual Studio. The problem is these DLLs are not shipped as standard with the operating system; Microsoft says to distribute them with its "handy" tool. However I like single .exe programs that don't depend on anything. For that reason I link with the static runtime library when using Visual Studio. That way it doesn't depend on a DLL; it does however increase the executable size, and I also like small exe files.

Despite Microsoft saying you shouldn't link to msvcrt.dll anymore, it turns out that they do for all their tools. The WinDDK (or WDK) libraries bind to msvcrt.dll for user-mode builds. While the WinDDK is used for kernel-mode development, it also has a lot of user-mode libraries. Personally I like to use the Visual Studio compiler, but with the WinDDK libraries, falling back to the VS ones for the libraries that the DDK does not have.

This can be set up in the IDE using the include paths and library paths in the project files. However I want to keep those simple and standard; not everyone is going to install the WinDDK. So I am leaving the project files statically linking against the Visual Studio libraries, but I extended the makefile to build with the WinDDK libraries when building on Windows.

Since I was using Cygwin for the make tool (and I always have Cygwin installed), I thought I'd add the option of compiling with gcc in Cygwin and producing Cygwin .exe files. These however are 32-bit files, as Cygwin is still only 32 bit. I also added to the makefile the option to create a 32-bit Windows exe targeted at Windows XP; the x64 ones are targeted at Vista. When I release projects here in the future I will include x64 binaries for Windows, OSX, and now Linux. I won't include either Cygwin or x86 Windows binaries, I just like to verify I can compile for them.

The following is my template makefile. It is fairly rough at this stage. For Windows it looks to see if the Visual Studio environment variables are set (which they are if you run the VS command prompt); otherwise it uses Cygwin.

OUTDIR = Bin/$(PLATFORM)
INCLUDES = -I ./lib

all: Md5String Sha1String

Md5String: dir
	$(COMPILE) projects/Md5String/Md5String.c lib/LibMd5.c lib/LibSha1.c lib/LibSha256.c lib/LibSha512.c $(TAIL)
	$(STRIP)

Sha1String: dir
	$(COMPILE) projects/Sha1String/Sha1String.c lib/LibSha1.c $(TAIL)
	$(STRIP)

dir:
	@mkdir -p $(OUTDIR)

###### Setup build parameters #####
ifeq ($(OS),Windows_NT)
    ifdef VSINSTALLDIR
        ifeq ($(Platform),X64)
            PLATFORM = Windows
        else
            PLATFORM = WindowsX86
        endif
        INTDIR = Build/$(PLATFORM)/$@
        COMPILE = @echo & echo ::::: Building $(PLATFORM) $@ & mkdir -p $(INTDIR) & cl $(INCLUDES) /nologo /Ox /Oi /Ot /GL /MD /W4 /WX /Fe$(OUTDIR)/$@ /D_CRT_SECURE_NO_WARNINGS /Fo$(INTDIR)/ 
        TAIL=/link /RELEASE
    else ifneq (,$(findstring /cygwin/,$(PATH)))
        PLATFORM = Cygwin
        COMPILE = @echo ::::: Building $(PLATFORM) $@ & gcc $(INCLUDES) -O3 -Wall -Werror -o $(OUTDIR)/$@
        STRIP = @strip $(OUTDIR)/$@
    else
        PLATFORM = None
        COMPILE = echo
        STRIP = 
        $(error Windows requires VS environment, or Cygwin)
    endif
else
    PLATFORM = $(shell uname)
    ifeq ($(PLATFORM),Darwin)
        PLATFORM=OSX
    else ifeq ($(PLATFORM),Linux)
        PLATFORM=Linux
    else
        $(error Unsupported platform. Non Windows platform support: OSX and Linux)
    endif       
    COMPILE = @echo ::::: Building $(PLATFORM) $@ & gcc $(INCLUDES) -O3 -Wall -Werror -pthread -o $(OUTDIR)/$@
    STRIP = @strip $(OUTDIR)/$@
    TAIL = -lm
endif

In order to build for Windows, you need a command prompt that is set up correctly. I made a little batch file that calls the VS batch file for setting up environment variables, then sets the CL environment variable to include the link paths to the WinDDK libraries. The batch file assumes the default locations for Visual Studio 2010 and WinDDK 7.1.

@echo off
call "c:\Program Files (x86)\Microsoft Visual Studio 10.0\VC\vcvarsall.bat" amd64
set CL=/link /LIBPATH:C:\WinDdk\7600.16385.1\lib\wlh\amd64 /LIBPATH:C:\WinDdk\7600.16385.1\lib\crt\amd64 /subsystem:console,6.0
cmd /k title Build Environment for Windows Vista x64

I moved the Visual Studio project files (.vcxproj and .vcxproj.filters) into a separate directory called VsProjectFiles. This keeps the source directories completely free of extra files that aren't .c or .h. The property files are also in there. When compiling with Visual Studio from the sln file it uses purely the Visual Studio libraries, and statically links against the CRT.

 

Type 3 and 5 GUIDs

This article talks about GUIDs, and in particular type/version 3 and 5 GUIDs (or UUIDs).

UrlGuid.zip – Public domain C source code for generating type/version 3 and 5 GUIDs. Includes x64 executables for OSX and Windows. This is free and unencumbered software released into the public domain.

I love the concept of GUIDs (or UUIDs). They are 128 bit IDs in a standardised format that are unique across all namespaces. It is somewhat unclear whether the name is actually GUID (Globally Unique Identifier) or UUID (Universally Unique Identifier). The best explanation of the difference is that Microsoft uses the term GUID, and UUID is generally used elsewhere, but they both refer to the format defined by RFC4122 ("A Universally Unique IDentifier (UUID) URN Namespace"). Microsoft's main contribution seems to have been to add the curly braces around the string that is commonly seen. I am going to use the term GUID from now on.

Most people have probably seen a GUID; they look like this: {855273c6-bca6-439e-8880-7a875c4394ed}. Sometimes they don't have curly braces, and sometimes they are uppercase, and occasionally mixed case. In all cases they are just a string representing a 128 bit number in big endian format. Because it was designed by committee it has more complexity than is required. They could have simply displayed the GUID as one long hex string; instead they decided to break it up into groups of 4, 2, 2, 2, and 6 bytes. RFC4122 divides the 128 bit structure into various fields, however the fields really only make sense in a type 1 GUID. For the most part no one cares about any of the fields and it should be treated as one single 128 bit number.

The interesting thing about GUIDs is that every GUID created should be unique without requiring a central registration authority. It gets this property mainly from the fact that 128 bits is a large space, and if you randomly pick a 128 bit number it is incredibly unlikely anyone else will pick the same one. The most common GUIDs seen are type (or version) 4 GUIDs. 122 bits of a type 4 GUID are purely randomly generated; 4 bits are set as the type number, and another 2 bits are set as the RFC4122 variant. The version/type number of any GUID can easily be seen: it is the 13th hex digit (the 1st digit in the 3rd group). As you can see from the example GUID I showed before, {855273c6-bca6-439e-8880-7a875c4394ed}, the version value is 4.

Type 1 GUIDs, identified by a 1 in the 13th digit, have a more structured format. Rather than relying on pure randomness to hope that the same GUID is not generated twice, a more defined procedure is used. The last block of 12 digits is set to the 48 bit MAC address of the computer that generated the GUID. As MAC addresses are supposed to be unique, this partitions the space into unique groups for each computer. Additionally a high precision time value is used to partition the space based on time. Then there are some random bits as well as sequence values to ensure the same computer doesn't accidentally generate the same GUID twice even if its clock gets messed up.

While Type 1 GUIDs use a good scheme to avoid collisions, they are unpopular because they give away information. In particular they give away when the GUID was generated and, more significantly, the MAC address of the computer that generated it. In fact it was the use of Type 1 GUIDs in Microsoft Office documents that helped lead to the arrest of the "Melissa Worm" author over a decade ago. That was a high profile case and made people very uneasy about GUIDs that could be used to track them. For example, I may use my computer for two distinctly different roles and not wish to have things produced in one role be in any way relatable to the second role.

After the Melissa Worm incident MS changed to using Type 4 GUIDs. As these are completely randomly generated, it cannot be determined whether two GUIDs were created by the same computer or person.

The most common use of GUIDs is to give a unique ID to some newly created item. It could be a database entry, a ticket number, or anything. For this use a Type 1 or Type 4 GUID is perfect, and almost all the literature on the Internet refers only to these two types. However there are cases where you want a unique, but repeatable, identifier for an object. For example URLs.

A URL is a unique ID. However, because its size is variable, it may not be convenient to use directly as an ID. The obvious thing to do is to take a hash of the URL and use that as the ID value; now you have a fixed size ID. There are two versions of GUIDs that do exactly this. The advantage of using these over straight hashes is that the result remains in the same namespace as the rest of the GUIDs, while being guaranteed not to clash with Type 1 and 4 ones. The two versions are 3 and 5.

To create a version 3 or 5 GUID you first need a namespace ID. This is a GUID itself; you can generate your own GUID to represent your namespace. This prevents items in different namespaces that happen to have the same name from producing the same GUID. For example I might want a namespace whose values are colours, e.g. my namespace consists of values from the set { red, blue, green, yellow }. Another namespace may be that of all English words, which clearly would contain the values in my colour set. The same GUID should only be generated for "red" when it is used in the same namespace.

The 128 bits of the namespace ID are prepended to the named value (e.g. "red") and either an MD5 or a SHA1 hash is performed. Note the named value does not have to be text. It is whatever the namespace ID represents; it can be of any form, including a hash itself, or a GUID. The output of the hash is mostly used directly as the GUID, however only 122 bits of the GUID value can be arbitrarily set. The 13th hex digit of the hash is replaced by the version/type value: if the hash is MD5 then the value 3 is used, if the hash is SHA1 then the value 5 is used. Additionally the top two bits of the 17th hex digit (1st digit in the 4th group) are set to 10 (in binary). This defines the "variant" of the GUID as RFC4122 compliant, and means that the 17th hex digit of any RFC4122 GUID (i.e. almost all of them) will only have 4 possible values: 8, 9, a, or b.
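
To make the bit manipulation concrete, here is a minimal sketch (not the UrlGuid source) of building a type 5 GUID as 16 raw big-endian bytes. It assumes a LibSha1 API shaped like the LibMd5 one shown earlier (Sha1Initialise/Sha1Update/Sha1Finalise with a 20 byte SHA1_HASH):

// Sketch of type 5 GUID construction: hash the namespace ID followed by the
// name, take the first 128 bits, then force the version nibble and the two
// RFC4122 variant bits.

#include <stdint.h>
#include <string.h>
#include "LibSha1.h"

static
void
    MakeType5Guid
    (
        uint8_t const*      NamespaceGuid,      // 16 bytes, big endian
        void const*         Name,
        uint32_t            NameSize,
        uint8_t*            GuidOut             // 16 bytes out, big endian
    )
{
    Sha1Context     context;
    SHA1_HASH       hash;

    Sha1Initialise( &context );
    Sha1Update( &context, NamespaceGuid, 16 );      // namespace ID first
    Sha1Update( &context, Name, NameSize );         // then the named value
    Sha1Finalise( &context, &hash );

    memcpy( GuidOut, &hash, 16 );                               // first 128 bits of the hash
    GuidOut[6] = (uint8_t)( (GuidOut[6] & 0x0F) | 0x50 );       // version 5 in the 13th hex digit
    GuidOut[8] = (uint8_t)( (GuidOut[8] & 0x3F) | 0x80 );       // variant bits 10 in the 17th hex digit
}

A type 3 GUID follows the same steps with MD5 instead of SHA1 and a version nibble of 3.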

RFC4122 defines four standard name space IDs.

NameSpace_DNS = {6ba7b810-9dad-11d1-80b4-00c04fd430c8}
NameSpace_URL = {6ba7b811-9dad-11d1-80b4-00c04fd430c8}
NameSpace_OID = {6ba7b812-9dad-11d1-80b4-00c04fd430c8}
NameSpace_X500 = {6ba7b814-9dad-11d1-80b4-00c04fd430c8}

So for example, to make a type 3 GUID from the DNS name waterjuice.org, the 128 bit value of NameSpace_DNS (in big endian) is prepended to the ascii string "waterjuice.org" and then an MD5 hash calculation is made. This results in the MD5 hash: bed9e1fe7b92c85c82aa104c8c45bccf. The 13th digit is replaced with a 3, and the top 2 bits of the 17th digit are set to 10 binary (in this case those bits already were set that way). The final result, displayed as a GUID, is:

{bed9e1fe-7b92-385c-82aa-104c8c45bccf}

Using SHA1 instead of MD5 the following type 5 GUID represents waterjuice.org:

{c1a7f72b-f2ee-5455-990c-308ae1450a0e}

Unfortunately RFC4122 contains an error in the example it provides for creating a type 3 GUID. Appendix B gives an example of a type 3 GUID from the DNS name “www.widgets.com”. It incorrectly displays the result as {e902893a-9d22-3c7e-a7b8-d6e313b71d9f}. This was corrected in errata 1428. The correct result is {3d813cbb-47fb-32ba-91df-831e1593ac29}. Unfortunately the incorrect value is often requoted on other sites on the internet.

There are countless programs that generate type 1 and type 4 GUIDs, and there are many online generators that you can use. Additionally OSX comes with uuidgen, a command line tool that creates type 4 GUIDs, while Visual Studio ships uuidgen.exe, which will generate type 1 or type 4 GUIDs. There are, however, very few examples of generators for types 3 and 5. So I have written a program in C to create type 3 and type 5 GUIDs from either a DNS name or a URL.

UrlGuid.zip – Public domain C source code for generating type/version 3 and 5 GUIDs (UUID). Includes x64 executables for OSX and Windows.

This is free and unencumbered software released into the public domain.

Example usage:

>UrlGuid.exe http://waterjuice.org/2013/06/type-3-and-5-guids/
{cf0aff65-65c3-35a2-b04f-923e1f8bf1f3}
{3a8df732-bc01-57ab-bf7f-94461ee1057b}

As there are few examples on the internet (and some of them are wrong) I will include some type 3 and 5 GUIDs for some common DNS names and URLs. In each case the type 3 GUID is listed first, followed by the type 5.

google.com

    {9a74c83e-2c09-3513-a74b-91d679be82b8}
    {64ee70a4-8cc1-5d25-abf2-dea6c79a09c8}

www.wikipedia.org

    {ea68c8dd-6a38-39ac-8fa4-d55bfb7018fd}
    {02c295f5-54a8-5d29-8d1f-b619216b20c0}

unlicense.org

    {a209c2da-4d39-395e-97b3-38e58d800777}
    {d89fa53a-36cb-5b1c-819a-8b3327182ced}

http://en.wikipedia.org/wiki/Globally_unique_identifier

    {734d5d9c-ea54-3ef9-8d40-fa7db2281e24}
    {d18e9b50-d196-558d-ade3-6f54da5608fb}

https://www.google.com/webhp?hl=en&tab=ww

    {5404423a-b81e-323c-be9b-d3446d606b35}
    {c2b80bf8-7e38-54d2-b0d7-d842eb3e173d}

http://unlicense.org/UNLICENSE

    {456e7994-3b66-316f-839c-68b4b30b5b85}
    {b889eec7-f11a-592e-899b-382086d69067}

SHA-1

MD5 is a widely used hash system; it was invented in 1991. In 1996 a flaw was found that made it vulnerable to certain types of attacks, although it wasn't until 2004 that a more serious flaw was found and the system declared "broken". SHA-1 was created as a replacement for MD5. It uses similar principles, but is a more conservative design, and it outputs a 160 bit hash rather than 128 bits. Due to the flaws found in MD5, SHA-1 was promoted in its place. Over the years weaknesses in SHA-1 have been found too, however it is still better than MD5. At present MD5 is considered completely broken, whereas SHA-1 still holds up reasonably against the most concentrated attacks. The best known attack requires 2^61 SHA-1 calculations, and the computing resources required to perform that are still enormous.

MD5 and SHA-1 are probably the most common hash systems in use, with SHA-1 having replaced MD5 in many cases. The larger hash size alone makes it better, as the chance of collisions is further reduced. SHA-1 does come at a cost though: it takes more work to calculate a SHA-1 hash than an MD5 hash, and the output hash takes up more storage. However, if only 128 bits of a hash are wanted, it is acceptable to take the first (or in fact any) 128 bits of a SHA-1 hash. This is still better than using MD5.

I wrote the Proof-Of-Work program using MD5 as it was fast. However I want to build up a library of public domain C source code of several hash functions. SHA-1 is the next obvious one. So I present here source code for SHA-1 hashes, including a program to calculate a SHA-1 hash from the command line.

Sha1String.zip – Contains full source code for SHA-1 and the program Sha1String. Includes x64 binaries for OSX and Windows.

The functions and types in LibSha1.h match the ones in LibMd5.h presented earlier. This makes it very easy to substitute one system with the other. When I add more hash algorithms in the future, they will also follow the same format.

This is free and unencumbered software released into the public domain.

 

Which hash should I use?

As far as secure message hashes go, MD5 and SHA1 are considered broken. MD5 is very broken, while SHA1 is quite damaged. In either case they are not good choices for secure message hashes in uses such as digital signatures. But digital signatures are only one use of hashes; there are other situations where the security implications of a hash are not important.

If, for example, you wish to use a hash to verify a data transfer, then in most cases any of the hash algorithms will be sufficient. The weaknesses in MD5 and SHA1 require a malicious adversary to expend a lot of computing effort to create a hash collision. If you are simply trying to verify that data has not been corrupted while being transferred over a medium, then this is not a problem. Take for example a protocol where a data buffer is transmitted followed by a hash of that buffer. There is nothing in this protocol that prevents a malicious person from changing the data to whatever they want; they can simply change the hash value as well as the data. The only thing the protocol protects against is accidental corruption of the data, such as line noise. The odds of random data corruption producing an identical hash are so astronomically small that it can be considered a 100% guarantee of message integrity for all intents and purposes. In that case the best choice of hash algorithm is most likely the one that is fastest and most efficient.
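
A minimal sketch of that kind of integrity check, using the same LibMd5 calls that appear in the CoinProb source above (the function name and parameter types here are my own):

// Recompute the hash of a received payload and compare it with the hash that
// was transmitted alongside it. This only detects accidental corruption; an
// attacker could rewrite both the payload and the hash.

#include <stdbool.h>
#include <stdint.h>
#include <string.h>
#include "LibMd5.h"

static
bool
    PayloadIsIntact
    (
        void const*         Payload,
        uint32_t            PayloadSize,
        MD5_HASH const*     TransmittedHash
    )
{
    Md5Context  context;
    MD5_HASH    computed;

    Md5Initialise( &context );
    Md5Update( &context, Payload, PayloadSize );
    Md5Finalise( &context, &computed );

    return ( 0 == memcmp( &computed, TransmittedHash, sizeof(MD5_HASH) ) );
}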

As MD5, SHA1, and SHA256 all use 32 bit operations, with each successive hash algorithm being more complicated than the preceding one, it seems likely that MD5 will be the fastest and SHA256 the slowest. SHA512 uses 64 bit operations and therefore processes twice as much data at a time. Rather than just guessing which one is faster, it is fairly easy to write a speed testing program. Remember it's not just the algorithm itself, it's the implementation that affects the speed: two MD5 implementations can differ greatly in speed.

I have written a speed test for the four hashes I have now.

HashSpeedTest.zip – Full C source code and x64 binaries for OSX and Windows.

I measure the speed of the hash algorithms in two ways. First, the number of hashes per second on small data (that is, data that only requires one hash calculation). This measures the rate that would be seen in something like the Proof of Work program. The second measurement is the amount of data per second that can be hashed on large data buffers.
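
For the curious, the small-data measurement needs nothing more than a timed loop. This is just a sketch of the idea (not the HashSpeedTest source), using the LibMd5 calls shown earlier:

// Rough measurement of MD5 hashes per second on a small buffer: hash the same
// 64 byte buffer repeatedly for about a second of CPU time and divide the
// count by the elapsed time.

#include <stdint.h>
#include <stdio.h>
#include <time.h>
#include "LibMd5.h"

int main( void )
{
    uint8_t     buffer [64] = {0};
    MD5_HASH    hash;
    uint64_t    count = 0;
    clock_t     start = clock();
    double      seconds;

    while( clock() - start < CLOCKS_PER_SEC )
    {
        Md5Context  context;

        Md5Initialise( &context );
        Md5Update( &context, buffer, sizeof(buffer) );
        Md5Finalise( &context, &hash );
        count += 1;
    }

    seconds = (double)(clock() - start) / (double)CLOCKS_PER_SEC;
    printf( "MD5  %.3f MHashes/s\n", ((double)count / seconds) / 1000000.0 );

    return 0;
}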

Depending on the application either measurement might be more relevant. For example when hashing files, in general it will be the amount of data per second that is relevant for affecting overall speed. But in some applications, such as Bitcoin hashes, then it is the number of hashes per second that is relevant.

Here is the output running the OSX version on a 2.6 GHz Intel i7

MD5     6.498 MHashes/s   463.16 MBytes/s
SHA1    2.409 MHashes/s   338.21 MBytes/s
SHA256  2.254 MHashes/s   156.83 MBytes/s
SHA512  1.520 MHashes/s   244.93 MBytes/s

Note, this is single threaded. Depending on the application (such as proof of work in Bitcoin) the rate can be multiplied by the number of processors available. However for calculating the hash of a single file the calculation cannot be performed in parallel, as it is a serial operation.

It can be seen that MD5 is clearly the winner on both counts. So if your application does not require a secure hash then MD5 could be a good choice. Remember these results are just for this implementation compiled with Xcode; while looking for MD5 implementations I was first using a different one that was half the speed of this one.

Interestingly, the compiler's optimisation plays a big role in the result. The difference between Xcode and MS Visual Studio is interesting in that it is not consistent between algorithms. Because they optimise in different ways, some operations are faster in one system but others are faster in the other. Running the Windows version of the program on the same computer gave the following results:

MD5      5.995 MHashes/s    401.29 MBytes/s
SHA1     3.317 MHashes/s    390.18 MBytes/s
SHA256   2.562 MHashes/s    184.98 MBytes/s
SHA512   1.618 MHashes/s    289.01 MBytes/s

While MD5 performed worse when compiled with MS Visual Studio, SHA1 was better.

Despite it being 12 years since SHA2 came out, MD5 and SHA1 are still the main standards in use. And even when SHA3 arrives in the near future (the algorithm has already been picked), it's likely we'll still see MD5 and SHA1 for quite some time.

SHA-2

One of the goals I have with this blog is to build up a collection of public domain C libraries for various cryptographic functions. Already I have MD5 and SHA1; the next obvious one is SHA2. Once again I have made an example program, compiled for x64 on OSX and Windows, that calculates SHA256 and SHA512 hashes of a string from the command line.

Sha2String.zip – Full source code and Windows and OSX x64 binaries.

SHA1 became a replacement for MD5 because MD5 had weaknesses. Unfortunately SHA1 also has weaknesses, but it took longer before they became apparent. SHA2 was designed to replace SHA1 and was created in 2001. Unfortunately SHA2 was designed by committee, and as is often the case when designing a replacement, the end result is far more complicated than required to fix the original problem.

MD5 produces 128 bit hashes. SHA1 produces 160 bit hashes. It would seem logical for SHA2 to be an algorithm that produces one fixed hash length, most likely 256 bits. Instead SHA2 is a set of functions producing six different hashes, with four different sizes.

SHA256 is the name given to the SHA2 function that produces a 256 bit hash. Like MD5 and SHA1, this function uses 32 bit operations and works well on a 32 bit processor. SHA512 uses the same algorithm but with 64 bit operations instead of 32 bit, and produces a 512 bit hash. There are a few other differences, such as the number of rounds and the initial values.

In my opinion if SHA2 had to have more than one hash size it should have stopped there, with two functions: a 256 bit one and a 512 bit one. Besides, if you need a smaller size you can simply truncate the final output of the hash. However the SHA2 committee decided that two additional sizes were required, 224 bit and 384 bit. Rather than just truncating the output of the larger hashes, they also changed the initial values used in the hash calculations. So SHA224 is a truncated version of SHA256 but with different initial values, and the same goes for SHA384 being a truncated version of SHA512. The different initial values don't add anything to the strength of the hashes, they just make it impossible to create a SHA224 hash from a SHA256 hash; you have to perform the entire calculation again.

To me the 224 and 384 variations are worthless. If you have an application that absolutely needs 384 bits then use a SHA512 and truncate the result to 384 bits. Having a different official standard doesn’t offer anything useful. So now there are four SHA2 functions: SHA224, SHA256, SHA384, and SHA512. Unfortunately it gets worse, there are actually two more.

Because SHA512 works on 64 bit values, it is more efficient than SHA256 on a 64 bit processor, as it works through the data in larger chunks at a time. However on a 32 bit processor the code takes about four times as long. When SHA2 was invented, 64 bit processors were not commonplace in desktop computers; now they are standard, and on them a SHA512 calculation is faster than a SHA256 because of the higher data throughput. In 2012 two more functions were added to the SHA2 family: SHA-512/224 and SHA-512/256. Both of these use the SHA-512 function, but again with different initial values, and the final output truncated to 224 or 256 bits.

As of this writing all SHA2 functions are considered secure, and not broken.

As I personally only like SHA256 and SHA512, I will ignore the other four variations and not add them to my hash library. I have written SHA256 and SHA512 as independent modules. You can take either one without requiring the other.

The functions are defined with the same prototypes as MD5 and SHA1 for easy compatibility. Once again I have produced string-to-hash programs, in this case two: Sha256String and Sha512String, which can be used to verify the hash functions.

Sha2String.zip – Source code and binaries.

This is free and unencumbered software released into the public domain.