Daily Archives: June 10, 2013

Threading and synchronisation

Not that many years ago most desktop computers had single-core processors, which can execute only one stream of instructions at a time. Of course, operating systems have been multi-threaded for many years to allow users to run more than one program at once, and a lot of Windows programs were multi-threaded, as it is a convenient method for programming interactive GUI applications. However, if you were writing a program to perform long calculations, you could quite reasonably write it as a single-threaded application. After all, you would get close to 100% CPU usage out of your single thread.

Most computers these days have multi-core processors, or multiple processors. This means if you run the same single-threaded application that used to take 100% of a uniprocessor system, it will now only get a fraction of the total computing power. On an 8 core system a single thread can only get 12.5% of the CPU. In order to get full performance you need as many threads as there are cores or processors. You will also need a task that can be broken up into parallel components. For example, calculating the MD5 of a single file is a serial task: you cannot calculate the next part of the hash until you have performed the calculation of the previous part. If instead your task was to calculate the MD5 sum of each block of the file, then this can easily be performed in parallel, as each block calculation is independent of the others.

If two tasks are entirely independent of each other they can be performed in separate threads with no interaction between them. In general though, tasks will be related and at some point need to interact with each other; some data may need to be shared between the threads. At this point synchronisation becomes important. With a single threaded program you know only one thing is changing any data at any one time. With multiple threads working on the same data set, one thread can corrupt the results of another unless care is taken to keep them synchronised.

The simplest form of thread synchronisation is the spinlock. A spinlock is a shared data value that can be compared and updated with an atomic instruction. An atomic instruction is one that has no intermediate states that another thread could see. The common atomic instruction used to implement spinlocks is compare-and-exchange. The x86 family of processors has the cmpxchg instruction, which when used with the lock prefix becomes an atomic instruction. This means that while one thread is performing this instruction, no other thread can see or change the value of the spinlock.

C does not provide any threading or synchronisation systems as part of the language. On Unix-based systems there is pthreads, which implements threads and synchronisation methods, while Windows has its own system. I wrote a very simple module that provides basic threads and spinlocks that works on Windows and OSX.

The spinlock is defined as typedef volatile long THREADSPINLOCK; 

The volatile keyword instructs the C compiler not to make any optimisation assumptions about the state of the variable, and to re-read it every time it is accessed.

A spinlock is created and initialised simply by declaring the variable and assigning it 0. A spinlock will usually be a global variable, as it has to be seen by more than one thread. I have one function for acquiring the spinlock.

////////////////////////////////////////////////////////////////////////////////
//  ThreadSpinLockAcquire
//
//  Acquires the thread spinlock. This will wait (spin) until the lock is
//  acquired. Use this to protect a resource from multi thread access. Do not
//  hold the spinlock for long periods of time.
////////////////////////////////////////////////////////////////////////////////
void
    ThreadSpinLockAcquire
    (
        THREADSPINLOCK*     SpinLockPtr
    )
{
#if defined( _WIN32 )
    while( InterlockedCompareExchange( SpinLockPtr, 1, 0 ) )
    {
        Sleep( 0 );
    }
#elif defined( __APPLE__ )
    while( !__sync_bool_compare_and_swap( SpinLockPtr, 0, 1 ) )
    {
        pthread_yield_np( );
    }
#endif
}

In Windows I use the InterlockedCompareExchange function, while in OSX I use __sync_bool_compare_and_swap. These functions perform the same task, though inconveniently take the parameters in a different order.

The spinlock acquire function runs in a loop. In an atomic operation it compares the value of the spinlock with 0: if it is 0 it changes the value to 1, otherwise it leaves it alone. If the value was 0, it is now 1 and the loop exits. If the value was 1 to start with, the loop yields execution for the thread and then tries again. It keeps on trying until the value becomes 0, then it “acquires” the lock by setting it to 1 and leaving the loop.

Whenever a thread returns from the ThreadSpinLockAcquire function it is the only thread that currently possesses the spinlock. All other threads attempting to acquire it will spin until it is released.

The release function simply sets the value back to zero.

//////////////////////////////////////////////////////////////////////////////////
//  ThreadSpinLockRelease
//
//  Releases the thread spinlock. Do not call this unless you have acquired the
//  spinlock.
//////////////////////////////////////////////////////////////////////////////////
void
    ThreadSpinLockRelease
    (
        THREADSPINLOCK*     SpinLockPtr
    )
{
    *SpinLockPtr = 0;
}

The spinlock mechanism is very simple to implement, and unlike most synchronisation mechanisms it requires no initialisation code. It is however inefficient. While other threads are waiting to acquire the spinlock they “spin”, meaning they are constantly running and checking the value. To improve performance we put a thread yield instruction in there. When there are multiple threads waiting on a spinlock, it is just luck which thread will get it once the lock has been released. OS-provided synchronisation methods will generally have a better scheduling system to determine who gets the lock next, and the waiting threads don’t get scheduled in until they are ready to receive the lock. So for highly contended locks, a system-provided mechanism such as a mutex should be used. But the spinlock is simpler, and if used right it can be sufficient.

The higher the contention on a lock, the more inefficient it becomes. If you have 8 threads spending most of their time waiting on a lock, then you haven’t gained much from multiprocessing. Ideally a lock is rarely contended, which means thread execution is not interrupted. A spinlock should be held for the shortest amount of time possible, so perform all the operations you can before acquiring it. For example, if you were inserting an element into a linked list, it is best to allocate the new element first and fill in all its entries, then acquire the spinlock, attach the element to the global linked list, and release. Don’t perform the allocation and the setting of the entries while holding the spinlock, as that holds up other threads unnecessarily.
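That pattern can be sketched concretely. The following is a minimal single-file illustration, assuming a GCC/Clang compiler for the __sync_bool_compare_and_swap builtin in place of the platform wrappers above; the list type and function names are made up for the example.

```c
#include <stdlib.h>

typedef volatile long THREADSPINLOCK;

typedef struct Node
{
    int             Value;
    struct Node*    Next;
} Node;

static THREADSPINLOCK  gListLock = 0;
static Node*           gListHead = NULL;

// Spin until the lock value changes from 0 to 1 atomically.
static void SpinAcquire( THREADSPINLOCK* LockPtr )
{
    while( !__sync_bool_compare_and_swap( LockPtr, 0, 1 ) )
    {
        // busy wait; a real implementation would yield here
    }
}

static void SpinRelease( THREADSPINLOCK* LockPtr )
{
    *LockPtr = 0;
}

// Allocate and fill the node BEFORE taking the lock. Hold the
// lock only for the single pointer swap onto the shared list.
int ListInsert( int Value )
{
    Node* node = malloc( sizeof(Node) );
    if( NULL == node )
    {
        return 0;
    }
    node->Value = Value;

    SpinAcquire( &gListLock );
    node->Next = gListHead;
    gListHead = node;
    SpinRelease( &gListLock );

    return 1;
}
```

The expensive part (allocation) happens outside the critical section, so the lock is held only for two pointer writes.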

A C program starts (usually in its main function) running one thread. In order to become multithreaded you have to create more threads. This is platform specific: Unix-like platforms use pthreads, while Windows uses its own CreateThread. There are a lot of functions relating to threads. Once you create one, you can monitor it, wait for it to close, etc. For my task I simply want to create threads and let them run until the program ends. I wrote a simple wrapper for Windows’ CreateThread and pthreads’ pthread_create.

////////////////////////////////////////////////////////////////////////////////
//  ThreadLaunchNewThread
//
//  Launches a new thread. Passes the pointer Context through to the new thread
//  function. Returns zero value on failure, or 1 if success
////////////////////////////////////////////////////////////////////////////////
int
    ThreadLaunchNewThread
    (
        ThreadStartFunction     StartFunction,
        void*                   Context
    )
{
#if defined( _WIN32 )
    HANDLE                          threadHandle = 0;
#elif defined( __APPLE__ )
    pthread_t                       threadHandle = 0;
    pthread_attr_t                  attributes;
    int                             errors;
#endif
    uint32_t                        success = 0;
    InternalThreadStartParameters   startParams = {0};

    startParams.StartFunction = StartFunction;
    startParams.Context = Context;
    ThreadSpinLockAcquire( &startParams.StartSpinLock );

#if defined( _WIN32 )
    threadHandle = CreateThread(
        NULL,
        0,
        InternalThreadStart,
        &startParams,
        0,
        NULL );
    if( NULL != threadHandle )
#elif defined( __APPLE__ )
    pthread_attr_init( &attributes );
    errors = pthread_create(
        &threadHandle,
        &attributes,
        InternalThreadStart,
        &startParams );
    if( 0 == errors )
#endif
    {
        // The thread will release the spinlock, so attempt to reacquire it. This will mean we
        // will be in sync.
        ThreadSpinLockAcquire( &startParams.StartSpinLock );

        // Close the handle as we do not need it anymore. Thread has been created
#if defined( _WIN32 )
        CloseHandle( threadHandle );
#elif defined( __APPLE__ )
        pthread_attr_destroy( &attributes );
#endif
        success = 1;
    }
    else
    {
        // Failed to create thread
        success = 0;
    }

    return success;
}

////////////////////////////////////////////////////////////////////////////////
//  InternalThreadStart
//
//  This is the function stub that is called by the Windows CreateThread or
//  pthread_create function. This will launch our thread.
////////////////////////////////////////////////////////////////////////////////
#if defined( _WIN32 )
static
DWORD
WINAPI
    InternalThreadStart
    (
        LPVOID          Parameter
    )
#elif defined( __APPLE__ )
static
void*
    InternalThreadStart
    (
        void*            Parameter
    )
#endif
{
    InternalThreadStartParameters*      parameters
        = (InternalThreadStartParameters*)Parameter;
    ThreadStartFunction                 startFunction = 0;
    void*                               context = 0;

    startFunction = parameters->StartFunction;
    context = parameters->Context;

    // Release spinlock, After this we can no longer access parameters.
    ThreadSpinLockRelease( &parameters->StartSpinLock );

    // Call start function
    startFunction( context );

    // Thread ends    
    return 0;
}

This function ThreadLaunchNewThread takes two parameters. The first is the thread function you want the thread to start with. The second is an arbitrary pointer that will be passed to the thread function. The thread function has a simple prototype:

typedef
void
    (*ThreadStartFunction)
    (
        void*       Context
    );

The thread will run until this function returns.
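The start-up handshake can be shown in a self-contained form. This is a sketch using raw pthreads and the GCC/Clang __sync builtin rather than the module's cross-platform wrappers, so it only covers the pthread side; the structure layout is an assumption based on the fields used above, and DemoTask/DemoHandshake are purely illustrative.

```c
#include <pthread.h>
#include <stddef.h>

typedef volatile long THREADSPINLOCK;
typedef void (*ThreadStartFunction)( void* Context );

// Assumed layout of the parameter block (based on the fields used above).
typedef struct
{
    ThreadStartFunction StartFunction;
    void*               Context;
    THREADSPINLOCK      StartSpinLock;
} InternalThreadStartParameters;

static void SpinAcquire( THREADSPINLOCK* LockPtr )
{
    while( !__sync_bool_compare_and_swap( LockPtr, 0, 1 ) ) { }
}

static void SpinRelease( THREADSPINLOCK* LockPtr )
{
    *LockPtr = 0;
}

static void* ThreadStub( void* Parameter )
{
    InternalThreadStartParameters* params = Parameter;
    ThreadStartFunction startFunction = params->StartFunction;
    void*               context       = params->Context;

    // params points at the launcher's stack frame. Releasing the lock
    // signals that everything has been copied out of it.
    SpinRelease( &params->StartSpinLock );

    startFunction( context );
    return NULL;
}

int LaunchThread( ThreadStartFunction StartFunction, void* Context )
{
    pthread_t                     thread;
    InternalThreadStartParameters params = {0};

    params.StartFunction = StartFunction;
    params.Context       = Context;
    SpinAcquire( &params.StartSpinLock );

    if( 0 != pthread_create( &thread, NULL, ThreadStub, &params ) )
    {
        return 0;
    }

    // Blocks until the new thread has released the lock, so params
    // can safely go out of scope when this function returns.
    SpinAcquire( &params.StartSpinLock );
    pthread_detach( thread );
    return 1;
}

// Illustrative thread function: sets a flag supplied via Context.
static void DemoTask( void* Context )
{
    *(volatile int*)Context = 1;
}

// Launches DemoTask and waits for it to run. Returns 1 on success.
int DemoHandshake( void )
{
    volatile int flag = 0;
    if( !LaunchThread( DemoTask, (void*)&flag ) )
    {
        return 0;
    }
    while( !flag ) { }  // spin until the thread has run
    return 1;
}
```

Without the handshake, the launcher could return and destroy the parameter block while the new thread was still reading it.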

I now have all the components I need for my article on Proof-of-Work. I added just one more function to the thread module: a function that lowers the priority of the program. I want my multithreaded program to use the maximum performance of the computer, but I don’t want to make the computer unusable for anything else. So I set the priority low so that all other programs get priority over it. That way it uses up the CPU cycles that are otherwise idle, rather than hogging the CPU itself.

//////////////////////////////////////////////////////////////////////////////////
//  ThreadSetProcessPriorityToBackground
//
//  Sets the process priority to Background mode
//////////////////////////////////////////////////////////////////////////////////
void
    ThreadSetProcessPriorityToBackground
    (
        void
    )
{
#if defined( _WIN32 )
    // Set priority of process to background
    SetPriorityClass( GetCurrentProcess(), PROCESS_MODE_BACKGROUND_BEGIN );
#elif defined( __APPLE__ )
    setpriority( PRIO_PROCESS, 0, 10 );
#endif
}

This function will lower the priority of the process. In Windows it uses PROCESS_MODE_BACKGROUND_BEGIN, which is only available since Vista. For XP and earlier you can use BELOW_NORMAL_PRIORITY_CLASS or IDLE_PRIORITY_CLASS. In OSX this applies “nice 10” to the process.

In the next article I will put this together with the MD5 and RC4 routines to create my Proof-of-Work example program.

 


Proof of Work

This article is about the concept of Proof of Work. I have written a program in C demonstrating the concept. It is downloadable here: ProofOfWork.zip

The term Proof of Work in computing is an interesting concept. It means producing some output that other people can verify would “have taken a certain amount of work” to have accomplished. The important part is that the verification is easy, while the actual work is hard.

There are many computational tasks that take a long time to perform but are easier to verify. For example, finding a prime number: it is harder to find a prime number than it is to verify that a given number is prime. A proof of work system could be built on this. The required work could be to find a random prime number greater than some large threshold. In order to “prove” you have performed the work, you present the prime number. I can verify that it is prime with less work than it took to find it, therefore “proving” that you probably spent a measurable amount of work finding it. This would not make a very good scheme: for one, the threshold would have to be changed all the time to avoid repeats, and verifying a number is prime is still relatively expensive and becomes more difficult as the numbers get larger.
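To make the work/verify asymmetry concrete, here is a toy sketch of that prime-based scheme. It uses trial division, so at toy sizes the gap between finding and verifying is small, which is exactly the weakness described above; the function names and sizes are illustrative only.

```c
#include <stdint.h>

// Trial-division primality test. Fine for a toy demonstration,
// far too slow for cryptographic sizes.
int IsPrime( uint64_t n )
{
    if( n < 2 ) return 0;
    for( uint64_t d = 2; d * d <= n; d++ )
    {
        if( n % d == 0 ) return 0;
    }
    return 1;
}

// The "work": scan upward from the threshold until a prime is found.
uint64_t FindPrimeAbove( uint64_t threshold )
{
    uint64_t n = threshold + 1;
    while( !IsPrime( n ) ) n++;
    return n;
}

// The "verification": a single primality check plus a comparison.
int VerifyPrimeProof( uint64_t threshold, uint64_t proof )
{
    return ( proof > threshold ) && IsPrime( proof );
}
```

The worker may test many candidates; the verifier always tests exactly one.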

A better proof of work system is one involving hash calculations. The nature of a good hash algorithm is that it is impractical to influence the output of the hash value. Basically any slight change in the input will cause a completely random-like hash value. A hash value can be considered a large integer value. For example MD5 produces 128 bit hashes, while SHA2 typically produces 256 or 512 bit hashes. For the remainder of the article I am going to use MD5, but note MD5 is NOT a good hash algorithm. However it is fast and still demonstrates my point. Though for an actual system, the hash would have to be replaced with a more secure hash such as SHA2.

A hash looks like a random number. If we hash any arbitrary data we will get a 128 bit number that appears to be random. If we change the original data a bit and rehash, we will get another random-looking 128 bit number. Assuming the hash is good (MD5 isn’t, but let’s ignore that for the moment) there is no way I can manipulate the data in order to produce a hash value that I want. Basically any change I make will cause a new random hash, pretty much like rolling a 2^128 sided die.

In the case of a regular six sided die, fair throwing of the die will produce a random value from 1 to 6. If the “desired” outcome of a die throw is a 6, then I have to keep throwing the die until I get a 6. I might get it on the first throw, or I might have to throw it 1000 times before I get a 6. But probability theory says there is a bit over 50% chance that I will get it within 4 throws. Taken another way, if we have thousands of people in a room who each have to roll a die until it shows a 6, and then report the number of throws they took, then we will see a nice statistical distribution of the number of throws required. About 1/6 of the people will throw a 6 on the first try. The remaining 5/6 of the people will have to throw it more times, out of which 1/6 will get it on their next try; this is 5/36 (or 13.9%) of all the people taking exactly two throws. The following table shows the proportion of people getting the throw on exactly n throws, and also lists the accumulation (ie people who got it within n throws).

Throws     % On this throw     % By this throw
1          16.7%               16.7%
2          13.9%               30.6%
3          11.6%               42.1%
4          9.6%                51.8%
5          8.0%                59.8%
6          6.7%                66.5%
7          5.6%                72.1%
8          4.7%                76.7%
9          3.9%                80.6%
10         3.2%                83.8%
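These values follow directly from the geometric distribution: the chance of first success on exactly throw n is (5/6)^(n-1) * (1/6), and within n throws is 1 - (5/6)^n. A small sketch that reproduces the table:

```c
// Probability that a fair die first shows a six on exactly throw n:
// P(exact) = (5/6)^(n-1) * (1/6)
double ProbExactThrow( int n )
{
    double p = 1.0 / 6.0;
    for( int i = 1; i < n; i++ ) p *= 5.0 / 6.0;
    return p;
}

// Probability that a six has appeared within the first n throws:
// P(within) = 1 - (5/6)^n
double ProbWithinThrows( int n )
{
    double miss = 1.0;
    for( int i = 0; i < n; i++ ) miss *= 5.0 / 6.0;
    return 1.0 - miss;
}
```

ProbExactThrow(2) gives the 5/36 figure from the text, and ProbWithinThrows(4) gives the 51.8% row of the table.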

As a rather contrived example, let’s make a proof of work system based on rolling a die and trying to get a six. The “work” is rolling the die until it shows a six. The “verification” is looking at the number on the die and seeing that it is a six. If someone presents a die showing a 6 I can verify it easily without much work, and I can also make some probability-based assumptions about how much work was required to get it. Although you may have only had to roll it once, it is more likely that you had to roll it more times. There is an almost 70% chance that it took you at least three throws to achieve it. Obviously the die rolling scheme is not particularly good, as the verification system is rather flawed: unless I actually watched you throw the die, I can’t tell that you didn’t just present me a die with the six side up without throwing it at all. However, in the case of our 2^128 sided die (aka the hash of a random value) I don’t need to watch the generation, as I can simply verify the hash from the original input.

In the case of the hash proof of work we can make a system similar to the die roll, but instead of rolling a die, a random value is hashed and the result of the hash is used as the die value. Instead of just six outcomes there are 2^128 outcomes (that is 340282366920938463463374607431768211456, which is a lot). It is rather unreasonable to specify a single value as the required result, as it is incredibly unlikely that anyone would ever get it within the life of the universe. However we can set an arbitrary threshold. We specify a number that the hash must be greater than. For example we could set a threshold where the first 40 bits must be all 1s (or 10 Fs when looking at it in hexadecimal). The work required is to keep on hashing random data until a hash is produced that is larger than this threshold (ie starts with 10 Fs). I can verify with one hash operation that the data you claim achieves this hash does in fact do so. I can also be relatively confident that it took a while to generate, as a great deal of hashes would have had to be generated in order to find one that reached the threshold.

50% of all hashes start with the first bit being 1. 25% of all hashes start with the first 2 bits being 1, and 12.5% have the first 3. Just like the calculation for the six sided die, we can do the same for the hash outputs. Of course you could have come up with a hash that reached the threshold on the very first try, but the odds are against you. If you produce some data that has a hash value with the first 40 bits all 1s, then I can feel reasonably confident that it probably took you hundreds of billions of hash calculations to find it. For example the MD5 hash of the string Xnpbsz2zBw is fffffffff62bf72e14b9ee1b8b923835. This has the first 36 bits all 1s. It took my computer over 10 billion attempts before reaching that threshold, however it can now be verified by anyone else with a single hash operation.
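Verification only requires computing the hash once and then counting the leading run of identical bits. The counting step can be sketched like this, working on the hexadecimal string form of the hash (the function names are made up for the example):

```c
#include <string.h>

// Convert one hex digit to its 4-bit value, or -1 if invalid.
static int HexDigitValue( char c )
{
    if( c >= '0' && c <= '9' ) return c - '0';
    if( c >= 'a' && c <= 'f' ) return c - 'a' + 10;
    if( c >= 'A' && c <= 'F' ) return c - 'A' + 10;
    return -1;
}

// Count how many leading bits of the hash (given as a hex string)
// match its first bit, ie the length of the run of 0s or 1s.
int CountLeadingRun( const char* hexHash )
{
    int firstBit = ( HexDigitValue( hexHash[0] ) >> 3 ) & 1;
    int count = 0;

    for( size_t i = 0; i < strlen( hexHash ); i++ )
    {
        int value = HexDigitValue( hexHash[i] );
        for( int bit = 3; bit >= 0; bit-- )
        {
            if( (( value >> bit ) & 1) != firstBit )
            {
                return count;
            }
            count += 1;
        }
    }
    return count;
}
```

Applied to the hash above it returns 36, matching the nine leading Fs.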

The Bitcoin system uses a similar system of proof of work in order to control the generation of bitcoins. Bitcoin uses 256 bit SHA2 for its hash function, and the threshold is a value that the hash must be lower than. In other words starting with several 0s. As of the time of writing the current bitcoin threshold requires a hash with at least the first 55 bits zero.

Of course just requiring that some random data produces a hash meeting the threshold is not good enough for a proof of work system, as there is nothing that proves when the data was generated, or whether it has been used before. In the case of bitcoin, it is the current transaction block that must be hashed along with some random data. This means that once that block has been done, there is no use trying to reuse that value.

There is a system called HashCash which was devised as a method to slow spam. The concept was that each email sent would require a hashcash token, which proves a certain amount of computer work was performed. The threshold was low enough that it wouldn’t disrupt normal users, but it would be a hindrance to spammers attempting to send millions of emails at once. Anti-spam filters could verify the hashcash was good with just one single SHA1 calculation.

The whole concept of Proof-of-work fascinates me, and so I wrote a program to demonstrate it. This program requires the components I have talked about in the previous articles. Namely MD5 for the Hashing, RC4 for Random number generation, and Threading in order to get maximum performance out of a multi core CPU.

ProofOfWork.zip contains the full source code and also a Windows x64 executable and an OSX x64 executable for generating proof of work hashes.

The program is simple to operate

Syntax:
    ProofOfWork: <string> <numCharsAdded> <numThreads>

The first parameter is the base string. The second parameter specifies the number of characters that will be added to the base string to feed into the hash function. The third parameter specifies how many threads will be launched to perform the calculations. To get maximum CPU usage, the number of threads should equal the number of CPU cores on the system.

Each thread seeds an RC4 output stream with some time values and thread addresses. The RC4 streams are used to produce random values that get appended to the string. The values are restricted to a 55 character alphabet, containing letters and numbers without certain confusing characters such as 0 and O, or 1 and l. The program maintains two thresholds, a current lowest hash and a current highest hash. Every time a hash is generated that reaches a threshold, the hash is printed and that becomes the new threshold.
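The article does not show how the random bytes are mapped onto the 55 character alphabet, so the following is only a guess at one reasonable approach: rejection sampling, which avoids the modulo bias you would get by reducing raw bytes directly. The alphabet string itself is hypothetical, not necessarily the program's actual alphabet.

```c
#include <stdint.h>
#include <stddef.h>

// A hypothetical 55-character alphabet: letters and digits with
// easily-confused characters (0, O, 1, l, I) removed.
static const char kAlphabet[] =
    "23456789abcdefghijkmnpqrstuvwxyzABCDEFGHJKMNPQRSTUVWXYZ";

#define ALPHABET_SIZE   ( sizeof(kAlphabet) - 1 )

// Map a stream of random bytes onto the alphabet. Rejecting bytes
// >= 220 (the largest multiple of 55 below 256) keeps every
// alphabet character equally likely. Returns the number of
// characters produced.
size_t FillRandomSuffix(
    const uint8_t*  randomBytes,
    size_t          numBytes,
    char*           out,
    size_t          outLen )
{
    size_t produced = 0;
    for( size_t i = 0; i < numBytes && produced < outLen; i++ )
    {
        if( randomBytes[i] < 220 )
        {
            out[produced++] = kAlphabet[ randomBytes[i] % ALPHABET_SIZE ];
        }
    }
    return produced;
}
```

Roughly 36 out of every 256 random bytes are discarded, a small price for an unbiased mapping.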

Here is some sample output from the program

>ProofOfWork waterjuice.org: 8 8
waterjuice.org:vWJz6CBN  000000e099ef29e46a15bf1e9734ebc3
  Is Smaller (24.19)  0.036 GHashes  in 00:00:02 (h:m:s)
waterjuice.org:eE2yDHW6  0000009f3f1f7cc5b2038bbfc9ca5dec
  Is Smaller (24.68)  0.040 GHashes  in 00:00:02 (h:m:s)
waterjuice.org:cz4J7ams  ffffff1af55a568e5287ca52d5d007af
  Is Larger  (24.16)  0.054 GHashes  in 00:00:03 (h:m:s)
waterjuice.org:24874LFj  fffffffe5408c859e30a1f5e9b6b8ca4
  Is Larger  (31.26)  0.062 GHashes  in 00:00:04 (h:m:s)
waterjuice.org:dczbepJq  0000001a57ec47e8433f0042d1dedad7
  Is Smaller (27.28)  0.088 GHashes  in 00:00:06 (h:m:s)
waterjuice.org:L7tP3Cdk  000000096a5572842de52782e4b78ff1 
  Is Smaller (28.76)  0.420 GHashes  in 00:00:42 (h:m:s)
waterjuice.org:fmsWH6fL  000000030ae1ecdc25feafeb2cc2de53
  Is Smaller (30.39)  0.642 GHashes  in 00:01:05 (h:m:s)
waterjuice.org:BkB88FY6  00000002517de480c6b9c095027d18eb
  Is Smaller (30.79)  1.265 GHashes  in 00:02:11 (h:m:s)
waterjuice.org:H42UeFbU  ffffffffab3ec2782e3918082b39f579
  Is Larger  (33.59)  1.523 GHashes  in 00:02:38 (h:m:s)

Each line displays the text used to create the hash, followed by the MD5 hash. It indicates whether the hash was smaller or larger than the current threshold. It displays the number of GHashes (billions of hash calculations) performed before the hash was found, along with the time taken. The program quickly puts out the first results, but as the thresholds get harder the time between results grows much longer.

The value in brackets is a measure of difficulty. The integer part of that value indicates how many of the first bits are all 0s or all 1s.

I have also experimented with various other thresholds besides just having a “large” or “small” hash value. For example a hash with the fewest or greatest number of 1 bits (ie the hamming weight of the hash). Or a hash with only numeral digits or only letter (a-f) digits.

I found the whole exercise a lot of fun, and maybe someone else will too. I have put my program and all its source code in the public domain. Feel free to use it how you wish and experiment with it.

Download source and binaries: ProofOfWork.zip

This is free and unencumbered software released into the public domain.