Not that many years ago most desktop computers had single core processors. This meant the processor could only execute one thread of instructions at a time. Of course operating systems have been multi-threaded for many years to allow users to run more than one program at once. A lot of programs written for Windows were multi-threaded, as it is a convenient way to structure interactive GUI applications. However if you were writing a program to perform long calculations, you could quite reasonably write it as a single threaded application. After all, you would get close to 100% CPU usage out of your single thread.
Most computers these days have multi-core processors, or multiple processors. This means that if you run the same single threaded application that used to get 100% of the CPU on a uniprocessor system, it will now only get a fraction of the total computing power. On an 8 core system a single thread can only get 12.5% of the CPU. In order to get full performance you need as many threads as there are cores or processors. You will also need a task that can be broken up into parallel components. For example calculating the MD5 of a single file is a serial task: you cannot calculate the next part of the hash until you have performed the calculation of the previous part. If instead your task was to calculate the MD5 sum of each block of the file, then this can easily be performed in parallel, as each block calculation is independent of the others.
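As a sketch of the difference, here is roughly what the per-block job looks like. Md5HashBuffer and BLOCK_SIZE are hypothetical placeholders rather than anything from the module described below; the point is only that each loop iteration depends solely on its own block, so the iterations could be handed out to separate threads.

#include <stdint.h>
#include <stddef.h>

#define BLOCK_SIZE  4096

/* Hypothetical hashing routine: fills Digest with the MD5 of Data. */
void Md5HashBuffer( const void* Data, size_t Length, uint8_t Digest[16] );

/* Hash every block of a buffer. Each iteration is independent of the others,
   so the iterations could be split across as many threads as there are cores. */
static void HashAllBlocks( const uint8_t* buffer, size_t size, uint8_t (*digests)[16] )
{
    size_t  numBlocks = ( size + BLOCK_SIZE - 1 ) / BLOCK_SIZE;
    size_t  block;

    for( block = 0; block < numBlocks; block++ )
    {
        size_t offset = block * BLOCK_SIZE;
        size_t length = ( size - offset < BLOCK_SIZE ) ? ( size - offset ) : BLOCK_SIZE;

        Md5HashBuffer( buffer + offset, length, digests[block] );
    }
}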
If two tasks are entirely independent of each other they can be performed in separate threads with no interaction between them. In general, though, tasks will be related and will at some point need to interact with each other. Some data may need to be shared between the threads. At this point synchronisation becomes important. With a single threaded program you know only one thing is changing any data at any one time. With multiple threads working on the same data set, one thread can corrupt the results of another thread unless care is taken to keep them synchronised.
The simplest form of thread synchronisation is the spinlock. A spinlock is a shared data value that can be compared and updated with an atomic instruction. An atomic instruction is one that has no intermediate states that another thread could see. The common atomic instruction used to implement spinlocks is compare-and-exchange. The x86 family of processors has the cmpxchg instruction, which when used with the lock prefix becomes an atomic instruction. This means that while one thread is performing this instruction, no other thread can see or change the value of the spinlock.
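Conceptually, compare-and-exchange behaves like the C function below, except that the hardware guarantees the whole read-compare-write happens as one indivisible step. This is only a sketch of the logic; written as plain C it would not itself be atomic or thread safe.

/* Logical behaviour of compare-and-exchange. The real instruction performs
   all of this atomically; as ordinary C it would NOT be thread safe. */
long CompareAndExchange( volatile long* value, long comparand, long newValue )
{
    long original = *value;            /* read the current value            */

    if( original == comparand )        /* only update if it matches         */
    {
        *value = newValue;             /* write the new value               */
    }

    return original;                   /* caller can tell whether it worked */
}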
C does not provide any threading or synchronisation facilities as part of the language. On Unix-based systems there is pthreads, which implements threads and synchronisation methods, while Windows has its own system. I wrote a very simple module that provides basic threads and spinlocks and works on both Windows and OSX.
The spinlock is defined as:

typedef volatile long THREADSPINLOCK;
The volatile keyword instructs the C compiler not to make any optimisation assumptions about the state of the variable, and to re-read it from memory every time it is accessed.
A spinlock is created and initialised simply by declaring the value and assigning it 0. A spinlock will usually be a global variable, as it has to be visible to more than one thread. I have one function for acquiring the spinlock.
////////////////////////////////////////////////////////////////////////////////
//  ThreadSpinLockAcquire
//
//  Acquires the thread spinlock. This will wait (spin) until the lock is
//  acquired. Use this to protect a resource from multi thread access. Do not
//  hold the spinlock for long periods of time.
////////////////////////////////////////////////////////////////////////////////
void ThreadSpinLockAcquire( THREADSPINLOCK* SpinLockPtr )
{
#if defined( _WIN32 )
    while( InterlockedCompareExchange( SpinLockPtr, 1, 0 ) )
    {
        Sleep( 0 );
    }
#elif defined( __APPLE__ )
    while( !__sync_bool_compare_and_swap( SpinLockPtr, 0, 1 ) )
    {
        pthread_yield_np( );
    }
#endif
}
In Windows I use the InterlockedCompareExchange function, while in OSX I use __sync_bool_compare_and_swap. These functions perform the same task, though inconveniently they take their parameters in a different order and return different things: InterlockedCompareExchange returns the original value of the destination, while __sync_bool_compare_and_swap returns whether the swap succeeded, which is why the two loop conditions differ.
The spinlock acquire function runs in a loop. In an atomic operation it compares the value of the spinlock with 0; if it is 0 it changes the value to 1, otherwise it leaves it alone. If the value was 0, it is now 1 and the loop exits. If the value was 1 to start with, the loop yields the thread's execution and then tries again. It keeps trying until the value becomes 0, at which point it “acquires” the lock by setting it to 1 and leaving the loop.
Whenever a thread returns from the ThreadSpinLockAcquire function it is the only thread that currently possesses the spinlock. All other threads attempting to acquire it will spin until it is released.
The release function simply sets the value back to zero.
////////////////////////////////////////////////////////////////////////////////
//  ThreadSpinLockRelease
//
//  Releases the thread spinlock. Do not call this unless you have acquired the
//  spinlock.
////////////////////////////////////////////////////////////////////////////////
void ThreadSpinLockRelease( THREADSPINLOCK* SpinLockPtr )
{
    *SpinLockPtr = 0;
}
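Putting the two calls together, protecting a piece of shared data might look like the sketch below. gCounterLock and gCounter are illustrative names rather than part of the module.

static THREADSPINLOCK gCounterLock = 0;     /* 0 = unlocked, 1 = held */
static long           gCounter     = 0;     /* shared between threads */

void IncrementSharedCounter( void )
{
    ThreadSpinLockAcquire( &gCounterLock );  /* only one thread past this point */
    gCounter += 1;                           /* safely update the shared data   */
    ThreadSpinLockRelease( &gCounterLock );  /* let the next thread in          */
}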
The spinlock mechanism is very simple to implement, and unlike most synchronisation mechanisms it does not require any initialisation code. It is however inefficient. While other threads are waiting to acquire the spinlock they “spin”, meaning they are constantly running and checking the value. To improve performance we put a thread yield instruction in the loop. When there are multiple threads waiting on a spinlock, it is just luck which thread will get it once the lock has been released. OS-provided synchronisation methods will generally have a better scheduling system to determine who gets the lock next, and the waiting threads don’t get scheduled in until they are ready to receive the lock. So for highly contended locks, a system provided mechanism such as a mutex should be used. But the spinlock is simpler and, if used correctly, can be sufficient.
The higher the contention of a lock the more inefficient it becomes. If you have 8 threads spending most of their time waiting on a lock, then you haven’t gained much with multiprocessing. Ideally a lock is rarely contended, which means thread execution is not interrupted. A spinlock should be held for the shortest amount of time possible, so perform all the operations you can before acquiring it. For example, if you were inserting an element into a linked list, it is best to allocate the new element first and fill in all its entries, then acquire the spinlock, attach the element to the global linked list, and release. Don’t perform the allocation and the setting of the entries while holding the spinlock, as that holds up other threads unnecessarily.
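As a sketch of that ordering, the Node type, gListHead and gListLock below are made-up names for illustration; only the two spinlock calls come from the module.

#include <stdlib.h>
#include <string.h>

typedef struct Node
{
    struct Node* Next;
    char         Name[ 64 ];
    int          Value;
} Node;

static Node*          gListHead = NULL;
static THREADSPINLOCK gListLock = 0;

int AddListEntry( const char* name, int value )
{
    /* Do all the slow work (allocation, filling in fields) before taking the lock. */
    Node* node = malloc( sizeof(Node) );
    if( NULL == node )
    {
        return 0;
    }
    strncpy( node->Name, name, sizeof(node->Name) - 1 );
    node->Name[ sizeof(node->Name) - 1 ] = '\0';
    node->Value = value;

    /* Hold the lock only for the pointer swap at the head of the list. */
    ThreadSpinLockAcquire( &gListLock );
    node->Next = gListHead;
    gListHead  = node;
    ThreadSpinLockRelease( &gListLock );

    return 1;
}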
A C program starts (usually in its main function) running one thread. In order to become multithreaded you have to create more threads. This is platform specific: Unix-like platforms use pthreads, while Windows uses its own API (CreateThread). There are a lot of functions relating to threads. Once you create one, you can monitor it, wait for it to finish and so on. For my task I simply want to create threads and let them run until the program ends. I wrote a simple wrapper around Windows’ CreateThread and pthreads’ pthread_create.
////////////////////////////////////////////////////////////////////////////////
//  ThreadLaunchNewThread
//
//  Launches a new thread. Passes the pointer Context through to the new thread
//  function. Returns zero on failure, or 1 on success.
////////////////////////////////////////////////////////////////////////////////
int ThreadLaunchNewThread( ThreadStartFunction StartFunction, void* Context )
{
#if defined( _WIN32 )
    HANDLE                          threadHandle = 0;
#elif defined( __APPLE__ )
    pthread_t                       threadHandle = 0;
    pthread_attr_t                  attributes;
    int                             errors;
#endif
    uint32_t                        success = 0;
    InternalThreadStartParameters   startParams = {0};

    startParams.StartFunction = StartFunction;
    startParams.Context = Context;
    ThreadSpinLockAcquire( &startParams.StartSpinLock );

#if defined( _WIN32 )
    threadHandle = CreateThread( NULL, 0, InternalThreadStart, &startParams, 0, NULL );
    if( NULL != threadHandle )
#elif defined( __APPLE__ )
    pthread_attr_init( &attributes );
    errors = pthread_create( &threadHandle, &attributes, InternalThreadStart, &startParams );
    if( 0 == errors )
#endif
    {
        // The thread will release the spinlock, so attempt to reacquire it.
        // This will mean we will be in sync.
        ThreadSpinLockAcquire( &startParams.StartSpinLock );

        // Close the handle as we do not need it anymore. Thread has been created.
#if defined( _WIN32 )
        CloseHandle( threadHandle );
#elif defined( __APPLE__ )
        pthread_attr_destroy( &attributes );
#endif
        success = 1;
    }
    else
    {
        // Failed to create thread
        success = 0;
    }

    return success;
}

////////////////////////////////////////////////////////////////////////////////
//  InternalThreadStart
//
//  This is the function stub that is called by the Windows CreateThread or
//  pthread_create function. This will launch our thread.
////////////////////////////////////////////////////////////////////////////////
#if defined( _WIN32 )
static DWORD WINAPI InternalThreadStart( LPVOID Parameter )
#elif defined( __APPLE__ )
static void* InternalThreadStart( void* Parameter )
#endif
{
    InternalThreadStartParameters* parameters = (InternalThreadStartParameters*)Parameter;
    ThreadStartFunction            startFunction = 0;
    void*                          context = 0;

    startFunction = parameters->StartFunction;
    context = parameters->Context;

    // Release the spinlock. After this we can no longer access parameters.
    ThreadSpinLockRelease( &parameters->StartSpinLock );

    // Call start function
    startFunction( context );

    // Thread ends
    return 0;
}
This function ThreadLaunchNewThread takes two parameters. The first is the thread function you want the thread to start with. The second is an arbitrary pointer that will be passed to the thread function. The thread function has a simple prototype:
typedef void (*ThreadStartFunction) ( void* Context );
The thread will run until this function returns.
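Using it might look something like the sketch below. WorkerThread, NUM_THREADS and the indexes array are illustrative; only ThreadLaunchNewThread and ThreadStartFunction come from the module, whose header is assumed to be included.

#include <stdio.h>

/* The thread module's header is assumed to be included here so that
   ThreadLaunchNewThread and ThreadStartFunction are declared. */

#define NUM_THREADS 8

static void WorkerThread( void* Context )
{
    int threadIndex = *(int*)Context;

    /* ... perform this thread's share of the work here ... */
    printf( "Thread %d running\n", threadIndex );
}

int main( void )
{
    static int indexes[ NUM_THREADS ];
    int        i;

    for( i = 0; i < NUM_THREADS; i++ )
    {
        indexes[ i ] = i;
        if( !ThreadLaunchNewThread( WorkerThread, &indexes[ i ] ) )
        {
            fprintf( stderr, "Failed to launch thread %d\n", i );
        }
    }

    /* ... the main thread does its own work (or waits) while the workers run;
       the workers die when the program ends ... */
    return 0;
}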
I now have all the components I need for my article on Proof-of-Work. I added just one more function to the thread module: a function that lowers the priority of the program. I want my multithreaded program to use the maximum performance of the computer. I don’t, however, want to make the computer unusable for anything else. So I set the priority low enough that all other programs get priority over it. That way it only uses the CPU cycles that would otherwise go idle, rather than hogging the CPU for itself.
////////////////////////////////////////////////////////////////////////////////
//  ThreadSetProcessPriorityToBackground
//
//  Sets the process priority to Background mode
////////////////////////////////////////////////////////////////////////////////
void ThreadSetProcessPriorityToBackground( void )
{
#if defined( _WIN32 )
    // Set priority of process to background
    SetPriorityClass( GetCurrentProcess(), PROCESS_MODE_BACKGROUND_BEGIN );
#elif defined( __APPLE__ )
    setpriority( PRIO_PROCESS, 0, 10 );
#endif
}
This function lowers the priority of the process. In Windows it uses PROCESS_MODE_BACKGROUND_BEGIN, which is only available since Vista. For XP and earlier you can use BELOW_NORMAL_PRIORITY_CLASS or IDLE_PRIORITY_CLASS. In OSX this applies “nice 10” to the process.
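If support for XP were needed, one possibility (a sketch, not part of the module as written) would be to try background mode first and fall back to a lower priority class when that call fails, inside the Windows branch of ThreadSetProcessPriorityToBackground:

#if defined( _WIN32 )
    // Try background mode (Vista and later). If the call fails, fall back to
    // a low priority class that older versions of Windows understand.
    if( !SetPriorityClass( GetCurrentProcess(), PROCESS_MODE_BACKGROUND_BEGIN ) )
    {
        SetPriorityClass( GetCurrentProcess(), BELOW_NORMAL_PRIORITY_CLASS );
    }
#endif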
In the next article I will put this together with the MD5 and RC4 routines to create my Proof-of-Work example program.