This article is about the concept of Proof of Work. I have written a program in C demonstrating the concept. It is downloadable here: ProofOfWork.zip
The term Proof of Work in computing is an interesting concept. It means producing some output that other people can verify would have taken a certain amount of work to produce. The important part is that the verification is easy, while the actual work is hard.
There are many computational tasks that take a long time to perform but whose results are easy to verify. Finding a prime number is one example: it is harder to find a prime number than it is to verify that a given number is prime, so a proof of work system could be built on this. The required work could be to find a random prime number greater than some large threshold. To “prove” you have performed the work, you present the prime number. I can verify that it is prime with far less work than it took to find it, which shows that you probably spent a measurable amount of effort finding it. This would not make a very good scheme: for one thing, the threshold would have to be changed constantly to avoid repeats, and verifying that a number is prime is still relatively expensive and becomes more difficult as the numbers get larger.
A better proof of work system is one involving hash calculations. The nature of a good hash algorithm is that it is impractical to influence the output value: any slight change in the input produces a completely random-looking hash. A hash value can be treated as a large integer. For example MD5 produces 128 bit hashes, while SHA2 typically produces 256 or 512 bit hashes. For the remainder of the article I am going to use MD5, but note MD5 is NOT a good hash algorithm. However it is fast and still demonstrates my point. For an actual system, the hash would have to be replaced with a more secure one such as SHA2.
A hash looks like a random number. If we hash any arbitrary data we will get a 128 bit number that appears to be random. If we change the original data slightly and rehash, we will get another random-looking 128 bit number. Assuming the hash is good (MD5 isn’t, but let’s ignore that for the moment) there is no way I can manipulate the data to produce a hash value that I want. Any change I make will cause a new random hash. It is pretty much like rolling a 2^128 sided die.
In the case of a regular six sided die, fair throwing will produce a random value from 1 to 6. If the “desired” outcome of a throw is a 6, then I have to keep throwing the die until I get a 6. I might get it on the first throw, or I might have to throw 1000 times before I get one. Probability theory says there is a bit over a 50% chance that I will get it within four throws. Put another way, if we have thousands of people in a room who each roll a die until it shows a 6 and then report the number of throws they needed, we will see a nice statistical distribution. About 1/6 of the people will throw a 6 on the first try. Of the remaining 5/6, another 1/6 will get it on their next try, which is 5/36 (or 13.9%) of all the people taking exactly two throws. The following table shows the proportion of people getting a 6 on exactly the nth throw, and also the accumulation (ie people who got it within n throws).
Throws % On this throw % By this throw
1 16.7% 16.7%
2 13.9% 30.6%
3 11.6% 42.1%
4 9.6% 51.8%
5 8.0% 59.8%
6 6.7% 66.5%
7 5.6% 72.1%
8 4.7% 76.7%
9 3.9% 80.6%
10 3.2% 83.8%
As a rather contrived example, let’s make a proof of work system based on rolling a die and trying to get a six. The “work” is rolling the die until it shows a six. The “verification” is looking at the die and seeing that it is a six. If someone presents a die showing a six I can verify it easily, and I can also make some probability-based assumptions about how much work was required to get it. Although you may have only had to roll once, it is more likely that you had to roll several times. There is an almost 70% chance that it took you at least three throws. Obviously the die rolling scheme is not particularly good, as the verification is flawed: unless I actually watched you throw the die, I can’t tell that you didn’t just present me a die with the six side up without throwing it at all. However in the case of our 2^128 sided die (aka the hash of a random value) I don’t need to watch the generation, as I can simply verify the result from the original input.
In the case of the hash proof of work we can make a system similar to the die roll, but instead of rolling a die, a random value is hashed and the result of the hash is used as the die value. Instead of just six outcomes there are 2^128 outcomes (that is 340282366920938463463374607431768211456, which is a lot). It is unreasonable to specify a single value as the required result, as it is incredibly unlikely that anyone would ever hit it within the life of the universe. However we can set an arbitrary threshold: we specify a number that the hash must be greater than. For example we could require that the first 40 bits be all 1s (the first 10 Fs when written in hexadecimal). The work is to keep hashing random data until a hash is produced that is larger than this threshold (ie starts with 10 Fs). I can verify with one hash operation that the data you claim achieves this does in fact do so. I can also be relatively confident that it took a while to generate, as a great many hashes would have had to be computed in order to find one that reached the threshold.
50% of all hashes start with the first bit being 1, 25% start with the first 2 bits being 1, and 12.5% with the first 3. Just like the die roll calculation on the six sided die, we can do the same for the hash outputs. Of course you could come up with a hash that reached the threshold on the very first try, but the odds are against you. If you produce some data whose hash has its first 40 bits all 1s, then I can feel reasonably confident that it probably took you hundreds of billions of hash calculations to find it (the expected count is 2^40, about 1.1 trillion). For example the MD5 hash of the string Xnpbsz2zBw is fffffffff62bf72e14b9ee1b8b923835. This has the first 36 bits all 1s. It took my computer over 10 billion attempts to reach that threshold, yet anyone else can now verify it with a single hash operation.
The Bitcoin system uses a similar proof of work scheme to control the generation of bitcoins. Bitcoin uses 256 bit SHA2 for its hash function, and the threshold is a value that the hash must be lower than, in other words a hash starting with a run of 0 bits. As of the time of writing the current bitcoin threshold requires a hash with at least the first 55 bits zero.
Of course merely requiring that some random data hashes past the threshold is not good enough for a proof of work system, as nothing proves when the data was generated, or whether it has been used before. In the case of bitcoin, it is the current transaction block that must be hashed along with some random data. This means that once that block has been done, there is no use trying to reuse the value.
There is a system called HashCash which was devised as a method to slow down spam. The concept was that each email sent would require a hashcash token proving a certain amount of computer work had been performed. The threshold was low enough that it wouldn’t disrupt normal users, but it would be a hindrance to spammers attempting to send millions of emails at once. Anti-spam filters could verify the hashcash token with a single SHA1 calculation.
The whole concept of Proof-of-work fascinates me, and so I wrote a program to demonstrate it. This program requires the components I have talked about in the previous articles: namely MD5 for hashing, RC4 for random number generation, and threading to get maximum performance out of a multi core CPU.
ProofOfWork.zip contains the full source code and also a Windows x64 executable and an OSX x64 executable for generating proof of work hashes.
The program is simple to operate:
ProofOfWork: <string> <numCharsAdded> <numThreads>
The first parameter is the base string. The second specifies the number of characters that will be added to the base string before it is fed into the hash function. The third specifies how many threads will be launched to perform the calculations. To get maximum CPU usage, the number of threads should equal the number of CPU cores on the system.
Each thread seeds an RC4 output stream with some time values and thread addresses. The RC4 streams produce random values that get appended to the string. The values are restricted to a 55 character alphabet containing letters and numbers, minus certain confusing characters such as 0 and o, or 1 and l. The program maintains two thresholds, a current lowest hash and a current highest hash. Every time a hash is generated that passes a threshold, the hash is printed and becomes the new threshold.
Here is some sample output from the program:
>ProofOfWork waterjuice.org: 8 8
Is Smaller (24.19) 0.036 GHashes in 00:00:02 (h:m:s)
Is Smaller (24.68) 0.040 GHashes in 00:00:02 (h:m:s)
Is Larger (24.16) 0.054 GHashes in 00:00:03 (h:m:s)
Is Larger (31.26) 0.062 GHashes in 00:00:04 (h:m:s)
Is Smaller (27.28) 0.088 GHashes in 00:00:06 (h:m:s)
Is Smaller (28.76) 0.420 GHashes in 00:00:42 (h:m:s)
Is Smaller (30.39) 0.642 GHashes in 00:01:05 (h:m:s)
Is Smaller (30.79) 1.265 GHashes in 00:02:11 (h:m:s)
Is Larger (33.59) 1.523 GHashes in 00:02:38 (h:m:s)
Each line displays the text used to create the hash followed by the MD5 hash. It indicates whether the hash was smaller or larger than the current threshold, and shows the number of GHashes (billions of hash calculations) performed before the hash was found, along with the time taken. The program quickly puts out the first results, but as the thresholds get harder the gaps between results grow much longer.
The value in brackets is a measure of difficulty. The integer part indicates how many of the leading bits are all 0s or all 1s.
I have also experimented with various other thresholds besides just a “large” or “small” hash value. For example a hash with the fewest or greatest number of 1 bits (ie the Hamming weight of the hash), or a hash with only numeral digits, or only letter (a-f) digits.
I found the whole exercise a lot of fun, and maybe someone else will too. I have put my program and all its source code in the public domain. Feel free to use it how you wish and experiment with it further.
Download source and binaries: ProofOfWork.zip
This is free and unencumbered software released into the public domain.