Creating a toy cryptocurrency - part 2c the flaw and how to fix it
Hi there, welcome to the latest in a series of posts on building your own toy cryptocurrency. If you haven't been following along, please see the first post here:
https://steemit.com/cryptocurrency/@garethnelsonuk/creating-a-toy-cryptocurrency-part-1
Some apologies
I'm posting this just to give people something to read until I have time to do this properly, so the below article is unfinished but contains plenty of theory for you to digest. While i'm busy working on some cool new features on steempower.org this series of mine may take a backseat, but I do intend to finish it off so keep on reading!
Previously on steemit....
In the last article we built the start of a real P2P node and discussed message types and how to implement the basic node class in python. I also jokingly left a "homework assignment" which user @cryptomental answered in the comments - so congratulations to @cryptomental are in order.
In this article we'll discuss a bit of theory first about the much-anticipated flaw in the design of our simple filesharing node and then implement a node that does not have this flaw. This will bring us back to the cryptocurrency space in short order, and you will see how later in the article. Of course, if there's big demand for it I can post code for the flawed node - let me know in the comments!
So let's get to it!
The flaw in the filesharer
Did any of you spot the flaw in the basic design of our filesharing protocol? It's quite obvious if you ponder it and consider scaling the network up.
Essentially to find a file, our protocol broadcasts a message that is rebroadcast across the entire network without even including a TTL (Time To Live) - in certain network setups this broadcast could keep being spammed forever!
Naturally, this is a bad thing - we really want to avoid spamming the network.
Those of you familiar with the design of ethernet LANs already know part of the obvious solution - imagine each P2P node as being like a hub. An ethernet frame comes into one port and gets sent out of all the other ports. A device like this is very nostalgic to me as it brings back memories of multiplayer games of Doom and Unreal tournament.
The problem with hubs of course was that little "COL" LED - collisions. It was slow and inefficient, and we get a similar problem when we just broadcast everything in our P2P network.
In order to solve the ethernet hub issue, we invented switches - switches take an ethernet frame in one port, lookup in an internal table where it needs to go and sends it out the correct port so it goes ONLY to the intended destination.
We need to do the same in our P2P network - we need to send messages only to the node that has the data we seek and to no others. Unlike an ethernet switch we don't have a nice little table though - we only know our immediate peers and we don't know exactly what those peers store.
The solution involves some very sexy mathematics and some sexier still cryptography. The really sexy bit is the crypto stuff here will come in really handy for the next step in building our cryptocurrency.
Let's build a Distributed Hash Table, oh yeah!
Sexy mathematics
Ok, so our problem is finding which node has a particular file (or block) knowing only the IP addresses and port numbers of our own peers.
First, let's assume the files were stored by other nodes in the first place (otherwise it gets a lot more complex and involves distributed search - something we don't need for our coin).
When I mention hashes, a lot of people might instinctively think "hash the filename and use the result to find which peer to connect to" - which would work if the network stayed completely perfectly consistent forever and no nodes ever join or leave.
What we need to do is to always connect back to the same nodes for the same content if they're online - and if they're not online then connect to nodes that are close enough and have replicated the content. We also need a way to store our own content in the first place and ensure we can get it back again later.
So as input to our sexy hash function we have a filename and a list of peers we know about and for output we want a list of peers that we can talk to about that file.
Once we know which peers to talk to, we just need to use the normal messages to store and retrieve our data. If other nodes join the network later that match our filename, we want those nodes to catch up and replicate the data too, but we're lazy - let's make those nodes only replicate data that's actually requested and since nobody has infinite disk space each node should be able to allocate a fixed amount of space and get rid of data that isn't asked for in a long time in order to make space for data actually used.
So our function is basically SexyHash(filename,nodes) and returns a list of nodes, or something like this:
def sexy_hash(filename,known_peers):
do_cool_stuff_here
return peers_we_can_talk_to_about_that_filename
That's one cool function we need to write eh?
Let's talk about how it works and then i'll let you in on the secret and show you the code
The sexy hash function
The first trick here is to pick a hash function that we like - any decent modern cryptographic hash function will work as long as it spits out a sequence of bytes. Since we're probably all familiar with bitcoin and since it's included in the standard python hashlib module i've chosen sha256.
Next, we decide how many replicas of the information we want to store. I've chosen 3 - there's no reason for this, 3 just seemed like a cool number.
Given this, our task is to pick a subset out of the known peers and to always choose that same subset given the same input even if the list of known peers changes.
A first assumption we'll use is that the IP address of a particular node remains constant - this is not true in the real world due to dynamic IP addresses, but it works for our toy implementation (and we can easily test this on a LAN or by using different virtual machines or just binding multiple IPs on a Linux box). We assume the port number is changing all the time even for the same node, since in our testing environment it does change constantly.
Another assumption is IPv4 addresses - though IPv6 can also be used with some changes - that's a task I leave to you the reader.
So now we have those assumptions out of the way, let's make just one more: we have at all times at least 4 peers and their IPs are all different.
Let's get on with it then!
SHA256 will, given the same input, always give the same output but if we change even a single bit of the input, the output should change wildly in something called the cascade effect. These are both highly desirable traits of cryptographic hash functions, and they'll be most useful for us now as we implement our basic algorithm finally:
Let's call our filename coolstuff.txt
First, we take the SHA256 hash of the filename, yielding a sequence of 32 bytes:
b0cae35d96e2335bd6580d7130893eab5b3ac4c4e500c9e2b5a9270eccf2a1db (this is in hex of course)
Next, we take each of our peer's IP addresses and XOR all the bytes in each address together:
192.168.1.1 for example becomes 192^168^1^1 which becomes 104
Now, we build a circle (I swear this is going somewhere) and arrange it so that 0/255 is at the top and the numbers go clockwise and wrap round. Wikipedia gives a visual example of what i'm talking about with the numbers 0-16:
Ignore those arrows for now, they'll become clear in a moment.
What we have now is a nice long list of bytes that represent our filename, and a shorter list of bytes that represent our peers.
We just find the nearest peer for each byte of the hash until we've got enough peers.
It should be noted that this is completely insane and not the standard way to build a consistent hash algorithm - but it works, and it gives a nice distribution of peers for any given file.
Let's implement this thing then!
Implementation finally
Let's go back to our fileshare node and add the hash function. We'll need to import hashlib (naturally) and add a few message handlers to finish off our node. The consistent hash function will live outside of the FileShareNode class for now and we'll name it something really creative like consistent_hash. We'll also make use of python's sets module and the Set datatype to make things simple.
As per the apologies section above, no code here - sorry!
Code will come in the next article when i've got the time for it.
Thanks.
I wonder if anyone else follows, but definitely the code and 'homeworks' are the way to continue this series.
Is there interest in finishing this off? I'm aware it's been nearly a year.
I followed and upvoted you. Please follow, comment and upvote back. Thanks!