What's a Merkle Tree?

Merkle trees are binary trees of hashes. Each node in the Merkle tree hashes its input data. Nodes concatenate the hashes of their children before hashing.

Each node in the Merkle tree performs MD5 hashing of input data. Parent nodes concatenate the hashes of their children before hashing. (Credit: T. L. Oliver CC BY-SA 4.0)


If you've tried to understand how Bitcoin or file-sharing services like DC++ work, you will have noticed that both employ something called 'Merkle trees.' Merkle trees are very useful for determining whether a piece of data belongs in your data set, which matters a great deal when transferring and storing many gigabytes of file data. To understand Merkle trees, you first have to understand what hash values and trees are and how they work.

Hash Functions & Hash Values

A hash function takes a string of data of any length as input and outputs a fixed-size block of data, called a hash value. Hash functions are used to verify the integrity of data regardless of the size of that data. Due to what's known as the avalanche effect, a very tiny change to the input of a hash function results in a completely different output. For a cryptographic hash function, this makes it astronomically difficult to alter the input in a way that leaves the output unchanged. So, to find out whether a file has been altered or corrupted after it was sent over the internet or stored on a drive, we compare the current hash of the file with a reliable, known hash value; any tampering shows up as a mismatch. In a practical sense, a matching hash lets one show that the input data hasn't been modified.

Modifying just one word in a sentence completely changes the hash value. "The red fox runs across the ice" hashes to 52ED879E, but "The red fox walks across the ice" hashes to 460242841.

Hash functions relate input data to a particular hash value. Trying to determine the input data from a hash value is infeasible.
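
The behavior described above can be tried out with a few lines of Python. This is only a minimal sketch: it uses SHA-256 from the standard hashlib module rather than the hash function behind the figures in this article, so the printed values will not match them, and the helper names digest and is_unmodified are made up for illustration.

```python
import hashlib


def digest(text: str) -> str:
    """Return the SHA-256 hash of a string as 64 hexadecimal characters."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


original = digest("The red fox runs across the ice")
modified = digest("The red fox walks across the ice")

# The output has a fixed size no matter how large the input is.
print(len(original), len(digest("x" * 1_000_000)))

# Changing one word produces an unrelated hash value (the avalanche effect).
print(original)
print(modified)


def is_unmodified(data: str, known_digest: str) -> bool:
    """Compare the current hash of the data with a trusted, known hash value."""
    return digest(data) == known_digest


print(is_unmodified("The red fox runs across the ice", original))
print(is_unmodified("The red fox walks across the ice", original))
```

The last two lines mirror the integrity check: the data is accepted only when its current hash equals the trusted one.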

Trees

A tree is a data structure that allows data elements to be searched quickly even if there are millions or billions of items. Data are stored in nodes. Much like an actual tree, a tree data structure consists of a root node, which may be connected to child nodes, each of which may in turn branch to other child nodes, and so on. If a node has no child nodes, then it is called a leaf. In other words, a tree consists of a node connected to a set of child trees. Binary trees are a special case of the tree structure in which each node can have no more than 2 children.


The tree data structure resembles a living tree (except upside-down) in that it continually branches off from nodes until it reaches its leaves. (Credit: Love Sun and Dreams GFDL or CC BY-SA 3.0)
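
As a rough illustration of the terms above (root, child nodes, leaves, binary tree), here is a small Python sketch. The Node class and the helpers count_leaves and is_binary are hypothetical names chosen for this example, not part of any library.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Node:
    """A tree node: some data plus a list of child nodes (child trees)."""
    value: str
    children: List["Node"] = field(default_factory=list)

    def is_leaf(self) -> bool:
        # A node with no child nodes is a leaf.
        return not self.children


def count_leaves(node: Node) -> int:
    """Count the leaves reachable from this node."""
    if node.is_leaf():
        return 1
    return sum(count_leaves(child) for child in node.children)


def is_binary(node: Node) -> bool:
    """True if no node in the tree has more than 2 children."""
    return len(node.children) <= 2 and all(is_binary(c) for c in node.children)


# A root with two children, each of which has two leaves.
root = Node("root", [
    Node("left", [Node("a"), Node("b")]),
    Node("right", [Node("c"), Node("d")]),
])

print(count_leaves(root), is_binary(root))
```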

Merkle Trees

A Merkle tree is a tree (typically a binary tree) of hash values. Each leaf in the tree is the hash of a data block. Every other node is the hash of the concatenation of its child nodes' hashes. Merkle trees are used to efficiently and securely verify the integrity of large files or many blocks of data. Because hashes are much smaller than the data blocks, storing and transmitting the hashes is much faster and more efficient than using the data blocks for the same purpose. As long as the root hash is obtained from a trustworthy source, the remainder of the tree can be obtained from less trusted sources. If a source sends a damaged or incorrect tree, the mismatch with the root hash is detected, and the tree can be redownloaded from another source.

Modifying one of the data blocks, even slightly from 'three' to 'tree', will trigger changes in hash values all the way up to the root hash. This modified root hash value indicates data corruption.

Corruption in a data block will alter its hash which will propagate upwards towards the root. (Credit: T. L. Oliver CC BY-SA 4.0)
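
The construction and the effect of a corrupted block can be sketched in Python as follows. Several details here are assumptions made for the example rather than anything fixed by the description above: SHA-256 stands in for the hash function in the figures, child hashes are concatenated as raw bytes, the last hash is duplicated when a level has an odd number of nodes, and the blocks other than 'three' are invented. The function name merkle_root is likewise hypothetical.

```python
import hashlib
from typing import List


def sha256(data: bytes) -> bytes:
    return hashlib.sha256(data).digest()


def merkle_root(blocks: List[bytes]) -> str:
    """Return the root hash (hex) of a binary Merkle tree over the blocks."""
    if not blocks:
        raise ValueError("at least one data block is required")
    # Leaves: the hash of each data block.
    level = [sha256(block) for block in blocks]
    while len(level) > 1:
        if len(level) % 2 == 1:
            # Assumption: pad an odd level by repeating its last hash.
            level.append(level[-1])
        # Parents: the hash of the concatenation of the two child hashes.
        level = [
            sha256(level[i] + level[i + 1])
            for i in range(0, len(level), 2)
        ]
    return level[0].hex()


blocks = [b"one", b"two", b"three", b"four"]
corrupted = [b"one", b"two", b"tree", b"four"]

# A change in a single block propagates up to a different root hash.
print(merkle_root(blocks))
print(merkle_root(corrupted))
print(merkle_root(blocks) == merkle_root(corrupted))
```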


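If you have the root hash "d50bdf..." and you want to know whether "three" is the value of your third block of data, then instead of having to check every block, you'd only need two hashes: "8cbad9..." and "2111c0...," which are the left child of the root hash and the right child of the right child of the root hash, respectively. Hash "three", concatenate the result with "2111c0..." and hash again to get the right child of the root, then concatenate "8cbad9..." with that value and hash once more. If the result equals "d50bdf...", the block is verified.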
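
The check just described can be sketched in Python. As before, this is an illustration under stated assumptions: SHA-256 replaces the hash function used in the figures (so the values differ from "d50bdf..." and the others), child hashes are concatenated as raw bytes, the blocks other than "three" are invented, and verify_block and its proof format are hypothetical.

```python
import hashlib
from typing import List, Tuple


def h(data: bytes) -> bytes:
    return hashlib.sha256(data).digest()


def verify_block(block: bytes, proof: List[Tuple[str, bytes]], root: bytes) -> bool:
    """Recompute the path from a block up to the root using sibling hashes.

    Each proof entry is (side, sibling_hash), where side says whether the
    sibling sits to the "left" or "right" of the node computed so far.
    """
    current = h(block)
    for side, sibling in proof:
        if side == "left":
            current = h(sibling + current)
        else:
            current = h(current + sibling)
    return current == root


# Build a four-leaf tree by hand so the proof can be checked against it.
leaves = [h(b) for b in (b"one", b"two", b"three", b"four")]
left = h(leaves[0] + leaves[1])    # left child of the root
right = h(leaves[2] + leaves[3])   # right child of the root
root = h(left + right)

# For the third block, only two hashes are needed: its sibling (the right
# child of the right child of the root) and the left child of the root.
proof_for_third = [("right", leaves[3]), ("left", left)]

print(verify_block(b"three", proof_for_third, root))
print(verify_block(b"tree", proof_for_third, root))
```

With four blocks the saving is small, but the number of hashes in such a proof grows with the height of the tree rather than with the number of blocks.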