# Hashing Explained

Demystifying Hashing

Hashing is a broadly studied topic among mathematicians; in fact, hash functions are attractive mainly for their many applications in modern Computer Science.

Very often, confusion arises when speaking about Hashing. People confuse Hashing with base changes (e.g. from base 10 to base 32 or base 8); others confuse Hashing with Random Number Generation (e.g. randomly generating an integer, like the outcome of a pair of dice); worse still, Hashing is confused with Pseudo-Random Number Generation because of the similarity of the generating functions.

Let’s shed some light on this.

• Base Change. Given a bit string, a base change defines a bit-encoding strategy; it is not to be confused with Coding Theory, as it is just the simplest way to represent bit patterns in a human-readable form. For example, take a bit string composed of 8 bits (i.e. a byte), 10110011 in base 2: the same information can be represented in base 16 using an encoding that groups clusters of 4 bits, obtaining B3 in base 16; such an encoding is commonly known as hex encoding.
• Random Number Generator. A subset of random processes whose outcome is a number; for example, an experiment that rolls two or more dice is properly called a random process, and its outcome is a random number. It is important to remember that natural phenomena are rightly defined as “random” (e.g. white noise, thermal noise, etc.), while computers cannot generate a properly random process: their level of entropy (i.e. the measure of disorder, of chaos) is not enough to approximate natural phenomena. For this reason, in Computer Security, dedicated hardware is used to implement random experiments that generate, for example, random seed numbers.
• Pseudo-Random Number Generator. A computer cannot generate truly random numbers, as per above, but a family of algorithms known as PRNGs, studied in depth by D. Knuth – one of the pioneers of modern Computer Science – can generate a pseudo-random sequence starting from a given seed. An example is the Linear Congruential Generator, analyzed extensively by Knuth, which uses the modular recurrence $X_{n}=(aX_{n-1}+c) \mod m$, in which $X_0$ is the seed.
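The recurrence above can be sketched in a few lines of Python. This is a minimal illustration, not a production generator; the default constants are Knuth’s MMIX parameters, chosen here just as a concrete example:

```python
def lcg(seed, a=6364136223846793005, c=1442695040888963407, m=2**64):
    """Linear Congruential Generator: X_n = (a*X_{n-1} + c) mod m.

    The whole sequence is fully determined by the seed X_0,
    which is exactly why the output is only *pseudo*-random.
    """
    x = seed
    while True:
        x = (a * x + c) % m
        yield x

gen = lcg(seed=42)
print([next(gen) for _ in range(3)])  # same seed -> same sequence, every run
```

Re-creating the generator with the same seed reproduces the identical sequence, which is the defining difference from a true random process.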
• Hashing. Better defined in the following paragraphs, it is a function $f(\cdot)$ that associates elements of a domain $X$ with elements of a smaller codomain $Y$: $y = f(x)$. Every $x\in X$ is mapped to exactly one $y\in Y$; since $|X| > |Y|$, the function cannot be injective (distinct inputs may share the same output), while it is typically surjective, every $y\in Y$ being reachable from some $x\in X$.

Therefore, Hashing is not base/representation changing, is not Random Number Generation, and is not Pseudo-Random Number Generation: Hashing is a many-to-one (and typically surjective) mathematical mapping in the general sense (afterwards it will be specialized for the Computer Science case).
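The distinction can be made concrete with the byte from the base-change example above. A base change merely re-encodes the same information, while hashing maps it one-way into a different space (here using SHA-256 from Python’s standard library, purely for illustration):

```python
import hashlib

b = 0b10110011             # the example byte, 179 in base 10
print(format(b, "02X"))    # base change: same information, new representation -> B3

digest = hashlib.sha256(bytes([b])).hexdigest()
print(digest[:16], "...")  # hashing: a one-way mapping, not a re-encoding
```

The hex string `B3` can be converted back to `10110011` losslessly; the hash digest cannot.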

Hashing vs. Cryptographic Hashing

An important distinction has to be made between Hashing, intended as a mathematical tool in Computer Science at large, and Cryptographic Hashing, intended as a mathematical tool in Computer Security (a subset of Computer Science).

Hash functions have to exhibit the following properties:

1. Speed. The hash has to be computed fast enough to be viable in general-purpose applications; speed is related to algorithmic complexity, so very fast hash functions may be weak, failing the remaining properties below.
2. Avalanche Effect. The hash has to produce large differences in the output patterns for slight modifications of the input patterns (intended as bit strings).
3. Collisions. By its mathematical nature, a hash function generates collisions: several input strings can map to the same hashed string (still intended as bit strings); the higher the probability of collision, the weaker the hash function.
4. Consistency. For identical inputs, the function has to generate identical outputs.

These are the fundamental requirements for a function to be called a hash.
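Property 4 (consistency) is easy to verify directly. A small sketch using SHA-256 from Python’s standard library (any hash function would behave the same way):

```python
import hashlib

# Consistency: identical inputs must always yield identical digests,
# across calls, runs, and machines, for the same algorithm.
d1 = hashlib.sha256(b"same input").hexdigest()
d2 = hashlib.sha256(b"same input").hexdigest()
print(d1 == d2)  # True

# A different input yields a (generally) different digest.
d3 = hashlib.sha256(b"other input").hexdigest()
print(d1 == d3)  # False
```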

Often, confusion arises when talking about Hashing versus Cryptographic Hashing, so let’s see the main difference. Basically, a hashing algorithm/function exhibiting a very good level of Avalanche Effect (i.e. changing only one bit in the original bit string flips more than half of the bits in the hashed bit string) can with good confidence be called a Cryptographic Hash Algorithm, since it is:

• infeasible to recover the original bit string from the hash bit string;
• infeasible to modify the original bit string without modifying the hash bit string;
• infeasible to find different bit strings having the same hash bit string;

According to the properties above, it is clearly infeasible to mount a collision-based attack to reverse-engineer the hashing algorithm. Looking again at the requirements, strongly enforcing point 2 (an enormous change of the output for a very small perturbation of the input) makes the function hard to reverse-engineer, and keeping the collision probability of point 3 low by design (a huge codomain $Y$) makes the hash function cryptographically strong.
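The avalanche effect can be observed empirically: flip a single input bit and count how many of the 256 output bits change. A minimal sketch with SHA-256 (the message is an arbitrary example):

```python
import hashlib

def bit_diff(a: bytes, b: bytes) -> int:
    """Count differing bits between two equal-length byte strings."""
    return sum(bin(x ^ y).count("1") for x, y in zip(a, b))

msg = b"hello hashing"
flipped = bytes([msg[0] ^ 0b00000001]) + msg[1:]  # flip one single input bit

h1 = hashlib.sha256(msg).digest()
h2 = hashlib.sha256(flipped).digest()
print(bit_diff(h1, h2), "of 256 output bits changed")  # typically close to 128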

Explanation

Let’s focus on Computer Science, and see how hash functions are designed and how they work. For this purpose, consider the definition below:

$y=f(x), \quad X\equiv \{0,1\}^{N}, \quad Y\equiv \{0,1\}^{M}, \quad M < N$

$\Downarrow$

$f:X \longrightarrow Y, \quad |X| = 2^{N} \gg 2^{M} = |Y|$

This is a formal statement of how hash functions generally work in Computer Science applications: each bit string in the domain $X$ is mapped deterministically to one of the bit strings in the (much smaller) codomain $Y$. The picture below sketches out this mapping in an intuitive way.

As highlighted above, collisions are one of the weaknesses of badly designed hash functions. Collisions cannot be removed: a set of larger cardinality (e.g. random alphanumeric strings, identified as $X$) has to be mapped onto a set of smaller cardinality (e.g. hexadecimal strings over a reduced number of characters, identified as $Y$). They can, however, be alleviated by distributing elements homogeneously over the codomain, which avoids creating big collision clusters.

In fact, one of the prerogatives of hashing is to reduce the length of the working string. There are many reasons for this, but the most obvious one to a Software Engineer is the hash table: assume a hash table designed to store elements that are alphanumeric strings of arbitrary length. The biggest problem consists in designing a hash function that represents each element with a reduced number of characters/bits, so that the hash table fits in memory (the number of alphanumeric strings of arbitrary length can be indefinitely large, much larger than any computer memory).
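The hash-table use case can be sketched as follows: arbitrary-length strings are hashed down to a small bucket index, and colliding keys share a bucket. The bucket count and key names are arbitrary choices for illustration:

```python
import hashlib

def bucket_index(key: str, n_buckets: int = 8) -> int:
    """Map an arbitrary-length string into one of n_buckets slots."""
    digest = hashlib.sha256(key.encode()).digest()
    return int.from_bytes(digest[:8], "big") % n_buckets

# A toy hash table: each slot holds a list, so collisions simply
# accumulate in the same bucket instead of being lost.
table = [[] for _ in range(8)]
for word in ["apple", "banana", "cherry", "date", "elderberry"]:
    table[bucket_index(word)].append(word)
print(table)
```

A well-distributed hash keeps the buckets roughly balanced; a poor one would pile many keys into the same slot, degrading lookups from constant time toward a linear scan.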

Collision Probability

As seen, collision behavior is a crucial point in designing hash functions, but not only: if a decision is needed among a set of viable functions, how does one decide which to pick? One of the most influential factors should be the collision probability, which, intuitively, depends strongly on the cardinality of the codomain. The formal derivation is not straightforward, since it follows the Birthday Paradox; for simplicity, here is the closed-form approximation:

$Pr\{ collision \} \approx \frac{K^{2}}{2^{M+1}}, \quad K \ll 2^{N}$

where $M$ is the number of bits of the codomain strings and $K$ is the number of domain strings to be hashed, which is much less than $2^{N}$ because, as said, the domain can be indefinitely large, and normal applications do not need to hash all domain values.

In figures, using SHA-256 (hashing over 256 bits) to hash a trillion values (e.g. $10^{12}$ alphanumeric strings of random length), the collision probability is $p_{sha256}= \frac{10^{12 \cdot 2}}{2^{256+1}} \approx 4 \cdot 10^{-54}$, or in other words, something that is not going to happen in practice. That probability grows dramatically using MD5, which works on 128 bits: in this case $p_{md5}= \frac{10^{12 \cdot 2}}{2^{128+1}} \approx 1.5 \cdot 10^{-15}$, and $p_{md5} \gg p_{sha256}$.
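The figures above are easy to reproduce by evaluating the birthday-bound approximation directly:

```python
def collision_probability(k: int, m_bits: int) -> float:
    """Birthday-bound approximation: Pr{collision} ~ K^2 / 2^(M+1),
    valid when K is far smaller than the codomain size."""
    return k * k / 2 ** (m_bits + 1)

k = 10 ** 12  # one trillion hashed values
print(f"SHA-256 (M=256): {collision_probability(k, 256):.1e}")  # ~4.3e-54
print(f"MD5     (M=128): {collision_probability(k, 128):.1e}")  # ~1.5e-15
```

Halving the digest width from 256 to 128 bits raises the collision probability by roughly 39 orders of magnitude for the same workload, which is why codomain size dominates the choice.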

Examples and Application of Hashing Algorithms

Hashing has several branches of application in Computer Science; in particular, it is worth listing some of them.

• Cyclic Redundancy Check (CRC). Detect corrupted streams of bits involved in I/O operations; for example, information transferred over a network is subject to bit flipping (the natural phenomenon of noise on the channel), which can be detected with CRC techniques and recovered with richer error-correcting codes – there is an entire branch of Coding Theory around this.
• Checksum. Transform the original information into a smaller block for the purpose of checking consistency, in other words to verify whether the data has been corrupted or tampered with.
• Cryptography. Transform the original information into a block that cannot be reversed – as per above, very useful when identity has to be verified and non-repudiation has to be implemented.
• Lookup. As discussed, a typical usage is creating keys for lookup data structures.
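Three of the applications above can be shown side by side with Python’s standard library (the payload and key names are made up for the example):

```python
import hashlib
import zlib

data = b"payload transferred over the network"

# CRC: a cheap error-detection code for I/O and transmission errors.
print(f"CRC32:  {zlib.crc32(data):08x}")

# Checksum: a compact fingerprint to verify the data was not altered.
print(f"SHA256: {hashlib.sha256(data).hexdigest()[:16]}...")

# Lookup: Python's dict hashes each key internally to find its slot.
cache = {"user:42": "Alice"}
print(cache["user:42"])
```

The same underlying idea – map a long input to a short, deterministic digest – serves all three purposes; only the design trade-offs (speed vs. avalanche vs. collision resistance) differ.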

Conclusion

In this blog post we attempted to explain Hashing in a voice that may reach everyone: strict formalism has been replaced by intuitive explanations to keep the text fluid.

The primary objective was to shed light on the confusion that Software Engineers can experience daily: very few people deeply understand what Hashing really is, and much of the time communication is compromised and ineffective because of such misunderstandings.