Sometime in the summer, I came across a nice blog post by Aatish Bhatia on entropy. Here I will echo on this topic with more technical details.

Two communities use entropy often: physicists and information theorists. My experience in quantum information theory put me in a position to see its usage from both communities. Thus in this post I will introduce terminologies from both statistical physics and information theory.

If you are totally new to entropy, I strongly recommend learning it from information theory because the treatment there is more general and cleaner. If you don’t have time to read a full book on modern communication systems, you can try this little book by Philip Woodward. He is a famous mathematician and radar engineer with great clarity in writing. Also he was born on 1919 and is still alive.

To make the discussion concrete, I will use coin tossing as examples.


At high level, entropy measures the complexity of a system. Specifically, complexity consists of two aspects

  • diversity
  • spread-out

Diversity is related to the number of possible outcomes whereas spread-out is related to the likelihood of the outcomes. For example, uniform distribution is more spread-out than Gaussian distribution.

The term “system” deserves some clarification too. Typically, it refers to a physical entity whose property can be observed in a repeated manner and statistics can be collected. For example,

  • a coin can be tossed repeatedly and the probability of getting head can be computed
  • a particle can be traced over time and its probability of being at certain location and having certain velocity can be computed
  • a group of quantum mechanical spins can be prepared in some initial state and their total spin can be measured after a fixed time span. This prepare-wait-measure procedure can be repeated many times to obtain the statistics of the total spin distribution.

Specifically, a system contains two parts

  • state space
  • probability distribution

State space is the set of all possible outcomes and probability distribution is the assignment of probability to each outcome. Taking coin tossing for example, if we are interested in tossing one coin, then the state space is {H, T} where H denotes head and T denotes tail. If we study two coins, then the state space is {HH, HT, TH, TT}.

Note that a larger state space does not guarantee a larger entropy. Suppose the system of interest consists of two very biased coins. Although there are 4 possible outcomes, the “effective state space” contains close to 1 outcome, say HH. On the other hand, if the system is a fair coin, there are 2 outcomes, both of which occur often. Thus intuitively, the complexity of the fair coin tossing should be bigger than the biased two-coin tossing.

In my opinion, entropy captures more or less this concept of “effective state space” cardinality.

a tale of two definitions

The confusion of entropy originates from its two definitions

\[S \equiv \log \Gamma \\ S \equiv - \sum_i p_i \log p_i\]

Here \(S\) is entropy, \(\Gamma\) is the number of states (state space cardinality), and \(p_i\) is the probability to be in state \(i\). I also omit the Boltzmann constant \(k_B\) to save typing (computer scientists do not include it either). Note also that \(\Gamma\) is not the partition function.

definition 1

The first definition is very intuitive. It is the state space cardinality in logarithmic unit. The choice of logarithm in the definition is due to two of its convenient properties:

  • \(\log 1 = 0\): the system has 0 entropy if there is only 1 outcome
  • \(\log mn = \log m + \log n\): entropy of independent subsystems can be added

This definition of entropy considers diversity but not spread-out. It is thus a rough measure of complexity. In information theory, it is called information capacity. In physics, it is only used when all outcomes occur with equal probability, i.e., the so-called micro-canonical ensemble.

Although definition 1 is a simple definition to measure state space cardinality, it plays important role in both physics and information theory. In physics, it is related to what temperature is. In information theory, it is related to how many bits are needed to store or transmit information.

Each computer bit has two states, 0 and 1. Thus one bit is enough to store the result of a coin tossing, i.e., \(\log_2 2 = 1\). With two coins, the number of bits needed is \(\log_2 2\times2 = 2\). Thus entropy has the concrete meaning of number of bits needed to store the observation outcomes.

I won’t expand the definition of temperature in statistical physics here except for pointing out that

\[\frac{1}{T} \equiv \frac{\partial S(E)}{\partial E}\]

In other words, temperature measures how quickly the number of states increases as the energy of the state increases. Most statistical mechanics books talk about this aspect of temperature. If you are not particularly patient, you can read this short book by Kerson Huang. He is a famous particle physicist who lived to age 88.

definition 2

The second definition includes both the diversity and the spread-out.

Admittedly, the formula looks a little strange. Let’s go over an example of coin tossing to see how to transition from definition 1 to definition 2.

Suppose we have a coin whose head and tail probabilities are \(p_H\) and \(p_T\), with \(p_H+p_T=1\). The essence of definition 1 is state counting. To reconcile state counting with probability distributions, we need to generalize the state space: instead of tossing the coin once and using {H, T} as state space, we can toss it \(N\) times and the state space expands to \(\Gamma=2^N\) outcomes.

With \(N\) approaches infinity, the number of heads approaches \(N_H=p_HN\) and there are \(C(N, N_H)\) ways to achieve that, where \(C(n,k)\) is the n-choose-k function. Thus the entropy can be defined as

\[S = \lim_{N\rightarrow\infty}\frac{\log_2 C(N,N_H)}{N}\]

Here the numerator is in the form of definition 1, but is extended to \(N\) observations such that probability distribution can be taken into account. The \(N\) in the denominator is for normalization purpose since the numerator obviously increases with \(N\). It can be seen as \(\log_2 2^N\) which is the information capacity of tossing \(N\) coins.

Using Stirling’s approximation

\[N!\simeq N\log N -N + O(\log N)\]

we end up with the binary entropy function

\[S = -p_H\log p_H - p_T\log p_T\]

I also stole a plot from wikipedia in Figure 1, where the vertical axis can be seen as \(S(p_H)\) and horizontal axis as \(p_H\). Note that it achieves its maximum of 1 when the coin is fair and drops to 0 when the coin is fully biased.

entropy function

Figure 1. Binary entropy function.

In my impression, physicists are less concerned with entropy than with other thermodynamic quantities because

  • it is usually not experimentally measurable
  • it is tricky to calculate
  • it is less useful than Helmholtz free energy (could be my bias)

On the other hand, entropy (in definition 2) has concrete meaning in information theory. It is called the information content, which is the minimum average capacity to store information (recall the effective state space cardinality!). It is essential in applications like data compression, encoding, etc.

For example, if the coin tossing always gives head, we don’t need to store any information of the outcome or transmit it over a communication channel, which matches up with its 0 entropy. If the coin is biased to head with \(p_H=0.999\) and we toss it 1000 times. Instead of recording the outcomes of the 1000 tosses as \(HTTH\ldots HH\) which requires 1000 bits, we can record only the indices of the tosses with tail outcome. On average there is only 1 index to record thus it only costs \(\log_2 1000 \simeq 10\) bits. Dividing this 10 by the number of tosses 1000, we basically get the entropy of this biased coin.

end words

Finally I would like to explain a bit more on why I recommend learning entropy from information theory instead of statistical physics for a novice. It basically boils down to the extra assumptions in physics about the probability distributions of states.

In physics, it is postulated that the probability of a state is proportional to \(\exp(-\frac{E}{k_B T})\) where \(E\) is the energy of the state, \(T\) is temperature, and \(k_B\) is the so-called Boltzmann constant . It is not uncommon that you don’t even see other probability distributions being considered in physics textbooks.

Another issue with the physics approach (at least for me) is that things only start to make sense after learning a lot of materials. For example, as you see earlier, temperature can be defined via entropy. Meanwhile, entropy depends on probability distribution and probability distribution depends on temperature. We basically have three snakes biting each neighbor’s tail.

I find these cyclic graph structure of definitions quite common in physics, in contrast to the directed acyclic graph structure or even tree structure in math. The nice thing about it is that we can use any point as the starting point. And an unconventional starting point usually provides a refreshing point of view on how the concepts relate to each other. For example, Richard Feynman uses partition function as the starting point in his statistical mechanics book, which makes the book quite interesting.

On the other hand, the information theory approach is more like pure math. It doesn’t depend on many concepts and is relatively concise. You can probably learn it much faster than the statistical physics approach.