Maximum likelihood and maximum a posteriori methods

introduction

I worked on medical imaging for several years (specifically, a sub-field of magnetic resonance imaging called quantitative susceptibility mapping which aims to calculate magnetic susceptibility, see this note for more technical details). In this post I am going to give a gentle introduction to two major methods in medical imaging: maximum likelihood (ML) and maximum a posteriori (MAP) methods.

Often times, medical imaging problems have the same setup as the deblurring or denoising problems in computer vision. A measurement of the system can be described by the following equation

\[P\mathbf x= \mathbf y\]

where \(\mathbf x\) is the desired 2D or 3D image, \(P\) corresponds to the data collecting process, and \(\mathbf y\) is the measurement result. In this post, I will use bold face letters to denote vectors, and captal letters to denote matrices. Note that not all bold faces are equal: the vector dimensions could be different.

Usually \(P\) is some kind of ‘projection’ that reduces information. For example, it could be a 2D projection of a 3D object, an average filter that blurs the image, etc. Conceptually, one can think of \(\mathbf x\) as a column vector (say \(n\times 1\)) corresponding to the flattened image, \(\mathbf y\) as a column vector (say \(m\times 1\) with \(m< n\)), and \(P\) as a fat matrix.

Since \(P\) is rank-deficient, \(\mathbf x\) cannot be solved (the same reason that \(x_1+x_2 = 3\) cannot be solved). These problems are known as ill-posed inverse problems. One obvious way to solve for \(\mathbf x\) is to take multiple measurements and stack the \(P\) matrices and \(\mathbf y\) vectors in the hope that we get full-ranked or over-determined system of

\[A\mathbf x = \mathbf b, \text{ where } A=\begin{bmatrix}P_1\\ P_2\\ \vdots\end{bmatrix}, \mathbf b = \begin{bmatrix}\mathbf y_1\\ \mathbf y_2\\ \vdots \end{bmatrix}\]

Given full row rank of \(A\), the solution is simply the Penrose pseudoinverse (at least conceptually)

\[\mathbf x = (A^\dagger A)^{-1} A^\dagger \mathbf b\]

where \(A^\dagger\) is the conjugate transpose of \(A\).

Image reconstruction of conventional CT, MRI, and super resolution imaging all follow this paradigm. To do a little better, one can further take the noise property into consideration, which is commonly known as the maximum likelihood (ML) method. Typically, the noise information can be incorporated into a weighting matrix. More details are included in the next section.

The downside of ML is that the linear system has to be full-ranked or over-determined, which amounts to many measurements, potentially too expensive or even impossible to perform. The maximum a posteriori (MAP) method solves this dilemma by supplementing extra information. Typically, it is formulated as a minimization problem. For example \(x_1+x_2 = 3\) can be solved as

\[\text{argmin}_{x_1, x_2} \|x_1+x_2-3\|^2_2 + \lambda \|x_1\|^2_2\]

where the two terms are called data term and regularization term respectively, and \(L_2\) norm is used for both of them. Here the regularization term tries to convey our prior information that \(x_1\) is small (what about 0?), and \(\lambda\) is the so-called regularization parameter which is to be chosen as well.

Depending on the problem, the form of the regularization term (and the data term) could vary a lot. The determination of \(\lambda\) is also tricky. In practice, MAP can work very well, almost like magic. By magic, I mean turning noise into Lena Soderberg.

maximum likelihood (ML)

To develop the ideas, I will use two simple examples

example 1: one unknown and one measurement

In this case, the system equation can be written as

\[y = ax + \epsilon\]

where \(\epsilon\) is noise, and is assumed to be a Gaussian random variable with zero mean and standard deviation \(\sigma\). To simplify the notation, let’s denote it as

\[\epsilon \sim g(0, \sigma)\]

In this equation, \(y\) is known because it is the outcome of the measurement, \(a\) is known because we know how the system works (or how the signal is formed), \(x\) is to be determined, and the specific value of \(\epsilon\) in the particular one measurement is not known and is usually not of interest.

I guess in this case everyone would agree on the solution

\[x = \frac{y}{a}\]

The underlying thought is that since the noise is most likely to be 0, let’s consider the 0 noise solution. This is the essence of ML method. To back up this intuition formally, we can maximize the probability of getting \(y\) given \(x\):

\[\begin{align} x_{ML} =& \text{argmax}_x P(y|x) \longrightarrow P(\epsilon)\\ =& -\text{argmin}_x \log P(y|x)\longrightarrow \frac{(y-ax)^2}{\sigma^2} \end{align}\]

In this simple example, the noise variance does not play a role and all we need to do is to solve the system equation as if no noise is present.

Similarly, if we know the noise follows a Gaussian distribution with non-zero mean, e.g.,

\[\epsilon \sim g(20, \sigma)\]

Then the solution will be

\[x = \frac{y - 20}{a}\]

example 2: two unknowns and two measurements

In this case, the system equation becomes

\[\mathbf y = A \mathbf x + \begin{bmatrix}\epsilon_1\\ \epsilon_2\end{bmatrix}\]

Here I only write out the noise vector explicitly, but both \(\mathbf x\) and \(\mathbf y\) can be written in the same way. To make it slightly more interesting, let’s assume the two noise terms have different variance, i.e.,

\[\epsilon_1\sim g(0, \sigma_1), \quad\epsilon_2\sim g(0, \sigma_2)\]

Following the same ML recipe, we get

\[\begin{align} \mathbf x_{ML} =& -\text{argmin}_{\mathbf x}\log P(\mathbf y|\mathbf x)\longrightarrow \frac{1}{\sigma_1^2}(y_1 - (A\mathbf x)_1)^2 + \frac{1}{\sigma_2^2}(y_1 - (A\mathbf x)_2)^2\\ =& \text{argmin}_{\mathbf x} \| W (\mathbf y - A \mathbf x)\|^2 \end{align}\]

where the weighting matrix \(W = \text{diag}(1/\sigma_1, 1/\sigma_2)\) and we have used the independence between the two noise terms (random variables)

\[P(\epsilon_1, \epsilon_2)=P(\epsilon_1)P( \epsilon_2)\]

To make it even more interesting, we can consider the two noise terms to be jointly Gaussian with correlations. I will leave it as an exercise to you.

It is easy to extend this formalism to arbitrary numbers of unknowns and measurements and we will always end up with

\[\mathbf x_{ML} = -\text{argmin}_{\mathbf x} \log P(\mathbf y|\mathbf x)=\text{argmin}_{\mathbf x} \| W (\mathbf y - A \mathbf x)\|^2_2\]

as long as the noises are Gaussian. Note that the weighting matrix may not be diagonal in general.

maximum a posteriori (MAP)

The easiest way to understand MAP is from the direction of problem solving. Since our noiseless system equation is

\[\mathbf y = A \mathbf x\]

we can view \(\mathbf x\) and \(\mathbf y\) as the input and output of a black box. Knowing \(\mathbf x\) to get \(\mathbf y\) is the forward problem whereas knowing \(\mathbf y\) to get \(\mathbf x\) is the inverse (or backward) problem. In the presence of noise, ML is to maximize the forward problem probability, i.e., \(P(\mathbf y|\mathbf x)\), whereas MAP is to maximize the inverse problem probability, i.e., \(P(\mathbf x|\mathbf y)\). Note that this statement holds even if \(A\) is not a linear operator. Conceptually, I would argue that MAP makes more sense because conditioning should be applied on a known quantity.

Formally, the solution can be written as

\[\begin{align} \mathbf x_{MAP} =& \text{argmax}_{\mathbf x} P(\mathbf x|\mathbf y) \\ =& \text{argmax}_{\mathbf x} P(\mathbf y|\mathbf x)P(\mathbf x)\\ =& -\text{argmin}_{\mathbf x} \log P(\mathbf x) + \log P(\mathbf y|\mathbf x) \end{align}\]

where we have used the Bayes’ rule

\[P(x|y) = \frac{P(y|x) P(x)}{P(y)}\]

and the fact that \(P(y)\) is irrelevant to the estimation of \(x\) because it doesn’t depend on \(x\).

Compared to the ML solution, the MAP solution has the extra term \(\log P(\mathbf x)\). Note that this term has nothing to do with any particular measurement and is a property of the solution \(\mathbf x\). For example, suppose we are deciding whether a fruit is apple or orange (\(x\)) based on its color (\(y\)) and some look table that maps color to fruit type (\(A\)). Then \(P(x)\) is the prior knowledge that (for example) how many apples and oranges exist on earth.

Again if we assume Gaussian noise, then we have the formal solution

\[\mathbf x_{MAP} = -\text{argmin}_{\mathbf x} \log P(\mathbf x|\mathbf y)=\text{argmin}_{\mathbf x} \| W (\mathbf y - A \mathbf x)\|^2_2 - \log P(\mathbf x)\]

where the two terms are data term and regularization term. In practice, the choice of \(\log P(\mathbf x)\) is a bit like an art and there is a big community working on these optimization problems with regularizations. Some common ones are

Tikhonov regularization, or \(L_2\) regularization
\(L_1\) regularization
\(L_p\) regularization where \(0<p<1\). In this case the optimization problem is no longer convex.
Edge-based regularization, typically for imaging processing problems
Multi-modality-based regularization
Sparsity-based regularization

These tricks are not necessarily exclusive/independent of each other.

the good, bad, and ugly (of real life)

Up to this point, you might have the false impression that ML and MAP are nice and clean (should I also say easy? since the formal solution is provided by Penrose pseudoinverse for ML and MAP when regularization is \(L_2\)). In reality, there are quite a few problems.

Firstly, we may not have an explicit \(A\) matrix, making the Penrose pseudoinverse solution useless. For example, a 3D image with size 256x256x256 has over 16 million values, thus a square matrix acting on \(\mathbf x\) would have \(2.8\times10^{14}\) entries, impractical for storage in the memory. So practically we will be solving optimization problems (recall those argmin objective functions).

Often times iterative solvers are part of the tools for the optimization problem, e.g., conjugate gradient method (CG), or gradient descent method (usually CG is preferred due to its nice convergence properties). In these iterative methods, only the action of \(A\), i.e., \(A\mathbf x\), is needed, which requires roughly the same storage as \(\mathbf x\) (maybe a little more, but in the same order of magnitude).

The problem with large implicit \(A\) matrix is two folded:

It may be slow to calculate a solution using iterative methods.
We may never know whether \(A\) has full row rank.

While the first aspect is only a matter of resources, the second one would suggest the treatment of MAP. Actually even with MAP, it is hard to tell whether the augmented system has unique solution. In reality, MAP works so well that the uniqueness concern is only for the purists. I also wrote another post on what happens if we apply CG on rank deficient system in case you are interested.

Secondly, real life noises may not be Gaussian and the regularization term may be so complicated that we end up with nonlinear and non-convex optimization problems. The formalism of ML and MAP can be applied without change in the presence of non-Gaussian noises. The only difference is that the objective function would become nonlinear.

Of course a lot of tricks have been developed for nonlinear and non-convex problem. However, my experience is that a complex model is usually bad.

Thirdly, we still need to determine the regularization parameters in MAP. As far as I know, the only way to do this is via try and error. Basically, the optimization problem needs to be solved with different values of regularization parameters. To be a little fancier, you can use the L-curve method which provides a somewhat systematic way to determine them. The idea is to plot the data term and regularization term (without the regularization parameter scaling) for each regularization paramter and find the point where both of them are small. Usually this plot has the L shape thus its name.

Another way to determine the regularization parameters is the so-called discrepancy principle. Here the idea is to tune the regularization parameters such that the data term matches the expected noise level.

You can imagine how much trouble one gets with multiple regularization terms. Let me also remind you that the regularization parameters may affect the solution, thus it is tricky to judge their goodness based on L-curve, or any other principles.

In general, tuning the regularization term and regularization parameters could be a bit like an art.

more recent development

So far I described the very basics of ML and MAP. In reality, more interesting ideas are being explored

compressive sensing: Here the goal is to sample as little as possible, sometimes well below the Nyquist rate. The idea is that aliasing and noise are intrinsically different things. Thus by randomly undersample in frequency space and utilize sparsity in some other space (for example, real images have sparse edges), image reconstruction over full frequency range may be possible.
dictionary learning: In signal processing, Fourier bases and wavelet bases are commonly used as general-purpose basis functions. But for any problem in practice, a customized basis function set will perform better. Here the idea is to construct an overly complete bases for the problem of interest and utilize sparsity constraints with respect to this basis set.