In these notes and in Latent variable, I’ll develop some basic notions up to the “ELBO” in a manner that seems to me more beautiful and unified than the usual approach.

I’ll take the log-likelihood, which is equal to the negative cross-entropy, as the foundational notion, and I’ll immediately interpret it in terms of the probability that a model almost surely assigns to a large IID sample.


Definition. For a sampling distribution $p$ and a model distribution $q$, the log likelihood assigned by $q$ to $p$ is:

$$\ell\ell(p; q) := \mathbb E_{x \sim p}[\log q(x)]$$
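
To make the definition concrete, here is a minimal numerical sketch for finite distributions (Python/NumPy; the array representation and the name `log_likelihood` are my own choices, not anything standard):

```python
import numpy as np

def log_likelihood(p, q):
    """E_{x ~ p}[log q(x)] for finite distributions given as probability
    vectors over the same support."""
    p, q = np.asarray(p, dtype=float), np.asarray(q, dtype=float)
    return float(np.sum(p * np.log(q)))

# A biased coin p modeled by a fair coin q.
p = [0.7, 0.3]
q = [0.5, 0.5]
print(log_likelihood(p, q))  # log(0.5) ≈ -0.693
print(log_likelihood(p, p))  # ≈ -0.611: modeling p by itself scores higher
```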

This quantity precisely determines the probability given by $q$ to large IID samples from $p$.

Theorem. If $x_1, x_2, \dots$ are a sequence of IID samples from $p$, then

$$P\left(\lim_{n \to \infty} (q(x_1)\cdots q(x_n))^{1/n} = \exp(\ell\ell(p; q))\right) = 1$$

Proof.

\begin{align*} &P\left(\lim_{n \to \infty} \left(\prod_{i=1}^nq(x_i)\right)^{1/n} = \exp\ell\ell(p; q)\right) \\ &= P\left(\lim_{n \to \infty} \log\left(\prod_{i=1}^nq(x_i)\right)^{1/n} = \ell\ell(p; q)\right) \\ &= P\left(\lim_{n \to \infty} \frac1n\sum_{i=1}^n\log q(x_i) = \mathbb E_{x \sim p}[\log q(x)]\right) \\ &= 1. \end{align*}

Here the final step is the strong law of large numbers. $\square$
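
As a sanity check of the theorem, here is a small simulation (my own toy numbers; the geometric mean is computed through logs to avoid underflow of the raw product):

```python
import numpy as np

rng = np.random.default_rng(0)

# Sampling distribution p and model q over a 3-symbol alphabet (made-up numbers).
p = np.array([0.5, 0.3, 0.2])
q = np.array([0.4, 0.4, 0.2])

ll = np.sum(p * np.log(q))        # E_{x ~ p}[log q(x)]

n = 100_000
samples = rng.choice(3, size=n, p=p)             # IID draws from p
geo_mean = np.exp(np.mean(np.log(q[samples])))   # (q(x_1) ... q(x_n))^(1/n)

print(geo_mean, np.exp(ll))       # the two numbers should nearly agree
```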


Therefore, given a sampling distribution $p$:

  • Selecting the model $q$ to maximize $\ell\ell(p; q)$ – which is the standard method of training models (e.g., classifiers, LLMs) – is exactly equivalent to maximizing the probability it (almost surely) assigns to a large IID sample.
  • (Universality of log-likelihood:) Let $q_1, q_2$ be any pair of probability models. Suppose we evaluate them both as follows: we take an IID sample of data, and then score each of them by an arbitrary monotone function of the probability it assigns to the sample (e.g., any kind of gambling reward). Then, for a large enough sample, the one with the better log-likelihood will almost surely win (see the sketch below).
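
Here is a quick simulation of the second point, in my own toy setup: data from a standard normal, scored by two hypothetical Gaussian models. Summing log-densities is a monotone function of the probability (density) assigned to the sample, so the ranking is the same as for any other monotone score:

```python
import numpy as np

rng = np.random.default_rng(1)

def log_density(x, mu, sigma):
    # log of the Gaussian density with mean mu and standard deviation sigma
    return -0.5 * ((x - mu) / sigma) ** 2 - np.log(sigma * np.sqrt(2 * np.pi))

x = rng.normal(0.0, 1.0, size=50_000)   # IID sample from the "true" distribution

# Total log-probability-density each model assigns to the sample.
score_q1 = np.sum(log_density(x, mu=0.1, sigma=1.0))   # closer to the truth
score_q2 = np.sum(log_density(x, mu=1.0, sigma=1.0))   # farther from the truth
print(score_q1 > score_q2)   # True for a large enough sample
```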

Theorem. For any two distributions $p, q$,

$$\ell\ell(p; p) \geq \ell\ell(p; q).$$

Proof.

\begin{align*} \ell\ell(p; q) - \ell\ell(p; p) &= \mathbb E_{x \sim p}[\log(q(x)/p(x))] \\ &\leq \log\mathbb E_{x \sim p}[q(x)/p(x)] \\ &= \log(1) = 0, \end{align*}

where the inequality is Jensen's inequality applied to the concave function $\log$. $\square$

Remark: This theorem establishes the intuition that “the best model is reality itself”.


Therefore, we can consider the quantity $\ell\ell(p; p) - \ell\ell(p; q) \geq 0$ to be the gap between the best possible model, which is $p$, and our current model $q$. (This is typically called the KL-divergence.)
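
A quick numerical check of this nonnegativity, with my own choice of random discrete distributions (Dirichlet-sampled) and a small floating-point tolerance:

```python
import numpy as np

rng = np.random.default_rng(2)

def ll(p, q):
    # E_{x ~ p}[log q(x)] for probability vectors p, q
    return np.sum(p * np.log(q))

# The gap ll(p; p) - ll(p; q), i.e. the KL divergence, should never be negative.
for _ in range(1000):
    p = rng.dirichlet(np.ones(5))
    q = rng.dirichlet(np.ones(5))
    assert ll(p, p) - ll(p, q) >= -1e-12
```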


Theorem. Let $p_{x,z}$ and $q_{x,z}$ be joint distributions over $(X, Z)$, and $p_x, q_x, p_{z \mid x}, q_{z \mid x}$ be their respective marginal and conditional distributions. Then:

$$\ell\ell(p_{x,z}; p_{x,z}) - \ell\ell(p_{x,z}; q_{x,z}) \geq \ell\ell(p_x; p_x) - \ell\ell(p_x; q_x)$$

Proof.

\begin{align*} &\ell\ell(p_{x,z}; p_{x,z}) - \ell\ell(p_{x,z}; q_{x,z})\\ &= \mathbb E_{(x,z) \sim p}[\log p(x, z) - \log q(x, z)]\\ &= \mathbb E_{(x,z) \sim p}[(\log p(z \mid x) - \log q(z \mid x)) + (\log p(x) - \log q(x))]\\ &= \mathbb E_{x \sim p_x}\mathbb E_{z \sim p_{z \mid x}}[\log p(z \mid x) - \log q(z \mid x)] + \mathbb E_{x \sim p_x}[\log p(x) - \log q(x)]\\ &= \mathbb E_{x \sim p_x}[\ell\ell(p_{z \mid x}; p_{z \mid x}) - \ell\ell(p_{z \mid x}; q_{z \mid x})] + \ell\ell(p_x; p_x) - \ell\ell(p_x; q_x)\\ &\geq \ell\ell(p_x; p_x) - \ell\ell(p_x; q_x), \end{align*}

where the final inequality holds because, by the previous theorem applied to $p_{z \mid x}$ and $q_{z \mid x}$ for each fixed $x$, the term inside the expectation is non-negative. $\square$

Remark: This theorem establishes the intuition that it is always harder to model more variables at once. This will be important for variational inference.
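
And a numerical check of this theorem on small discrete joints (again my own toy setup: random 4×3 joint distributions over $(X, Z)$, with $Z$ marginalized out by summing over its axis):

```python
import numpy as np

rng = np.random.default_rng(3)

def kl(p, q):
    # ll(p; p) - ll(p; q) for distributions given as arrays of probabilities
    return np.sum(p * (np.log(p) - np.log(q)))

# Modeling the joint is at least as hard as modeling the X-marginal alone.
for _ in range(1000):
    p_xz = rng.dirichlet(np.ones(12)).reshape(4, 3)
    q_xz = rng.dirichlet(np.ones(12)).reshape(4, 3)
    joint_gap = kl(p_xz, q_xz)
    marginal_gap = kl(p_xz.sum(axis=1), q_xz.sum(axis=1))
    assert joint_gap >= marginal_gap - 1e-12
```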


Example: Fitting a Gaussian

Let $\mathcal D = \frac1n\sum_{i=1}^n \delta_{x_i}$ be the empirical distribution of a dataset $(x_1, \dots, x_n)$, and let

$$p_{\mu, \sigma^2}(x) = \tfrac{1}{\sqrt{2\pi\sigma^2}}\exp\left(-\tfrac{(x - \mu)^2}{2\sigma^2}\right)$$

be the Gaussian probability model with free parameters $\mu, \sigma^2$.

Theorem. $\ell\ell(\mathcal D; p_{\mu, \sigma^2})$ is maximized by $\mu = \frac1n \sum_{i=1}^n x_i$ and $\sigma^2 = \frac1n \sum_{i=1}^n (x_i - \mu)^2$.


Proof.

The log-likelihood is:

\begin{align*} \ell\ell(\mathcal D; p) &= \frac1n \sum_{i=1}^n \log p(x_i)\\ &= \frac1n \sum_{i=1}^n -\frac{(x_i-\mu)^2}{2\sigma^2} - \log\left(\sqrt{2\pi\sigma^2}\right) \\ &= -\left(\log(\sqrt{2\pi}) + \log(\sigma) + \frac1n \sum_{i=1}^n \frac{(x_i-\mu)^2}{2\sigma^2}\right) \end{align*}

To maximize it, we compute its gradient with respect to the parameters and set it to zero.

We first solve for $\mu$; the gradient is:

\begin{align*} \nabla_\mu\,\ell\ell(\mathcal D; p) &= \frac{1}{\sigma^2}\frac1n\sum_{i=1}^n (x_i-\mu) \\ &= \frac{1}{\sigma^2}\left(\frac1n\sum_{i=1}^n x_i - \mu\right) \end{align*}

which vanishes exactly when $\mu = \frac1n \sum_{i=1}^n x_i$.

The gradient for $\sigma$ is:

\begin{align*} \nabla_\sigma\,\ell\ell(\mathcal D; p) &= -\left(\frac{1}{\sigma} - \frac1n\sum_{i=1}^n \frac{(x_i-\mu)^2}{\sigma^3}\right) \\ &= -\frac{1}{\sigma^3}\left(\sigma^2 - \frac{1}{n}\sum_{i=1}^n (x_i-\mu)^2 \right) \end{align*}

which vanishes exactly when $\sigma^2 = \frac1n \sum_{i=1}^n (x_i - \mu)^2$. $\square$
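
To close the loop, a small numerical confirmation (my own synthetic dataset; note that the maximum-likelihood variance uses $1/n$, not the unbiased $1/(n-1)$):

```python
import numpy as np

rng = np.random.default_rng(4)
data = rng.normal(2.0, 1.5, size=10_000)   # synthetic dataset

def ll(data, mu, sigma2):
    # average Gaussian log-density over the dataset, i.e. ll(D; p_{mu, sigma2})
    return np.mean(-(data - mu) ** 2 / (2 * sigma2) - 0.5 * np.log(2 * np.pi * sigma2))

mu_hat = np.mean(data)                        # (1/n) sum x_i
sigma2_hat = np.mean((data - mu_hat) ** 2)    # (1/n) sum (x_i - mu_hat)^2

# Perturbing the closed-form solution should only lower the log-likelihood.
best = ll(data, mu_hat, sigma2_hat)
for dmu, ds2 in [(0.1, 0.0), (-0.1, 0.0), (0.0, 0.2), (0.0, -0.2)]:
    assert ll(data, mu_hat + dmu, sigma2_hat + ds2) <= best
```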