KL Divergence Made Easy

Added 2022-07-28

KL-divergence is just the expected additional surprise from the map-territory mismatch.

See also: Six and a half intuitions for KL divergence.

The cross entropy is defined as the expected surprise when drawing from p(x), which we're modeling as q(x). Our map is q(x) while p(x) is the territory.

H(p, q) = \sum_{x} p(x)\log{\frac{1}{q(x)}}
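This definition can be computed directly for a discrete distribution. A minimal Python sketch (the coin distributions and function names are my own toy example, not from the text):

```python
import math

def surprise(prob):
    # Surprise (in nats) of an outcome assigned probability `prob`.
    return math.log(1.0 / prob)

def cross_entropy(p, q):
    # H(p, q): expected surprise when outcomes are drawn from the
    # territory p but surprise is measured under the map q.
    return sum(px * surprise(q[x]) for x, px in p.items() if px > 0)

# Toy example: a biased coin as the territory, a fair-coin map.
p = {"heads": 0.9, "tails": 0.1}
q = {"heads": 0.5, "tails": 0.5}

print(cross_entropy(p, q))  # surprise under the imperfect map (= log 2)
print(cross_entropy(p, p))  # the entropy H(p): surprise under the perfect map
```

With the fair-coin map, every draw costs log 2 nats of surprise regardless of the bias, which is more than the entropy of the biased coin itself.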

Now it should be intuitively clear that H(p, q) ≥ H(p, p), because an imperfect model q(x) will (on average) surprise us more than the perfect model p(x). (This is Gibbs' inequality.)

To measure the unnecessary surprise from approximating p(x) by q(x), we define

D_{\mathrm{KL}}(p \| q) = H(p, q) - H(p, p)

This is KL-divergence! The average additional surprise from our map approximating the territory.
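The difference-of-surprises definition simplifies algebraically to the familiar sum over log-ratios, which a short sketch can confirm (the toy distributions are my own illustration):

```python
import math

def kl_divergence(p, q):
    # D_KL(p || q) = H(p, q) - H(p, p): the extra surprise paid for
    # using the map q instead of the territory p itself. Expanding the
    # difference gives the usual sum of p(x) * log(p(x) / q(x)).
    return sum(px * math.log(px / q[x]) for x, px in p.items() if px > 0)

# Toy example: territory p, uniform map q.
p = {"a": 0.5, "b": 0.4, "c": 0.1}
q = {"a": 1 / 3, "b": 1 / 3, "c": 1 / 3}

print(kl_divergence(p, q))  # positive: the imperfect map costs extra surprise
print(kl_divergence(p, p))  # 0.0: a perfect map adds no surprise
```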

Now it's time for an exercise: in the following figure, q*(x) is the Gaussian that minimizes either D_KL(p‖q) or D_KL(q‖p). Can you tell which is which?

Figure: two graphs, one fits the map to the territory while the other fits the territory to the map.


The left is minimizing D_KL(p‖q) while the right is minimizing D_KL(q‖p).

Reason as follows:

  • If p is the territory, then the left q* is a better map (of p) than the right q*.
  • If p is the map, then the territory q* on the right leads to us being less surprised than the territory on the left, because on the left p would be very surprised by data in the middle, despite it being likely according to the territory q*.

On the left we fit the map to the territory, on the right we fit the territory to the map.
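The two fitting behaviours can be sketched numerically. The setup below is my own illustration, not the figure's actual data: a bimodal territory p (a mixture of two narrow Gaussians) and two fixed candidate maps, a wide Gaussian covering both modes and a narrow one sitting on a single mode. These are not the exact minimizers, but they show which direction of KL prefers which shape:

```python
import math

def gauss(x, mu, sigma):
    # Density of a 1-D Gaussian with mean mu and std sigma.
    return math.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * math.sqrt(2 * math.pi))

def p(x):  # the territory: a bimodal mixture with modes at -2 and +2
    return 0.5 * gauss(x, -2.0, 0.5) + 0.5 * gauss(x, 2.0, 0.5)

def wide(x):  # mode-covering map: one broad Gaussian spanning both modes
    return gauss(x, 0.0, 2.1)

def narrow(x):  # mode-seeking map: sits on the right mode only
    return gauss(x, 2.0, 0.5)

def kl(a, b, lo=-6.0, hi=6.0, n=1201):
    # Riemann-sum approximation of D_KL(a || b) for 1-D densities.
    dx = (hi - lo) / (n - 1)
    total = 0.0
    for i in range(n):
        x = lo + i * dx
        ax = a(x)
        if ax > 1e-300:
            total += ax * math.log(ax / b(x)) * dx
    return total

# Forward KL punishes a map that puts ~0 probability where the territory
# has mass, so the wide (mode-covering) fit wins:
print(kl(p, wide), "<", kl(p, narrow))
# Reverse KL punishes a map that puts mass where the territory has ~0
# probability, so the narrow (mode-seeking) fit wins:
print(kl(narrow, p), "<", kl(wide, p))
```

Fitting the map to the territory (forward KL, left figure) spreads q* over everything the territory can do; fitting the territory to the map (reverse KL, right figure) lets q* commit to a single mode.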