KL Divergence Made Easy

Added 2022-07-28

KL-divergence is just the expected additional surprise from the map-territory mismatch.

The cross entropy is defined as the expected surprise when we draw samples from $p(x)$ while modeling them with $q(x)$, where the surprise of an outcome $x$ under the model is $\log{\frac{1}{q(x)}}$. Our map is $q(x)$, while $p(x)$ is the territory.

$$H(p, q) = \sum_{x} p(x)\log{\frac{1}{q(x)}}$$

Now it should be intuitively clear that $H(p, q) \ge H(p, p)$, because an imperfect model $q(x)$ will (on average) surprise us more than the perfect model $p(x)$. Here $H(p, p)$ is just the entropy of $p$, and equality holds only when $q = p$.
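As a quick numerical check of this inequality, here is a minimal sketch for discrete distributions. The function name `cross_entropy` and the two example distributions are made up for illustration, and the surprise is measured in nats (natural log).

```python
import numpy as np

def cross_entropy(p, q):
    """Expected surprise (in nats) when samples come from p but we model them with q."""
    p = np.asarray(p, dtype=float)
    q = np.asarray(q, dtype=float)
    mask = p > 0  # outcomes that never occur under p contribute nothing
    return float(np.sum(p[mask] * np.log(1.0 / q[mask])))

# Illustrative numbers only: p is the territory, q is our imperfect map.
p = np.array([0.5, 0.25, 0.25])
q = np.array([0.8, 0.1, 0.1])

h_pq = cross_entropy(p, q)  # surprise when using the map q
h_pp = cross_entropy(p, p)  # surprise when using the perfect model (entropy of p)

print(f"H(p, q) = {h_pq:.4f}")
print(f"H(p, p) = {h_pp:.4f}")
```

Any other choice of `q` should give a value of `h_pq` that is at least `h_pp`, with equality only when `q` matches `p`.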

To measure the unnecessary surprise from approximating $p(x)$ by $q(x)$, we define

$$D_{\mathrm{KL}}(p \| q) = H(p, q) - H(p, p)$$

This is KL-divergence! It is the average additional surprise we incur because our map only approximates the territory.
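For readers who have seen KL-divergence written as a single sum, the two forms agree. Substituting the cross-entropy definition into the difference and merging the logarithms gives:

$$
\begin{aligned}
D_{\mathrm{KL}}(p \| q) &= \sum_{x} p(x)\log{\frac{1}{q(x)}} - \sum_{x} p(x)\log{\frac{1}{p(x)}} \\
&= \sum_{x} p(x)\log{\frac{p(x)}{q(x)}}
\end{aligned}
$$

Each term weights the log-ratio $\log{\frac{p(x)}{q(x)}}$ by how often the territory actually produces $x$, which is why swapping $p$ and $q$ generally changes the value.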

Now it's time for an exercise. In the following figure, $q^{*}(x)$ is the Gaussian that minimizes either $D_{\mathrm{KL}}(p\|q)$ or $D_{\mathrm{KL}}(q\|p)$. Can you tell which is which?

Two graphs, one fits the map to the territory while the other fits the territory to the map.

Answer

Left is minimizing $D_{\mathrm{KL}}(p\|q)$, while the right is minimizing $D_{\mathrm{KL}}(q\|p)$.

Reason as follows:

If $p$ is the territory, then the left $q^{*}$ is a better map (of $p$) than the right $q^{*}$.
If $p$ is the map, then the territory $q^{*}$ on the right surprises us less than the territory on the left: on the left, $p$ will be very surprised by data in the middle, even though such data is likely according to the territory $q^{*}$.

On the left we fit the map to the territory; on the right we fit the territory to the map.
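The exercise can also be reproduced numerically. The sketch below is not the code behind the figure: it assumes a hypothetical bimodal $p$ (an equal mixture of two unit-variance Gaussians centered at $-3$ and $+3$) and finds $q^{*}$ by brute-force grid search over the mean and standard deviation, approximating each divergence with a Riemann sum. The helper names `log_gaussian`, `kl`, and `fit` are made up for this example.

```python
import numpy as np

x = np.linspace(-12.0, 12.0, 2401)  # integration grid
dx = x[1] - x[0]

def log_gaussian(x, mu, sigma):
    """Log-density of a Gaussian with mean mu and standard deviation sigma."""
    return -0.5 * ((x - mu) / sigma) ** 2 - np.log(sigma * np.sqrt(2.0 * np.pi))

# Assumed territory: equal mixture of N(-3, 1) and N(+3, 1), kept in log space.
log_p = np.log(0.5) + np.logaddexp(log_gaussian(x, -3.0, 1.0),
                                   log_gaussian(x, 3.0, 1.0))

def kl(log_a, log_b):
    """Riemann-sum approximation of D_KL(a || b) from log-densities on the grid."""
    return float(np.sum(np.exp(log_a) * (log_a - log_b)) * dx)

def fit(direction):
    """Grid search for the Gaussian q minimizing the chosen divergence.

    direction="forward" minimizes D_KL(p || q); anything else minimizes D_KL(q || p).
    Returns (divergence, mu, sigma) for the best grid point.
    """
    best = None
    for mu in np.linspace(-4.0, 4.0, 81):
        for sigma in np.linspace(0.5, 4.0, 71):
            log_q = log_gaussian(x, mu, sigma)
            d = kl(log_p, log_q) if direction == "forward" else kl(log_q, log_p)
            if best is None or d < best[0]:
                best = (d, mu, sigma)
    return best

fwd = fit("forward")  # map fitted to territory: D_KL(p || q)
rev = fit("reverse")  # territory fitted to map: D_KL(q || p)

print(f"min D_KL(p||q): mu = {fwd[1]:.2f}, sigma = {fwd[2]:.2f}")
print(f"min D_KL(q||p): mu = {rev[1]:.2f}, sigma = {rev[2]:.2f}")
```

Under these assumptions, minimizing $D_{\mathrm{KL}}(p\|q)$ should give a wide Gaussian centered near $0$ that covers both modes (it matches the mean and variance of $p$), while minimizing $D_{\mathrm{KL}}(q\|p)$ should give a narrow Gaussian sitting on one of the two modes. Which mode it picks is arbitrary, since the two are symmetric.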