Vector calculus for ML

Added 2021-10-10, Modified 2022-03-12

Rules for differentiating neural networks


Let $\theta \in \mathbf{R}^n$ denote the $n$ parameters of our model, and let $L : \mathbf{R}^n \to \mathbf{R}$ denote the loss function. We want to find the gradient $\nabla L(\theta)$ to train our model.

First we need some simple rules

Chain rule: Let $f : \mathbf{R}^n \to \mathbf{R}^m$ and $g : \mathbf{R}^m \to \mathbf{R}$. Then

$$\frac{\partial}{\partial x_i} g(f(x)) = \sum_{k=1}^m \frac{\partial g}{\partial f_k} \frac{\partial f_k}{\partial x_i}$$

This looks a lot like the single-variable chain rule, except for that summation. If you use the Einstein summation convention then the summation becomes implicit, since we have a repeated index on the bottom and top

$$\sum_{k=1}^m \frac{\partial g}{\partial f_k} \frac{\partial f_k}{\partial x_i} = \frac{\partial g}{\partial f_k} \frac{\partial f_k}{\partial x_i}$$

We won't be using this since it takes some time to get used to, but feel free to play around with it on your own!
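As a concrete example of the chain rule, take $f(x) = (x_1 x_2,\ x_1 + x_2)$ and $g(u) = u_1 + u_2^2$. The chain rule gives

$$\frac{\partial}{\partial x_1} g(f(x)) = \frac{\partial g}{\partial f_1}\frac{\partial f_1}{\partial x_1} + \frac{\partial g}{\partial f_2}\frac{\partial f_2}{\partial x_1} = 1 \cdot x_2 + 2(x_1 + x_2) \cdot 1$$

which is exactly what you get by expanding $g(f(x)) = x_1 x_2 + (x_1 + x_2)^2$ and differentiating directly.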

We can also write this in matrix form, which is more useful when coding

$$\begin{aligned} \frac{\partial}{\partial x_i} g(f(x)) &= \sum_{k=1}^m \frac{\partial g}{\partial f_k} \frac{\partial f_k}{\partial x_i} \\ &= \sum_{k=1}^m (\nabla g)_k \, (\nabla f_k)_i \end{aligned}$$

Where $\nabla g$ denotes the gradient of $g$ evaluated at $f(x)$, and $\nabla f_k$ denotes the gradient of the $k$-th component of $f$. Collecting all $n$ components, the full gradient $\nabla_x\, g(f(x))$ can be written with matrix multiplication

$$\nabla_x\, g(f(x)) = \begin{bmatrix} \nabla f_1 & \cdots & \nabla f_m \end{bmatrix} \nabla g$$

The matrix with the $\nabla f_k$ as its columns is the transpose of the Jacobian (the Jacobian of $f$ stacks the $\nabla f_k^T$ as rows)! If we let $\mathbf{J}$ denote the Jacobian of $f$ at $x$ we have

$$\nabla_x\, g(f(x)) = \mathbf{J}^T \nabla g$$

Intuitively this makes sense: the gradient of the composition is a weighted sum of the gradients $\nabla f_k$, where each weight $\partial g / \partial f_k$ measures how much $g$ cares about the $k$-th output of $f$.
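Here's a quick numpy sanity check of this formula, with a small made-up $f$ and $g$ and a finite-difference estimate of the true gradient (the functions and numbers are just for illustration):

import numpy as np

def f(x):                       # f : R^3 -> R^2
    return np.array([x[0] * x[1], x[1] + x[2]])

def g(u):                       # g : R^2 -> R
    return u[0] + u[1]**2

def jacobian_f(x):              # rows are the gradients of the components of f
    return np.array([[x[1], x[0], 0.0],
                     [0.0,  1.0, 1.0]])

def grad_g(u):                  # gradient of g with respect to its input
    return np.array([1.0, 2*u[1]])

x = np.array([1.0, 2.0, 3.0])
analytic = jacobian_f(x).T @ grad_g(f(x))    # J^T grad g

eps = 1e-6                                   # central finite differences
numeric = np.array([
    (g(f(x + eps*np.eye(3)[i])) - g(f(x - eps*np.eye(3)[i]))) / (2*eps)
    for i in range(3)
])

print(analytic, numeric)                     # the two should agree to ~6 decimals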

Product rule: Let $f : \mathbf{R}^n \to \mathbf{R}^m$ and $g : \mathbf{R}^n \to \mathbf{R}^m$. Then for the dot product $f(x) \cdot g(x)$ we have

$$\frac{\partial}{\partial x_i}\, f(x) \cdot g(x) = \frac{\partial f}{\partial x_i} \cdot g(x) + f(x) \cdot \frac{\partial g}{\partial x_i}$$

where $\frac{\partial f}{\partial x_i}$ is the vector of componentwise derivatives $\partial f_k / \partial x_i$.
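For example, with $f(x) = x$ and $g(x) = c$ for a constant vector $c$, the second term vanishes and we get $\frac{\partial}{\partial x_i}\, c \cdot x = e_i \cdot c = c_i$, where $e_i$ is the $i$-th standard basis vector.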

Basic MNIST

We can now take the gradient of a simple model. Let $x \in \mathbf{R}^{784}$ be the pixels in a picture of a handwritten digit, and let $y \in \mathbf{R}^{10}$ be a one-hot label, e.g. if the digit was a one we would have $y = [0, 1, \dots, 0]$.

Our model will be a single matrix multiplication $Wx$, which can be thought of as weighing how much every pixel in the image contributes to each digit. Since we want the output of our model to be a probability distribution we define softmax (with $S$ as a shorthand)

$$\text{softmax}(x) = S(x) = \frac{\exp(x)}{\sum_i \exp(x_i)}$$

The job of softmax is to take a vector in $\mathbf{R}^n$ and output a probability distribution (positive real numbers that sum to one)
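Here's a minimal numpy sketch of that definition (the comment notes the usual stability trick, which the formula above leaves out):

import numpy as np

def softmax(x):
    e = np.exp(x)     # in practice use np.exp(x - x.max()) to avoid overflow
    return e / e.sum()

z = np.array([1.0, 2.0, 3.0])
print(softmax(z), softmax(z).sum())    # positive entries that sum to 1.0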

Now we can define our loss. I'll use the squared distance from our predicted probability distribution $\hat y$ to the true distribution $y$

$$L(\hat y) = ||\hat y - y||^2$$

All in all, our model is $\hat y = S(Wx)$ with loss $L(\hat y) = ||\hat y - y||^2$.
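Putting the pieces together, a single forward pass looks something like this (random numbers stand in for a real MNIST image and label, and the weight scale is arbitrary):

import numpy as np

rng = np.random.default_rng(0)
W = rng.normal(size=(10, 784)) * 0.01    # one row of weights per digit
x = rng.random(784)                      # stand-in for a flattened 28x28 image
y = np.eye(10)[3]                        # one-hot label for the digit 3

z = W @ x
y_hat = np.exp(z) / np.exp(z).sum()      # softmax
loss = ((y_hat - y)**2).sum()
print(y_hat.sum(), loss)                 # y_hat sums to 1, loss is a single number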

To train the model we need the gradient, so let's compute it using the chain rule.

$$\frac{\partial L}{\partial W_{ij}} = \sum_k \frac{\partial L}{\partial \hat y_k}\, \frac{\partial \hat y_k}{\partial (Wx)_i}\, \frac{\partial (Wx)_i}{\partial W_{ij}}$$

The last factor is just $x_j$, since $(Wx)_i = \sum_l W_{il} x_l$.

Let's find the other two parts. First, differentiate the loss

$$\begin{aligned} \frac{\partial L}{\partial \hat y_i} &= \frac{\partial}{\partial \hat y_i} \sum_{j=1}^{10} (\hat y_j - y_j)^2 \\ &= \frac{\partial}{\partial \hat y_i} (\hat y_i - y_i)^2 \\ &= 2(\hat y_i - y_i) \end{aligned}$$

(This can also be done using the product rule for dot products)
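Concretely, writing $L = (\hat y - y) \cdot (\hat y - y)$ and applying the product rule with both factors equal to $\hat y - y$:

$$\frac{\partial L}{\partial \hat y_i} = e_i \cdot (\hat y - y) + (\hat y - y) \cdot e_i = 2(\hat y_i - y_i)$$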

Now let's differentiate softmax. Since softmax is a function from $\mathbf{R}^n$ to $\mathbf{R}^n$, we need to pick which output $S_i$ and which input $x_j$ we're differentiating with respect to. First suppose $i \ne j$; letting $\sigma = \sum_k \exp(x_k)$ be shorthand, we get

$$\begin{aligned} \frac{\partial S_i}{\partial x_j} &= \frac{\partial}{\partial x_j} \frac{e^{x_i}}{\sigma} \\ &= \frac{\sigma \frac{\partial}{\partial x_j} e^{x_i} - e^{x_i} \frac{\partial \sigma}{\partial x_j}}{\sigma^2} \\ &= \frac{-e^{x_i} e^{x_j}}{\sigma^2} \\ &= -S_i S_j \end{aligned}$$

Now let $i = j$

$$\begin{aligned} \frac{\partial S_i}{\partial x_i} &= \frac{\sigma \frac{\partial}{\partial x_i} e^{x_i} - e^{x_i} \frac{\partial \sigma}{\partial x_i}}{\sigma^2} \\ &= \frac{\sigma e^{x_i} - e^{x_i} e^{x_i}}{\sigma^2} \\ &= S_i (1 - S_i) \end{aligned}$$
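Both cases fit the single formula $\frac{\partial S_i}{\partial x_j} = S_i (\delta_{ij} - S_j)$, where $\delta_{ij}$ is $1$ if $i = j$ and $0$ otherwise. As a quick numpy sketch (just an illustration), the whole softmax Jacobian is

import numpy as np

def softmax(x):
    e = np.exp(x)
    return e / e.sum()

S = softmax(np.array([0.5, -1.0, 2.0]))
J = np.diag(S) - np.outer(S, S)    # J[i, j] = S[i]*(1 if i == j else 0) - S[i]*S[j]

print(J)
print(J.sum(axis=0))               # each column sums to 0: nudging x can't change sum(S) = 1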

Without softmax

Let $\hat y = Wx$ and $L = ||\hat y - y||^2$. Using the chain rule we get

$$\frac{\partial L}{\partial W_{ij}} = \sum_k \frac{\partial L}{\partial \hat y_k} \frac{\partial \hat y_k}{\partial W_{ij}}$$

Since $\partial \hat y_k / \partial W_{ij} = 0$ for $i \ne k$, most terms vanish and we get

$$\frac{\partial L}{\partial W_{ij}} = \frac{\partial L}{\partial \hat y_i} \frac{\partial \hat y_i}{\partial W_{ij}} = 2(\hat y_i - y_i) \cdot \frac{\partial \hat y_i}{\partial W_{ij}} = 2(\hat y_i - y_i) \cdot x_j$$

When translating this into numpy we can use np.outer(a,b)[i,j] = a[i]*b[j]

import numpy as np

W = np.ones((2, 3))          # a tiny stand-in for the 10x784 weight matrix
x = np.array([-1, 1, 1])     # input vector
y = np.array([1, 2])         # target vector

for _ in range(10):
    y_hat = W @ x                          # forward pass: y_hat = Wx

    loss = ((y_hat - y)**2).sum()          # L = ||y_hat - y||^2
    print(loss)
    grad = np.outer(2*(y_hat - y), x)      # dL/dW[i, j] = 2*(y_hat - y)[i] * x[j]

    W -= 0.1 * grad                        # gradient descent step

We get

1.0
0.16000000000000011
...
6.871947673606854e-08

It's learning!