Let $\theta \in \mathbf{R}^n$ denote the $n$ parameters of our model, and let $L : \mathbf{R}^n \to \mathbf{R}$ denote the loss function. We want to find the gradient $\nabla L(\theta)$ to train our model.

First we need some simple rules.

**Chain rule:**
Let $f : \mathbf{R}^n \to \mathbf{R}^m$ and $g : \mathbf{R}^m \to \mathbf{R}$. Then
$$\frac{\partial}{\partial x_i} g(f(x)) = \sum_{k=1}^m \frac{\partial g}{\partial f_k} \frac{\partial f_k}{\partial x_i}$$
This looks a lot like the single-variable chain rule, except for the summation.
If you use the Einstein summation convention, the summation becomes implicit, since the index $k$ is repeated on the top and bottom.

We won't be using this since it takes some time to get used to, but feel free to play around with it on your own!
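As a quick numerical sanity check, we can compare the chain-rule sum against a finite-difference approximation. The functions $f$ and $g$ below are arbitrary examples, not anything from the model:

```python
import numpy as np

# Arbitrary example functions: f : R^3 -> R^2, g : R^2 -> R
f = lambda x: np.array([x[0] * x[1], x[1] + x[2] ** 2])
g = lambda u: u[0] ** 2 + 3 * u[1]

x = np.array([1.0, 2.0, 0.5])

# Chain rule for d/dx_0: sum_k (dg/df_k) * (df_k/dx_0)
dg_df = np.array([2 * f(x)[0], 3.0])  # [dg/df_1, dg/df_2] at f(x)
df_dx0 = np.array([x[1], 0.0])        # [df_1/dx_0, df_2/dx_0] at x
chain = dg_df @ df_dx0

# Finite-difference approximation of the same derivative
h = 1e-6
x_plus = x.copy()
x_plus[0] += h
numeric = (g(f(x_plus)) - g(f(x))) / h

print(chain)    # 8.0
print(numeric)  # approximately 8.0
```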

We can also write this in matrix form, which is more useful when coding.

Where $\nabla g$ denotes $\nabla g(f(x))$, the full gradient $\nabla_x\, g(f(x))$ can be written with matrix multiplication:

$$\nabla_x\, g(f(x)) = \begin{bmatrix} \frac{\partial f_1}{\partial x_1} & \cdots & \frac{\partial f_m}{\partial x_1} \\ \vdots & \ddots & \vdots \\ \frac{\partial f_1}{\partial x_n} & \cdots & \frac{\partial f_m}{\partial x_n} \end{bmatrix} \nabla g$$

This matrix is the transpose of the Jacobian! If we let $\mathbf{J}$ denote the Jacobian of $f$ at $x$, with entries $\mathbf{J}_{ki} = \partial f_k / \partial x_i$, we have

$$\nabla_x\, g(f(x)) = \mathbf{J}^\top \nabla g$$

Intuitively this makes sense: row $i$ of the product is exactly the chain-rule sum over $k$, collecting how much $x_i$ influences each intermediate $f_k$, weighted by how much $f_k$ influences $g$.
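Here is a small check (again with arbitrary example functions) that the matrix form of the chain rule, $\mathbf{J}^\top \nabla g$, matches a numerical gradient of the composition:

```python
import numpy as np

# Arbitrary example functions: f : R^3 -> R^2, g : R^2 -> R
f = lambda x: np.array([x[0] * x[1], np.sin(x[2])])
g = lambda u: u[0] ** 2 + u[1]

x = np.array([1.0, 2.0, 0.5])

# Jacobian of f at x: J[k, i] = df_k / dx_i
J = np.array([
    [x[1], x[0], 0.0],
    [0.0, 0.0, np.cos(x[2])],
])
grad_g = np.array([2 * f(x)[0], 1.0])  # gradient of g at f(x)
grad = J.T @ grad_g                    # chain rule in matrix form

# Numerical gradient of the composition for comparison
h = 1e-6
numeric = np.array([(g(f(x + h * e)) - g(f(x))) / h for e in np.eye(3)])

print(grad)     # 8, 4, and cos(0.5)
print(numeric)  # approximately the same
```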

**Product rule:**
Let $f : \mathbf{R}^n \to \mathbf{R}^m$ and $g : \mathbf{R}^n \to \mathbf{R}^m$, and consider the dot product $h(x) = f(x) \cdot g(x)$. Then

$$\frac{\partial h}{\partial x_i} = \sum_{k=1}^m \frac{\partial f_k}{\partial x_i} g_k + f_k \frac{\partial g_k}{\partial x_i}$$

or, in matrix form, $\nabla h = \mathbf{J}_f^\top g + \mathbf{J}_g^\top f$.
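A quick numerical check of the product rule for dot products, with arbitrary example functions:

```python
import numpy as np

# Arbitrary example functions f, g : R^2 -> R^2, with h(x) = f(x) . g(x)
f = lambda x: np.array([x[0] ** 2, x[0] * x[1]])
g = lambda x: np.array([x[1], x[0] + x[1]])
h = lambda x: f(x) @ g(x)

x = np.array([1.5, -0.5])

Jf = np.array([[2 * x[0], 0.0], [x[1], x[0]]])  # Jacobian of f at x
Jg = np.array([[0.0, 1.0], [1.0, 1.0]])         # Jacobian of g at x
grad = Jf.T @ g(x) + Jg.T @ f(x)                # product rule for dot products

# Numerical gradient for comparison
step = 1e-6
numeric = np.array([(h(x + step * e) - h(x)) / step for e in np.eye(2)])

print(grad)  # -2.75 and 3.0
```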

## Basic MNIST

We can now take the gradient of a simple model. Let $x \in \mathbf{R}^{784}$ be the pixels in a picture of a handwritten digit, and let $y \in \mathbf{R}^{10}$ be a one-hot label; e.g., if the digit were a one we would have $y = [0, 1, 0, \dots, 0]$.

Our model will be a single matrix multiplication $Wx$, which can be thought of as weighing how much every pixel in the image contributes to each digit. Since we want the outputs of our model to be a probability distribution, we define softmax (with $S$ as a shorthand):

$$S_i(x) = \frac{\exp(x_i)}{\sum_{k=1}^n \exp(x_k)}$$

The job of softmax is to take a vector in $\mathbf{R}^n$ and output a probability distribution (positive real numbers that sum to one).
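For example (a minimal sketch; subtracting the max before exponentiating is a standard trick to avoid overflow, not something we need for the math):

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())  # subtract the max for numerical stability
    return e / e.sum()

p = softmax(np.array([2.0, -1.0, 0.5]))
print(p)        # all entries positive
print(p.sum())  # 1.0 (up to floating point)
```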

Next we can define our loss; I'll use the squared distance from our predicted probability distribution $\hat y$ to the true distribution $y$:

$$L(\hat y) = ||\hat y - y||^2 = \sum_i (\hat y_i - y_i)^2$$

In all, our model is $\hat y = S(Wx)$ with loss $L(\hat y) = ||\hat y - y||^2$.

To train the model we need the gradient, so let's compute it using the chain rule.

Let's find both parts. First, differentiate the loss:

$$\frac{\partial L}{\partial \hat y_i} = 2(\hat y_i - y_i), \qquad \text{i.e.} \quad \nabla L = 2(\hat y - y)$$

(This can also be done using the product rule for dot products)

Now let's differentiate softmax. Since softmax is a function from $\mathbf{R}^n$ to $\mathbf{R}^n$, we'll need to pick the output $S_i$ and the input $x_j$ we're differentiating with respect to. First suppose $i \ne j$; letting $\sigma = \sum_k \exp(x_k)$ be shorthand, we get

$$\frac{\partial S_i}{\partial x_j} = \frac{\partial}{\partial x_j} \frac{\exp(x_i)}{\sigma} = -\frac{\exp(x_i)\exp(x_j)}{\sigma^2} = -S_i S_j$$

Now let $i = j$:

$$\frac{\partial S_i}{\partial x_i} = \frac{\exp(x_i)\,\sigma - \exp(x_i)^2}{\sigma^2} = S_i - S_i^2 = S_i(1 - S_i)$$
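Both cases can be checked numerically. The two formulas combine neatly into $\mathrm{diag}(S) - S S^\top$, which the sketch below compares against finite differences:

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())  # subtract the max for numerical stability
    return e / e.sum()

x = np.array([0.3, -1.2, 0.8, 0.1])
S = softmax(x)

# Jacobian from the two cases: S_i (1 - S_i) on the diagonal, -S_i S_j off it
J = np.diag(S) - np.outer(S, S)

# Finite-difference Jacobian for comparison
h = 1e-6
J_num = np.array([(softmax(x + h * e) - softmax(x)) / h for e in np.eye(4)]).T

print(np.abs(J - J_num).max())  # tiny (just finite-difference error)
```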

## Without softmax

Let $\hat y = Wx$ and $L = ||\hat y - y||^2$. Using the chain rule we get

$$\frac{\partial L}{\partial W_{ij}} = \sum_k \frac{\partial L}{\partial \hat y_k} \frac{\partial \hat y_k}{\partial W_{ij}} = \sum_k 2(\hat y_k - y_k) \frac{\partial \hat y_k}{\partial W_{ij}}$$

Since $\hat y_k = \sum_j W_{kj} x_j$, we have $\partial \hat y_k/\partial W_{ij} = 0$ for $i \ne k$, so most terms vanish and we get

$$\frac{\partial L}{\partial W_{ij}} = 2(\hat y_i - y_i)\, x_j$$

When translating this into NumPy we can use `np.outer`, since `np.outer(a, b)[i, j] == a[i] * b[j]`.

```
import numpy as np

W = np.ones((2, 3))
x = np.array([-1, 1, 1])
y = np.array([1, 2])
for _ in range(10):
    y_hat = W @ x
    loss = ((y_hat - y) ** 2).sum()
    print(loss)
    grad = np.outer(2 * (y_hat - y), x)  # dL/dW[i,j] = 2 (y_hat - y)[i] x[j]
    W -= 0.1 * grad
```

We get

```
1.0
0.16000000000000011
...
6.871947673606854e-08
```

It's learning!
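The pieces above also cover the full model $\hat y = S(Wx)$: chaining the loss gradient through the softmax Jacobian gives the gradient with respect to $W$. Here is a sketch with random data in MNIST shapes (the finite-difference check at the end is just for verification, not part of training):

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())  # subtract the max for numerical stability
    return e / e.sum()

rng = np.random.default_rng(0)
W = 0.01 * rng.normal(size=(10, 784))
x = rng.normal(size=784)
y = np.zeros(10)
y[1] = 1.0  # one-hot label for the digit "one"

z = W @ x
y_hat = softmax(z)

# Chain rule through softmax: dL/dz = J_S^T (dL/dy_hat),
# where J_S = diag(S) - S S^T (symmetric, so the transpose is free)
J_S = np.diag(y_hat) - np.outer(y_hat, y_hat)
grad_z = J_S @ (2 * (y_hat - y))
grad_W = np.outer(grad_z, x)  # dz_i/dW_ij = x_j, same pattern as before

# Finite-difference check of one entry
loss = lambda W: ((softmax(W @ x) - y) ** 2).sum()
h = 1e-6
W2 = W.copy()
W2[3, 5] += h
print(grad_W[3, 5], (loss(W2) - loss(W)) / h)  # should agree closely
```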