Let $\theta \in \mathbf{R}^n$ denote the $n$ parameters of our model, and let $L : \mathbf{R}^n \to \mathbf{R}$ denote the loss function. We want to find the gradient $\nabla L(\theta)$ to train our model.

First we need some simple rules.

**Chain rule:**
Let $f : \mathbf{R}^n \to \mathbf{R}^m$ and $g : \mathbf{R}^m \to \mathbf{R}$. Then
$$\frac{\partial}{\partial x_i} g(f(x)) = \sum_{k=1}^m \frac{\partial g}{\partial f_k} \frac{\partial f_k}{\partial x_i}$$
This looks a lot like the single-variable chain rule, except for the summation.
If you use the Einstein summation convention, the summation becomes implicit, since the index $k$ is repeated on the top and bottom.

We won't be using this since it takes some time to get used to, but feel free to play around with it on your own!
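As a quick numerical sanity check, we can compare the chain-rule sum against a finite-difference approximation. The functions $f$ and $g$ below are arbitrary examples, not anything from the model:

```python
import numpy as np

# Arbitrary example functions: f : R^3 -> R^2, g : R^2 -> R
f = lambda x: np.array([x[0] * x[1], x[1] + x[2] ** 2])
g = lambda u: u[0] ** 2 + 3 * u[1]

x = np.array([1.0, 2.0, 0.5])

# Chain rule for d/dx_0: sum_k (dg/df_k) * (df_k/dx_0)
dg_df = np.array([2 * f(x)[0], 3.0])  # [dg/df_1, dg/df_2] at f(x)
df_dx0 = np.array([x[1], 0.0])        # [df_1/dx_0, df_2/dx_0] at x
chain = dg_df @ df_dx0

# Finite-difference approximation of the same derivative
h = 1e-6
x_plus = x.copy()
x_plus[0] += h
numeric = (g(f(x_plus)) - g(f(x))) / h

print(chain)    # 8.0
print(numeric)  # approximately 8.0
```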

We can also write this in matrix form, which is more useful when coding.

Where $\nabla g$ denotes $\nabla g(f(x))$, the full gradient $\nabla_x\, g(f(x))$ can be written with matrix multiplication:

$$\nabla_x\, g(f(x)) = \begin{bmatrix} \frac{\partial f_1}{\partial x_1} & \cdots & \frac{\partial f_m}{\partial x_1} \\ \vdots & \ddots & \vdots \\ \frac{\partial f_1}{\partial x_n} & \cdots & \frac{\partial f_m}{\partial x_n} \end{bmatrix} \nabla g$$

This matrix is the transpose of the Jacobian! If we let $\mathbf{J}$ denote the Jacobian of $f$ at $x$, with entries $\mathbf{J}_{ki} = \partial f_k / \partial x_i$, we have

$$\nabla_x\, g(f(x)) = \mathbf{J}^\top \nabla g$$

Intuitively this makes sense: row $i$ of the product is exactly the chain-rule sum over $k$, collecting how much $x_i$ influences each intermediate $f_k$, weighted by how much $f_k$ influences $g$.
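Here is a small check (again with arbitrary example functions) that the matrix form of the chain rule, $\mathbf{J}^\top \nabla g$, matches a numerical gradient of the composition:

```python
import numpy as np

# Arbitrary example functions: f : R^3 -> R^2, g : R^2 -> R
f = lambda x: np.array([x[0] * x[1], np.sin(x[2])])
g = lambda u: u[0] ** 2 + u[1]

x = np.array([1.0, 2.0, 0.5])

# Jacobian of f at x: J[k, i] = df_k / dx_i
J = np.array([
    [x[1], x[0], 0.0],
    [0.0, 0.0, np.cos(x[2])],
])
grad_g = np.array([2 * f(x)[0], 1.0])  # gradient of g at f(x)
grad = J.T @ grad_g                    # chain rule in matrix form

# Numerical gradient of the composition for comparison
h = 1e-6
numeric = np.array([(g(f(x + h * e)) - g(f(x))) / h for e in np.eye(3)])

print(grad)     # 8, 4, and cos(0.5)
print(numeric)  # approximately the same
```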

**Product rule:**
Let $f : \mathbf{R}^n \to \mathbf{R}^m$ and $g : \mathbf{R}^n \to \mathbf{R}^m$, and consider the dot product $h(x) = f(x) \cdot g(x)$. Then

$$\frac{\partial h}{\partial x_i} = \sum_{k=1}^m \frac{\partial f_k}{\partial x_i} g_k + f_k \frac{\partial g_k}{\partial x_i}$$

or, in matrix form, $\nabla h = \mathbf{J}_f^\top g + \mathbf{J}_g^\top f$.
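A quick numerical check of the product rule for dot products, with arbitrary example functions:

```python
import numpy as np

# Arbitrary example functions f, g : R^2 -> R^2, with h(x) = f(x) . g(x)
f = lambda x: np.array([x[0] ** 2, x[0] * x[1]])
g = lambda x: np.array([x[1], x[0] + x[1]])
h = lambda x: f(x) @ g(x)

x = np.array([1.5, -0.5])

Jf = np.array([[2 * x[0], 0.0], [x[1], x[0]]])  # Jacobian of f at x
Jg = np.array([[0.0, 1.0], [1.0, 1.0]])         # Jacobian of g at x
grad = Jf.T @ g(x) + Jg.T @ f(x)                # product rule for dot products

# Numerical gradient for comparison
step = 1e-6
numeric = np.array([(h(x + step * e) - h(x)) / step for e in np.eye(2)])

print(grad)  # -2.75 and 3.0
```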

## Basic MNIST

We can now take the gradient of a simple model. Let $x \in \mathbf{R}^{784}$ be the pixels in a picture of a handwritten digit, and let $y \in \mathbf{R}^{10}$ be a one-hot label; e.g., if the digit were a one we would have $y = [0, 1, 0, \dots, 0]$.

Our model will be a single matrix multiplication $Wx$, which can be thought of as weighing how much every pixel in the image contributes to each digit. Since we want the outputs of our model to be a probability distribution, we define softmax (with $S$ as a shorthand):

$$S_i(x) = \frac{\exp(x_i)}{\sum_{k=1}^n \exp(x_k)}$$

The job of softmax is to take a vector in $\mathbf{R}^n$ and output a probability distribution (positive real numbers that sum to one).
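For example (a minimal sketch; subtracting the max before exponentiating is a standard trick to avoid overflow, not something we need for the math):

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())  # subtract the max for numerical stability
    return e / e.sum()

p = softmax(np.array([2.0, -1.0, 0.5]))
print(p)        # all entries positive
print(p.sum())  # 1.0 (up to floating point)
```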

Next we can define our loss; I'll use the squared distance from our predicted probability distribution $\hat y$ to the true distribution $y$:

$$L(\hat y) = ||\hat y - y||^2 = \sum_i (\hat y_i - y_i)^2$$

In all, our model is $\hat y = S(Wx)$ with loss $L(\hat y) = ||\hat y - y||^2$.

To train the model we need the gradient, so let's compute it using the chain rule.

Let's find both parts. First, differentiate the loss:

$$\frac{\partial L}{\partial \hat y_i} = 2(\hat y_i - y_i), \qquad \text{i.e.} \quad \nabla L = 2(\hat y - y)$$

(This can also be done using the product rule for dot products)

Now let's differentiate softmax. Since softmax is a function from $\mathbf{R}^n$ to $\mathbf{R}^n$, we'll need to pick the output $S_i$ and the input $x_j$ we're differentiating with respect to. First suppose $i \ne j$; letting $\sigma = \sum_k \exp(x_k)$ be shorthand, we get

$$\frac{\partial S_i}{\partial x_j} = \frac{\partial}{\partial x_j} \frac{\exp(x_i)}{\sigma} = -\frac{\exp(x_i)\exp(x_j)}{\sigma^2} = -S_i S_j$$

Now let $i = j$:

$$\frac{\partial S_i}{\partial x_i} = \frac{\exp(x_i)\,\sigma - \exp(x_i)^2}{\sigma^2} = S_i - S_i^2 = S_i(1 - S_i)$$
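Both cases can be checked numerically. The two formulas combine neatly into $\mathrm{diag}(S) - S S^\top$, which the sketch below compares against finite differences:

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())  # subtract the max for numerical stability
    return e / e.sum()

x = np.array([0.3, -1.2, 0.8, 0.1])
S = softmax(x)

# Jacobian from the two cases: S_i (1 - S_i) on the diagonal, -S_i S_j off it
J = np.diag(S) - np.outer(S, S)

# Finite-difference Jacobian for comparison
h = 1e-6
J_num = np.array([(softmax(x + h * e) - softmax(x)) / h for e in np.eye(4)]).T

print(np.abs(J - J_num).max())  # tiny (just finite-difference error)
```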

## Without softmax

Let $\hat y = Wx$ and $L = ||\hat y - y||^2$. Using the chain rule we get

$$\frac{\partial L}{\partial W_{ij}} = \sum_k \frac{\partial L}{\partial \hat y_k} \frac{\partial \hat y_k}{\partial W_{ij}} = \sum_k 2(\hat y_k - y_k) \frac{\partial \hat y_k}{\partial W_{ij}}$$

Since $\hat y_k = \sum_j W_{kj} x_j$, we have $\partial \hat y_k/\partial W_{ij} = 0$ for $i \ne k$, so most terms vanish and we get

$$\frac{\partial L}{\partial W_{ij}} = 2(\hat y_i - y_i)\, x_j$$

When translating this into NumPy we can use `np.outer`, since `np.outer(a, b)[i, j] == a[i] * b[j]`.

```
import numpy as np

W = np.ones((2, 3))
x = np.array([-1, 1, 1])
y = np.array([1, 2])
for _ in range(10):
    y_hat = W @ x
    loss = ((y_hat - y) ** 2).sum()
    print(loss)
    grad = np.outer(2 * (y_hat - y), x)  # dL/dW[i,j] = 2 (y_hat - y)[i] x[j]
    W -= 0.1 * grad
```

We get

```
1.0
0.16000000000000011
...
6.871947673606854e-08
```

It's learning!
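The pieces above also cover the full model $\hat y = S(Wx)$: chaining the loss gradient through the softmax Jacobian gives the gradient with respect to $W$. Here is a sketch with random data in MNIST shapes (the finite-difference check at the end is just for verification, not part of training):

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())  # subtract the max for numerical stability
    return e / e.sum()

rng = np.random.default_rng(0)
W = 0.01 * rng.normal(size=(10, 784))
x = rng.normal(size=784)
y = np.zeros(10)
y[1] = 1.0  # one-hot label for the digit "one"

z = W @ x
y_hat = softmax(z)

# Chain rule through softmax: dL/dz = J_S^T (dL/dy_hat),
# where J_S = diag(S) - S S^T (symmetric, so the transpose is free)
J_S = np.diag(y_hat) - np.outer(y_hat, y_hat)
grad_z = J_S @ (2 * (y_hat - y))
grad_W = np.outer(grad_z, x)  # dz_i/dW_ij = x_j, same pattern as before

# Finite-difference check of one entry
loss = lambda W: ((softmax(W @ x) - y) ** 2).sum()
h = 1e-6
W2 = W.copy()
W2[3, 5] += h
print(grad_W[3, 5], (loss(W2) - loss(W)) / h)  # should agree closely
```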