Let $\theta$ denote the parameters of our model, and let $L$ denote the loss function.
We want to find the gradient $\nabla_\theta L$ to train our model.
First we need some simple rules. If $L$ depends on $x$ only through $y$, the multivariable chain rule says

$$\frac{\partial L}{\partial x_i} = \sum_j \frac{\partial L}{\partial y_j}\,\frac{\partial y_j}{\partial x_i}$$

This looks a lot like the single variable chain rule, except for that summation.
If you use the Einstein summation convention, the summation becomes implicit, since the repeated index $j$ appears once on the bottom and once on the top:

$$\frac{\partial L}{\partial x_i} = \frac{\partial L}{\partial y_j}\,\frac{\partial y_j}{\partial x_i}$$
We won't be using this since it takes some time to get used to, but feel free to play around with it on your own!
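As an aside, numpy's einsum is named after exactly this convention. Here is a small sketch (the numbers are made up for illustration) showing the summed chain rule as a one-liner:

import numpy as np

dL_dy = np.array([1.0, 2.0, 3.0])            # dL/dy_j for some imagined L and y
dy_dx = np.array([[1.0, 0.0],                # dy_j/dx_i: 3 outputs, 2 inputs
                  [2.0, 1.0],
                  [0.0, 3.0]])

# 'j,ji->i' repeats the index j, so einsum sums over it, just like the convention
dL_dx = np.einsum('j,ji->i', dL_dy, dy_dx)
print(dL_dx)                                 # equivalently dy_dx.T @ dL_dy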
We can also write this in matrix form, which is more useful when coding:

$$\nabla_x L = \left(\frac{\partial y}{\partial x}\right)^{\top} \nabla_y L$$

where $\frac{\partial y}{\partial x}$ denotes the matrix with entries $\left(\frac{\partial y}{\partial x}\right)_{ij} = \frac{\partial y_i}{\partial x_j}$. The full gradient can then be written with a single matrix multiplication instead of an explicit sum.
This matrix is the Jacobian! If we let $J_f(x)$ denote the Jacobian of $f$ at $x$, then for $y = f(x)$ we have

$$\nabla_x L = J_f(x)^{\top}\, \nabla_y L$$
Intuitively this makes sense: a small change in $x_i$ nudges every output $y_j$, each of those nudges changes $L$, and the total effect on $L$ is the sum of all of these contributions.
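To make the matrix form concrete, here is a small numerical sketch (not part of the original derivation; the map and the numbers are made up). For a linear $f(x) = Ax$ the Jacobian is just $A$, so $\nabla_x L = A^\top \nabla_y L$, which we can compare against finite differences:

import numpy as np

A = np.array([[1.0, 2.0],
              [3.0, 4.0],
              [5.0, 6.0]])        # f: R^2 -> R^3, f(x) = A x, so J_f = A
x = np.array([0.5, -1.0])

def L(x):
    y = A @ x
    return np.sum(y ** 2)          # a simple loss of y, with dL/dy = 2y

grad_x = A.T @ (2 * (A @ x))       # chain rule in matrix form: J_f^T (dL/dy)

eps = 1e-6
fd = np.array([(L(x + eps * np.eye(2)[i]) - L(x)) / eps for i in range(2)])
print(grad_x, fd)                  # the two should agree to around 1e-5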
We can now take the gradient of a simple model. Let $x \in \mathbb{R}^n$ be the pixels in a picture of a handwritten digit, and let $y$ be a one-hot label, e.g. if the digit was a one we would have $y = (0, 1, 0, \dots, 0)$.
Our model will be a single matrix multiplication, $Wx$, which can be thought of as weighing how much every pixel in the image contributes to each digit.
Since we want the outputs of our model to be a probability distribution, we define softmax (with $S$ as a shorthand):

$$S(z)_i = \operatorname{softmax}(z)_i = \frac{e^{z_i}}{\sum_j e^{z_j}}$$

The job of softmax is to take a vector in $\mathbb{R}^n$ and output a probability distribution (positive real numbers that sum to one).
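As a quick sanity check, this is one way the definition might look in numpy (subtracting the max is an extra numerical-stability trick, not part of the definition; it doesn't change the result since softmax is shift-invariant):

import numpy as np

def softmax(z):
    e = np.exp(z - z.max())     # shift for numerical stability
    return e / e.sum()

p = softmax(np.array([1.0, 2.0, 3.0]))
print(p, p.sum())               # positive entries that sum to 1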
We can now define our loss. I'll use the squared distance from our predicted probability distribution to the true distribution:

$$L = \lVert \hat{y} - y \rVert^2 = \sum_i (\hat{y}_i - y_i)^2$$

In all, our model is $\hat{y} = S(Wx)$ with loss $L = \lVert \hat{y} - y \rVert^2$.
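Putting the pieces together, a forward pass might look like this (the sizes and data here are made up purely for illustration):

import numpy as np

W = np.zeros((3, 4))                 # 3 digits, 4 "pixels" (tiny for illustration)
x = np.array([0.2, 0.0, 0.9, 0.5])   # pixel values
y = np.array([0.0, 1.0, 0.0])        # one-hot label

z = W @ x                            # logits
y_hat = np.exp(z) / np.exp(z).sum()  # softmax: predicted distribution
loss = np.sum((y_hat - y) ** 2)      # squared distance to the true distribution
print(y_hat, loss)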
To train the model we need the gradient of $L$ with respect to $W$, so let's compute it using the chain rule. Writing $z = Wx$ for the logits, so that $\hat{y} = S(z)$, the chain rule gives

$$\frac{\partial L}{\partial z_j} = \sum_i \frac{\partial L}{\partial \hat{y}_i}\,\frac{\partial \hat{y}_i}{\partial z_j}$$

Let's find both parts. First, differentiate the loss:

$$\frac{\partial L}{\partial \hat{y}_i} = \frac{\partial}{\partial \hat{y}_i} \sum_k (\hat{y}_k - y_k)^2 = 2(\hat{y}_i - y_i), \qquad \text{i.e.} \qquad \nabla_{\hat{y}} L = 2(\hat{y} - y)$$
(This can also be done using the product rule for dot products)
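Spelled out, that alternative route would look something like this (a sketch of the dot-product product rule, not text from the original):

$$L = (\hat{y} - y)\cdot(\hat{y} - y)
\quad\Longrightarrow\quad
dL = d\hat{y}\cdot(\hat{y} - y) + (\hat{y} - y)\cdot d\hat{y} = 2(\hat{y} - y)\cdot d\hat{y}$$

and reading off the coefficient of $d\hat{y}$ again gives $\nabla_{\hat{y}} L = 2(\hat{y} - y)$.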
Now let's differentiate softmax. Since softmax is a function from $\mathbb{R}^n$ to $\mathbb{R}^n$, we'll need to pick the output $i$ and the input $j$ we're differentiating with respect to. First suppose $i \neq j$; letting $\Sigma = \sum_k e^{z_k}$ be shorthand, we get

$$\frac{\partial S(z)_i}{\partial z_j} = \frac{\partial}{\partial z_j}\,\frac{e^{z_i}}{\Sigma} = -\frac{e^{z_i} e^{z_j}}{\Sigma^2} = -S(z)_i\, S(z)_j$$

Now let $i = j$, and we get (using the chain rule)

$$\frac{\partial S(z)_i}{\partial z_i} = \frac{e^{z_i}}{\Sigma} - \frac{e^{z_i} e^{z_i}}{\Sigma^2} = S(z)_i\,\bigl(1 - S(z)_i\bigr)$$
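Both cases can be packed into one formula, $\frac{\partial S(z)_i}{\partial z_j} = S(z)_i(\delta_{ij} - S(z)_j)$, which translates into a small numpy sketch (with the same softmax helper as before) that we can check against finite differences:

import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def softmax_jacobian(z):
    # J[i, j] = S_i * (delta_ij - S_j): the i = j and i != j cases in one matrix
    s = softmax(z)
    return np.diag(s) - np.outer(s, s)

z = np.array([0.1, -0.3, 0.7])
J = softmax_jacobian(z)

eps = 1e-6
fd = np.array([(softmax(z + eps * np.eye(3)[j]) - softmax(z)) / eps for j in range(3)]).T
print(np.max(np.abs(J - fd)))       # should be tiny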
Finally we need the derivative of the logits $z = Wx$ with respect to the weights. Since $\frac{\partial z_k}{\partial W_{ij}} = \delta_{ki}\, x_j$, most terms in the chain-rule sum over $k$ vanish and we get

$$\frac{\partial L}{\partial W_{ij}} = \sum_k \frac{\partial L}{\partial z_k}\,\frac{\partial z_k}{\partial W_{ij}} = \frac{\partial L}{\partial z_i}\, x_j, \qquad \text{i.e.} \qquad \nabla_W L = (\nabla_z L)\, x^{\top}$$
When translating this into numpy we can use
np.outer(a,b)[i,j] = a[i]*b[j]
import numpy as np

# A toy example: a purely linear model (no softmax) trained by gradient descent
W = np.ones((2, 3))
x = np.array([-1.0, 1.0, 1.0])
y = np.array([1.0, 2.0])

for _ in range(10):
    y_hat = W @ x                        # forward pass
    loss = ((y_hat - y) ** 2).sum()      # squared-distance loss
    grad = np.outer(2 * (y_hat - y), x)  # dL/dW_ij = dL/dy_hat_i * x_j
    W -= 0.1 * grad                      # gradient descent step

print(loss)                              # the loss shrinks towards zero
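Finally, here is a sketch of how the same loop might look for the full model $\hat{y} = S(Wx)$ from above, chaining together all the pieces we derived (the sizes, data, and learning rate are made up for illustration):

import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

rng = np.random.default_rng(0)
W = rng.normal(scale=0.1, size=(3, 4))   # 3 digits, 4 "pixels"
x = rng.random(4)                        # a fake image
y = np.array([0.0, 1.0, 0.0])            # one-hot label

for _ in range(200):
    z = W @ x                                      # logits
    y_hat = softmax(z)                             # predicted distribution
    loss = np.sum((y_hat - y) ** 2)

    grad_yhat = 2 * (y_hat - y)                    # dL/dy_hat
    jac = np.diag(y_hat) - np.outer(y_hat, y_hat)  # dS_i/dz_j (symmetric)
    grad_z = jac @ grad_yhat                       # chain rule through softmax
    grad_W = np.outer(grad_z, x)                   # dL/dW_ij = dL/dz_i * x_j

    W -= 1.0 * grad_W                              # gradient descent step

print(y_hat, loss)                                 # y_hat should lean towards the one-hot target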