Let $\theta\in\mathbb{R}^n$ denote the $n$ parameters of our model, and let $L:\mathbb{R}^n\to\mathbb{R}$ denote the loss function.
We want to find the gradient $\nabla L(\theta)$ to train our model.
First, we need some simple rules.
Chain rule:
Let $f:\mathbb{R}^n\to\mathbb{R}^m$ and $g:\mathbb{R}^m\to\mathbb{R}$. Then

$$\frac{\partial}{\partial x_i}\, g(f(x)) = \sum_{k=1}^{m} \frac{\partial g}{\partial f_k}\,\frac{\partial f_k}{\partial x_i}$$
This looks a lot like the single variable chain rule, except for that summation.
If you use the Einstein summation convention, the summation becomes implicit since we have a repeated index on the bottom and top:

$$\sum_{k=1}^{m} \frac{\partial g}{\partial f_k}\,\frac{\partial f_k}{\partial x_i} = \frac{\partial g}{\partial f_k}\,\frac{\partial f_k}{\partial x_i}$$
We won't be using this since it takes some time to get used to, but feel free to play around with it on your own!
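If you do want to play with it, `np.einsum` uses exactly this convention. Here is a tiny sketch with made-up arrays standing in for the two tables of partial derivatives (the shapes and names are my own, just for illustration):

```python
import numpy as np

# Hypothetical partial derivatives for some f: R^3 -> R^4 and g: R^4 -> R.
dg_df = np.random.rand(4)     # dg/df_k, one entry per output f_k
df_dx = np.random.rand(4, 3)  # df_k/dx_i, one row per f_k

# The repeated index k is summed over implicitly, just like in the convention.
grad = np.einsum('k,ki->i', dg_df, df_dx)

# Same thing written as an explicit sum over k.
assert np.allclose(grad, sum(dg_df[k] * df_dx[k] for k in range(4)))
```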
We can also write this in matrix form, which is more useful when coding:

$$\frac{\partial}{\partial x_i}\, g(f(x)) = \sum_{k=1}^{m} \frac{\partial g}{\partial f_k}\,\frac{\partial f_k}{\partial x_i} = \nabla g \cdot \frac{\partial f}{\partial x_i}$$

Where $\nabla g$ denotes $\nabla g(f(x))$ and $\frac{\partial f}{\partial x_i}$ is the vector $\left(\frac{\partial f_1}{\partial x_i},\dots,\frac{\partial f_m}{\partial x_i}\right)$. Stacking these components over $i$, the full gradient of the composition can be written with matrix multiplication:

$$\nabla_x\, g(f(x)) = \begin{bmatrix} \nabla f_1 & \cdots & \nabla f_m \end{bmatrix} \nabla g = \sum_{k=1}^{m} \frac{\partial g}{\partial f_k}\,\nabla f_k$$

This matrix is the transpose of the Jacobian! If we let $J$ denote the Jacobian of $f$ at $x$ (the matrix whose rows are $\nabla f_k^T$) we have

$$\nabla_x\, g(f(x)) = J^T\, \nabla g(f(x))$$
Intuitively this makes sense: $\nabla f_k$ tells us how $f_k$ responds to a small change in $x$, $\partial g/\partial f_k$ tells us how much $g$ cares about that change, and the chain rule just adds up these contributions over all the outputs of $f$.
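As a sanity check, here is a small numpy sketch, with arbitrary made-up choices of $f$ and $g$, comparing $J^T\,\nabla g$ against a finite-difference approximation of the gradient:

```python
import numpy as np

def f(x):   # f: R^3 -> R^2, an arbitrary example
    return np.array([x[0] * x[1], np.sin(x[2])])

def g(u):   # g: R^2 -> R, an arbitrary example
    return u[0] ** 2 + 3 * u[1]

x = np.array([1.0, 2.0, 0.5])

# Hand-written Jacobian of f at x (rows are the gradients of f_1 and f_2).
J = np.array([[x[1], x[0], 0.0],
              [0.0, 0.0, np.cos(x[2])]])
# Gradient of g evaluated at f(x).
grad_g = np.array([2 * f(x)[0], 3.0])

chain_rule = J.T @ grad_g  # gradient of g(f(x)) with respect to x

# Finite-difference approximation of the same gradient.
eps = 1e-6
fd = np.array([(g(f(x + eps * e)) - g(f(x - eps * e))) / (2 * eps)
               for e in np.eye(3)])
print(np.allclose(chain_rule, fd))  # should print True
```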
Product rule:
Let $f:\mathbb{R}^n\to\mathbb{R}^m$ and $g:\mathbb{R}^n\to\mathbb{R}^m$. Then for the dot product $f(x)\cdot g(x)$ we have

$$\frac{\partial}{\partial x_i}\, \big(f(x)\cdot g(x)\big) = \frac{\partial f}{\partial x_i}\cdot g(x) + f(x)\cdot \frac{\partial g}{\partial x_i}$$
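Again a quick numerical sketch, with made-up $f$ and $g$, checking the dot-product rule against finite differences:

```python
import numpy as np

def f(x):  # f: R^2 -> R^2, arbitrary example
    return np.array([x[0] ** 2, x[0] * x[1]])

def g(x):  # g: R^2 -> R^2, arbitrary example
    return np.array([np.sin(x[1]), x[0] + x[1]])

x = np.array([0.7, -0.3])

# Hand-written partial derivatives of f and g with respect to x_0.
df_dx0 = np.array([2 * x[0], x[1]])
dg_dx0 = np.array([0.0, 1.0])

product_rule = df_dx0 @ g(x) + f(x) @ dg_dx0

# Finite-difference check of d/dx_0 [ f(x) . g(x) ].
eps = 1e-6
e0 = np.array([1.0, 0.0])
fd = (f(x + eps * e0) @ g(x + eps * e0) - f(x - eps * e0) @ g(x - eps * e0)) / (2 * eps)
print(np.allclose(product_rule, fd))  # should print True
```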
Basic MNIST
We can now take the gradient of a simple model. Let $x\in\mathbb{R}^{784}$ be the pixels in a picture of a handwritten digit, and let $y\in\mathbb{R}^{10}$ be a one-hot label; e.g. if the digit was a one we would have $y=[0,1,0,\dots,0]$.
Our model will be a single matrix multiplication $Wx$, with $W\in\mathbb{R}^{10\times 784}$, which can be thought of as weighting how much every pixel in the image contributes to each digit.
Since we want the outputs of our model to be a probability distribution, we define softmax (with $S$ as a shorthand):

$$\mathrm{softmax}(x) = S(x) = \frac{\exp(x)}{\sum_i \exp(x_i)}$$

The job of softmax is to take a vector in $\mathbb{R}^n$ and output a probability distribution (positive real numbers that sum to one).
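In numpy a softmax might look like the following minimal sketch (the max subtraction is an optional numerical-stability trick of mine, not something the formula above requires):

```python
import numpy as np

def softmax(x):
    # Subtracting the max doesn't change the result, because softmax is
    # invariant to adding a constant to every entry; it just avoids overflow.
    e = np.exp(x - x.max())
    return e / e.sum()

s = softmax(np.array([1.0, 2.0, 3.0]))
print(s, s.sum())  # positive entries that sum to 1
```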
We can now define our loss; I'll use the squared distance from our predicted probability distribution $\hat{y}$ to the true distribution $y$:

$$L(\hat{y}) = \|\hat{y} - y\|^2$$

All in all, our model is $\hat{y} = S(Wx)$ with loss $L(\hat{y}) = \|\hat{y} - y\|^2$.
To train the model we need the gradient, so let's compute it using the chain rule.
(This can also be done using the product rule for dot products.) The outermost piece is easy: differentiating $L(\hat{y}) = \|\hat{y}-y\|^2$ with respect to $\hat{y}$ gives $\nabla L(\hat{y}) = 2(\hat{y}-y)$.
Now let's differentiate softmax. Since softmax is a function from $\mathbb{R}^n$ to $\mathbb{R}^n$, we'll need to pick the output $S_i$ and the input $x_j$ we're differentiating with respect to. First suppose $i=j$; letting $\sigma=\sum_k \exp(x_k)$ be shorthand we get

$$\frac{\partial S_i}{\partial x_i} = \frac{\exp(x_i)\,\sigma - \exp(x_i)^2}{\sigma^2} = S_i\,(1 - S_i)$$

and for $i\neq j$ the same quotient rule gives

$$\frac{\partial S_i}{\partial x_j} = \frac{-\exp(x_i)\exp(x_j)}{\sigma^2} = -S_i\,S_j$$
When translating this into numpy we can use `np.outer`, which satisfies `np.outer(a, b)[i, j] == a[i] * b[j]`.
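For example, here is a small sketch (reusing the `softmax` helper from the earlier sketch, which is my own, not code from this post) that assembles the full softmax Jacobian from `np.diag` and `np.outer` and checks it against finite differences:

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def softmax_jacobian(x):
    # J[i, j] = dS_i/dx_j: S_i*(1 - S_i) on the diagonal, -S_i*S_j off it,
    # which is exactly np.diag(s) - np.outer(s, s).
    s = softmax(x)
    return np.diag(s) - np.outer(s, s)

# Finite-difference check of one column of the Jacobian.
x = np.array([0.2, -1.0, 0.5])
eps = 1e-6
e0 = np.array([1.0, 0.0, 0.0])
fd_col0 = (softmax(x + eps * e0) - softmax(x - eps * e0)) / (2 * eps)
print(np.allclose(softmax_jacobian(x)[:, 0], fd_col0))  # should print True
```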
Putting it together in numpy (with toy shapes, and without the softmax for simplicity):

```python
import numpy as np

# Toy data: a 2x3 weight matrix, one input vector, and one target vector.
W = np.ones((2, 3))
x = np.array([-1, 1, 1])
y = np.array([1, 2])

for _ in range(10):
    y_hat = W @ x                        # forward pass (no softmax yet)
    loss = ((y_hat - y) ** 2).sum()      # squared-distance loss
    print(loss)
    grad = np.outer(2 * (y_hat - y), x)  # dL/dW is an outer product
    W -= 0.1 * grad                      # gradient descent step
```
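The loop above trains the plain linear model $\hat{y} = Wx$ on toy data. As a sketch of how the same machinery extends to the full model $\hat{y} = S(Wx)$ (my own illustration, using the softmax Jacobian derived above; the data is still made up), we push the loss gradient $2(\hat{y}-y)$ through $J_S$ before forming the outer product with $x$:

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

rng = np.random.default_rng(0)
W = rng.normal(size=(10, 784)) * 0.01   # toy weights
x = rng.random(784)                     # stand-in for one MNIST image
y = np.zeros(10)
y[1] = 1.0                              # one-hot label for the digit "one"

for _ in range(10):
    z = W @ x
    y_hat = softmax(z)
    loss = ((y_hat - y) ** 2).sum()
    print(loss)
    dL_dy_hat = 2 * (y_hat - y)
    # Chain rule through softmax: J_S = diag(s) - s s^T is symmetric,
    # so J_S^T dL/dy_hat is just J_S @ dL_dy_hat.
    J_S = np.diag(y_hat) - np.outer(y_hat, y_hat)
    dL_dz = J_S @ dL_dy_hat
    # Each entry of z is a row of W dotted with x, so dL/dW is an outer product.
    grad_W = np.outer(dL_dz, x)
    W -= 0.1 * grad_W
```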