Newton's Method

Created 2021-02-21, Edited 2022-07-20
A gentle introduction to root finding and it's applications


If you're ever confused by notation, see Notation

Root Finding

Newton's method is typically presented as a method for finding roots, values of xx where f(x)=0f(x) = 0. Many people don't realize how general this is

Newton's method also works in any number of dimensions, have a system of equations to solve? A cost function you want to minimize with multiple parameters?

To illustrate all the places Newton's method can be used, Let's look at some examples

Example: Square roots

The (obligatory) standard demo of newton's method is this magical formula

(x+s/x)/2(x + s/x)/2

Plug in a guess for s\sqrt s, and it gives you a better guess. If you plug in s=2s=2 and x=1.4x=1.4 you get 1.414281.41428 while the true value starts 1.414211.41421. 4 decimal places of accuracy!

Example: Roots of Polynomials

It's a pop-math fact that there does not exist a formula for the roots of polynomials with degree greater than 5. Newton's method works just fine, In fact finding roots of high degree polynomials is a solved problem with efficient algorithms existing for polynomials of degree greater than a thousand!

Finding roots of polynomials comes all the time in computer graphics, so this has a lot of applications. (Example: Given a point, finding the closest point on a quadratic Bézier curve reduces to finding the roots of a third degree polynomial)

Example: Projectile Aiming

At the end of this post we'll use these techniques to shoot down an enemy drone in a toy physics model.

Example: Landing Rockets

Newton's Method is used to solve convex optimization problems, which in turn are used to land rockets with the G-FOLD algorithm. See NASA's G-FOLD Diversion Test or SpaceX literally cheating.

Example: Robot Backflips

You've seen Boston Dynamics's Atlas robots right? Hm, I wonder what they use to control the robot, according to this paper lots of convex optimization and Inverse kinematics. Both of which involve Newton's Method!

Single variable Newton's Method

Hyped up? Let's start the math with Newton's Method in one variable. We have a function ff and some guess xx for a root which we want to improve. Newton's Method does this by finding a root of the linear approximation of ff around xx.

Let's break that down, we want to find yy such that f(y)=0f(y) = 0. We can use the tangent line approximation from calculus

f(y)f(x)+f(x)(yx)f(y) \approx f(x) + f'(x)(y-x)

Solve for when the tangent/linear approximation is zero

f(x)+f(x)(yx)=0    y=xf(x)f(x)f(x) + f'(x)(y-x) = 0 \implies y = x - \frac{f(x)}{f'(x)}

Exercise: Verify the formula for Square Roots (hint: consider f(x)=x2sf(x) = x^2 - s)


Let f(x)=x2sf(x) = x^2 - s, we have f(x)=2xf'(x) = 2x so assuming our previous guess was xx our new guess is

xf(x)f(x)=xx2s2x=2x2(x2s)2x=x2+s2x=(x+s/x)/2\begin{aligned} x - \frac{f(x)}{f'(x)} &= x - \frac{x^2 - s}{2x} \\ &= \frac{2x^2 - (x^2 - s)}{2x} \\ &= \frac{x^2 + s}{2x} \\ &= (x + s/x)/2 \end{aligned}

Here's a visualization of how this works

Exercise: Suppose we're trying to find f(x)=0f'(x) = 0, what's the geometric interpretation of xf(x)/f(x)x - f'(x)/f''(x)?


The point where the tangent line has a root, newton's method works by iteratively finding the root of the tangent approximation.

Don't get too attached to the visuals! Because next up is...

Multivariable Newton's Method

Before we do newton's method in multiple variables, we need to talk about differentiation in multiple variables. Consider a function ff from nn to mm dimensions, fully written in vector form f\mathbf f would be

f(x)=[f1(x1,,xn)fm(x1,,xn)]\mathbf f(\mathbf x) = \begin{bmatrix} f_1(\mathbf x_1,\dots, \mathbf x_n) \\ \vdots \\ f_m(\mathbf x_1,\dots,\mathbf x_n) \end{bmatrix}

Where f1(x1,,xn)f_1(\mathbf x_1,\dots,\mathbf x_n) denotes the first output coordinate, f2f_2 the second, etc. We use a bold x\mathbf x to indicate that x\mathbf x is a vector.

If you want to use fancy symbols, you could say xRn\mathbf x \in \mathbf R^n to mean x\mathbf x is in the set of length nn vectors with real components, and say f:RnRmf:\mathbf{R}^n\to \mathbf R^m to mean ff takes a vector in Rn\mathbf R^n to a vector in Rm\mathbf R^m.

Generalizing the derivative

Skip to Finding our new guess if you know jacobians, also see khan academy if my explanation doesn't make sense to you.

Consider tweaking x\mathbf x by adding a vector dx\mathrm d\mathbf x, and consider the first coordinate of the difference

(f(x+dx)f(x))1=dx(df1dx1++df1dx1)(f(\mathbf x + \mathrm d\mathbf x) - f(\mathbf x))_1 = \mathrm d \mathbf x \left(\frac{\mathrm df_1}{\mathrm d\mathbf x_1} + \dots + \frac{\mathrm df_1}{\mathrm d\mathbf x_1}\right)

We're just adding all the changes in ff as you can see if you imagine "cancelling" dx\mathrm d\mathbf x.

The same logic applies for the other coordinates. In fact this is the a matrix multiplication!

[df1/dx1df1/dxndfm/dx1dfm/dxn][dx1dxn]\begin{bmatrix} {\mathrm df_1}/{\mathrm d\mathbf x_1} & \dots & {\mathrm df_1}/{\mathrm d\mathbf x_n} \\ \vdots & \ddots & \vdots \\ {\mathrm df_m}/{\mathrm d\mathbf x_1} & \dots & {\mathrm df_m}/{\mathrm d\mathbf x_n} \end{bmatrix} \begin{bmatrix} \mathrm d\mathbf x_1 \\ \vdots \\ \mathrm d\mathbf x_n \end{bmatrix}

I'm going to denote the matrix on the left as Df(x)D_f(\mathbf x).

Exercise: If you know about basis vectors, think about the basis vectors of Df(x)D_f(\mathbf x). Watch this after

Meaning our linear approximation for f(y)f(\mathbf y) is

f(y)f(x)+Df(x)(yx)f(\mathbf y) \approx f(\mathbf x) + D_f(\mathbf x)(\mathbf y - \mathbf x)

Notice the similarity to the single variable case! The only difference is Df(x)D_f(\mathbf x) is a matrix instead a number.

Finding our new guess

Like before we solve to make the linear approximation zero, In exactly the same way except we use the matrix inverse Df(x)1D_f(\mathbf x)^{-1} instead of division.

f(x)+Df(x)(yx)=0    y=xDf(x)1f(x)f(\mathbf x) + D_f(\mathbf x)(\mathbf y-\mathbf x) = 0 \implies \mathbf y = \mathbf x - D_f(\mathbf x)^{-1}f(\mathbf x)

Which is our new guess!

Exercise: If we're trying to minimize a loss function g:RnRg : \mathbf R^n \to \mathbf R by making g(x)=0\nabla g(x) = 0. What's the formula for Newton's Method in this case? (hint: The jacobian of g\nabla g has a name!)


The jacobian of g\nabla g is the Hessian HgH_g of gg meaning we get

y=xHg1g(x)\mathbf y = \mathbf x - H_g^{-1}\nabla g(\mathbf x)

Notice how this is analgous to the single variable case, the Hessian is the multivariable analgous of the second derivative.

xf(x)f(x)x - \frac{f'(x)}{f''(x)}

Projectile Aiming

Let's put everything together in one semi-real world example. Shooting down drones. Suppose p(t)\mathbf p(t) gives the position of the enemy drone at a specific time, and we're allowed to pick a direction (θ,ϕ)(\theta, \phi) (in spherical coordinates) to shoot. The velocity, and acceleration, from gravity of the shot are given by

v(θ,ϕ)=5(sinθcosϕ,sinθsinϕ,cosθ),g=(0,0,9.8)\mathbf v(\theta,\phi) = 5(\sin\theta\cos\phi, \sin\theta\sin\phi, \cos \theta),\quad \mathbf g = (0, 0, -9.8)

The position of the shot at time tt is given by s(θ,ϕ,t)=tv(θ,ϕ)+12t2g\mathbf s(\theta,\phi,t) = t\mathbf v(\theta,\phi) + \frac 12 t^2\mathbf g.

We want to find θ,ϕ\theta, \phi and thitt_\text{hit} such that p(thit)=s(thit)\mathbf p(t_\text{hit}) = \mathbf s(t_\text{hit}). We can use multivariable Newton's Method for this! We need to find the jacobian of

f(θ,ϕ,thit)=p(thit)s(θ,ϕ,thit)\mathbf f(\theta,\phi,t_\text{hit}) = \mathbf p(t_\text{hit}) - \mathbf s(\theta,\phi,t_\text{hit})

I'm lazy, so I'll make JAX do it for me

import jax.numpy as jnp
from jax import jacobian

# p(t) is made up for illustration
def p(t):
    return jnp.array([t, jnp.sin(t), 0.1*t**2])
def v(theta, phi):
    return 5*jnp.array([
g = jnp.array([0, 0, -9.8])
def s(theta, phi, t):
    return t*v(theta, phi) + 0.5*t**2*g
def f(v):
    return p(v[2]) - s(v[0], v[1], v[2])

Df = jacobian(f)

def step(x):
    Dfx = Df(x)
    Dfx_inv = jnp.linalg.inv(Dfx)
    return x - Dfx_inv @ f(x)

# In a real application you would not hardcode this,
# setting it to hit the target ignoring movement and
# gravity would be a good initial guess.
guess = jnp.array([1,2,0.5])

for i in range(5):
    print(f'missed by {jnp.linalg.norm(f(guess)):.5f}')
    guess = step(guess)


missed by 1.98916
missed by 1.51225
missed by 0.26145
missed by 0.00024
missed by 0.00000


Here are other good explanations of Newton's Method

Here's some programming resources


Rn\mathbf R^n denotes the set of length nn vectors with real components, (2,π)R2(\sqrt 2,\pi) \in \mathbf R^2 for example

Bold variables x\mathbf x and y\mathbf y denote vectors, x1\mathbf x_1 denotes the first component of x\mathbf x.

I write vectors in two ways, but they represent the same object

(x1,,xn)=[x1xn](\mathbf x_1,\dots,\mathbf x_n) = \begin{bmatrix} \mathbf x_1 \\ \vdots \\ \mathbf x_n \end{bmatrix}

Sometimes I get lazy and write g(x)=g(x1,x2)g(\mathbf{x}) = g(\mathbf{x_1}, \mathbf{x_2}) when gg takes in a vector in R2\mathbf{R}^2, technically I should write g(x)=g((x1,x2))g(\mathbf{x}) = g((\mathbf{x_1}, \mathbf{x_2})) since gg doesn't take two numbers, it takes one vector x=(x1,x2)\mathbf x=(\mathbf x_1,\mathbf x_2).