Newton's Method

Added 2022-02-17, Modified 2022-07-20

A gentle introduction to root finding and its applications


Prerequisites

If you're ever confused by notation, see Notation

Root Finding

Newton's method is typically presented as a method for finding roots: values of $x$ where $f(x) = 0$. Many people don't realize how general this is.

Newton's method also works in any number of dimensions. Have a system of equations to solve? A cost function with multiple parameters you want to minimize? Newton's method can handle both.

To illustrate all the places Newton's method can be used, let's look at some examples.

Example: Square roots

The (obligatory) standard demo of Newton's method is this magical formula

$$(x + s/x)/2$$

Plug in a guess for $\sqrt s$, and it gives you a better guess. If you plug in $s = 2$ and $x = 1.4$ you get $1.41428$, while the true value starts $1.41421$. Four decimal places of accuracy!
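If you want to watch the digits converge yourself, here's a quick sketch in plain Python (the function name `improve` is mine, just for illustration):

```python
import math

def improve(x, s):
    # One step of the magical formula: average x with s/x
    return (x + s / x) / 2

x = 1.4  # initial guess for sqrt(2)
for _ in range(4):
    x = improve(x, 2)

# After a few iterations x matches sqrt(2) to machine precision
print(x, math.sqrt(2))
```

Each step roughly doubles the number of correct digits, which is why four iterations are already enough.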

Example: Roots of Polynomials

It's a pop-math fact that there is no general formula for the roots of polynomials of degree five or higher. But Newton's method works just fine. In fact, finding roots of high degree polynomials is a solved problem, with efficient algorithms existing for polynomials of degree greater than a thousand!

Finding roots of polynomials comes up all the time in computer graphics, so this has a lot of applications. (Example: given a point, finding the closest point on a quadratic Bézier curve reduces to finding the roots of a third degree polynomial.)
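As a small taste (my own toy cubic, not one from graphics), here's Newton's method cracking a cubic using the update rule $x - f(x)/f'(x)$ that we derive later in this post:

```python
def newton(f, df, x, steps=20):
    # Repeatedly replace x with the root of the tangent line at x
    for _ in range(steps):
        x = x - f(x) / df(x)
    return x

# Classic example: find a root of x^3 - 2x - 5
f = lambda x: x**3 - 2*x - 5
df = lambda x: 3*x**2 - 2
root = newton(f, df, x=2.0)
print(root)  # about 2.0945514815
```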

Example: Projectile Aiming

At the end of this post we'll use these techniques to shoot down an enemy drone in a toy physics model.

Example: Landing Rockets

Newton's Method is used to solve convex optimization problems, which in turn are used to land rockets with the G-FOLD algorithm. See NASA's G-FOLD Diversion Test or SpaceX literally cheating.

Example: Robot Backflips

You've seen Boston Dynamics's Atlas robots, right? I wondered what they use to control the robot; according to this paper, it's lots of convex optimization and inverse kinematics, both of which involve Newton's Method!

Single variable Newton's Method

Hyped up? Let's start the math with Newton's Method in one variable. We have a function $f$ and some guess $x$ for a root, which we want to improve. Newton's Method does this by finding a root of the linear approximation of $f$ around $x$.

Let's break that down: we want to find $y$ such that $f(y) = 0$. We can use the tangent line approximation from calculus

$$f(y) \approx f(x) + f'(x)(y - x)$$

Solve for when the tangent/linear approximation is zero

$$f(x) + f'(x)(y - x) = 0 \implies y = x - \frac{f(x)}{f'(x)}$$

Exercise: Verify the formula for Square Roots (hint: consider $f(x) = x^2 - s$)

Solution

Let $f(x) = x^2 - s$; we have $f'(x) = 2x$, so assuming our previous guess was $x$, our new guess is

$$\begin{aligned} x - \frac{f(x)}{f'(x)} &= x - \frac{x^2 - s}{2x} \\ &= \frac{2x^2 - (x^2 - s)}{2x} \\ &= \frac{x^2 + s}{2x} \\ &= (x + s/x)/2 \end{aligned}$$
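If you'd rather check the algebra numerically than by hand, here's a small sketch (function names are mine) confirming the generic Newton step for $f(x) = x^2 - s$ agrees with the averaging formula:

```python
import random

def newton_step(x, s):
    # Generic Newton step for f(x) = x^2 - s
    f, df = x**2 - s, 2*x
    return x - f / df

def babylonian(x, s):
    # The magical formula from the start of the post
    return (x + s / x) / 2

# The two formulas agree for any positive x and s
for _ in range(100):
    x, s = random.uniform(0.1, 10), random.uniform(0.1, 10)
    assert abs(newton_step(x, s) - babylonian(x, s)) < 1e-9
```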

Here's a visualization of how this works

Exercise: Suppose we're trying to find $f'(x) = 0$; what's the geometric interpretation of $x - f'(x)/f''(x)$?

Solution

It's the point where the tangent line of $f'$ crosses zero; Newton's method works by iteratively finding the root of the tangent approximation, here applied to $f'$ instead of $f$.

Don't get too attached to the visuals! Because next up is...

Multivariable Newton's Method

Before we do Newton's method in multiple variables, we need to talk about differentiation in multiple variables. Consider a function $f$ from $n$ to $m$ dimensions; fully written out in vector form, $\mathbf f$ would be

$$\mathbf f(\mathbf x) = \begin{bmatrix} f_1(\mathbf x_1,\dots,\mathbf x_n) \\ \vdots \\ f_m(\mathbf x_1,\dots,\mathbf x_n) \end{bmatrix}$$

Where $f_1(\mathbf x_1,\dots,\mathbf x_n)$ denotes the first output coordinate, $f_2$ the second, etc. We use a bold $\mathbf x$ to indicate that $\mathbf x$ is a vector.

If you want to use fancy symbols, you could say $\mathbf x \in \mathbf R^n$ to mean $\mathbf x$ is in the set of length $n$ vectors with real components, and say $f : \mathbf R^n \to \mathbf R^m$ to mean $f$ takes a vector in $\mathbf R^n$ to a vector in $\mathbf R^m$.

Generalizing the derivative

Skip to Finding our new guess if you already know Jacobians; also see Khan Academy if my explanation doesn't make sense to you.

Consider tweaking $\mathbf x$ by adding a vector $\mathrm d\mathbf x$, and consider the first coordinate of the difference

$$(\mathbf f(\mathbf x + \mathrm d\mathbf x) - \mathbf f(\mathbf x))_1 \approx \mathrm d\mathbf x_1\frac{\mathrm df_1}{\mathrm d\mathbf x_1} + \dots + \mathrm d\mathbf x_n\frac{\mathrm df_1}{\mathrm d\mathbf x_n}$$

We're just adding up all the changes in $f$, as you can see if you imagine "cancelling" each $\mathrm d\mathbf x_i$.

The same logic applies for the other coordinates. In fact, this is a matrix multiplication!

$$\begin{bmatrix} {\mathrm df_1}/{\mathrm d\mathbf x_1} & \dots & {\mathrm df_1}/{\mathrm d\mathbf x_n} \\ \vdots & \ddots & \vdots \\ {\mathrm df_m}/{\mathrm d\mathbf x_1} & \dots & {\mathrm df_m}/{\mathrm d\mathbf x_n} \end{bmatrix} \begin{bmatrix} \mathrm d\mathbf x_1 \\ \vdots \\ \mathrm d\mathbf x_n \end{bmatrix}$$

I'm going to denote the matrix on the left as $D_f(\mathbf x)$; this matrix is called the Jacobian of $f$.

Exercise: If you know about basis vectors, think about the basis vectors of $D_f(\mathbf x)$. Watch this after

Meaning our linear approximation for $\mathbf f(\mathbf y)$ is

$$\mathbf f(\mathbf y) \approx \mathbf f(\mathbf x) + D_f(\mathbf x)(\mathbf y - \mathbf x)$$

Notice the similarity to the single variable case! The only difference is that $D_f(\mathbf x)$ is a matrix instead of a number.
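To make the linear approximation concrete, here's a numerical sanity check (the map $\mathbf f$ below is made up for illustration): the true change in $\mathbf f$ for a small step should match $D_f(\mathbf x)\,\mathrm d\mathbf x$.

```python
import numpy as np

def f(x):
    # A made-up map from R^2 to R^2
    return np.array([x[0]**2 + x[1], np.sin(x[0]) * x[1]])

def Df(x):
    # Hand-computed Jacobian: row i holds the derivatives of output i
    return np.array([
        [2*x[0],              1.0],
        [np.cos(x[0])*x[1],   np.sin(x[0])],
    ])

x = np.array([1.0, 2.0])
dx = np.array([1e-4, -2e-4])

actual = f(x + dx) - f(x)
linear = Df(x) @ dx
print(np.max(np.abs(actual - linear)))  # tiny: the error is O(|dx|^2)
```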

Finding our new guess

Like before, we solve to make the linear approximation zero, in exactly the same way except we use the matrix inverse $D_f(\mathbf x)^{-1}$ instead of division.

$$\mathbf f(\mathbf x) + D_f(\mathbf x)(\mathbf y - \mathbf x) = 0 \implies \mathbf y = \mathbf x - D_f(\mathbf x)^{-1}\mathbf f(\mathbf x)$$

Which is our new guess!
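Here's the update in action on a small system of my own invention: finding where the circle $x^2 + y^2 = 4$ meets the parabola $y = x^2$.

```python
import numpy as np

def f(v):
    x, y = v
    # Both components are zero exactly at an intersection point
    return np.array([x**2 + y**2 - 4, y - x**2])

def Df(v):
    x, y = v
    return np.array([
        [ 2*x, 2*y],
        [-2*x, 1.0],
    ])

guess = np.array([1.0, 1.0])
for _ in range(10):
    # Solving the linear system is equivalent to multiplying by the
    # inverse Jacobian, and numerically more stable
    guess = guess - np.linalg.solve(Df(guess), f(guess))

print(guess, f(guess))  # residual is essentially zero
```

Note the `np.linalg.solve` trick: in practice you rarely form $D_f(\mathbf x)^{-1}$ explicitly.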

Exercise: Suppose we're trying to minimize a loss function $g : \mathbf R^n \to \mathbf R$ by making $\nabla g(\mathbf x) = 0$. What's the formula for Newton's Method in this case? (hint: the Jacobian of $\nabla g$ has a name!)

Solution

The Jacobian of $\nabla g$ is the Hessian $H_g$ of $g$, meaning we get

$$\mathbf y = \mathbf x - H_g^{-1}\nabla g(\mathbf x)$$

Notice how this is analogous to the single variable case; the Hessian is the multivariable analogue of the second derivative.

$$x - \frac{f'(x)}{f''(x)}$$
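As a tiny worked example (the loss function here is made up), one Newton step minimizes any quadratic exactly, because the Hessian of a quadratic is constant:

```python
import numpy as np

def grad_g(v):
    x, y = v
    # Gradient of g(x, y) = (x - 1)^2 + (y + 2)^2 + x*y
    return np.array([2*(x - 1) + y, 2*(y + 2) + x])

# The Hessian of g is constant (and positive definite, so this is a minimum)
H = np.array([[2.0, 1.0],
              [1.0, 2.0]])

v = np.array([0.0, 0.0])
v = v - np.linalg.solve(H, grad_g(v))  # one Newton step

print(v, grad_g(v))  # the gradient is now zero: we're at the minimum
```

For non-quadratic losses the Hessian changes from point to point, so you iterate, but convergence near the minimum is still very fast.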

Projectile Aiming

Let's put everything together in one semi-real world example: shooting down drones. Suppose $\mathbf p(t)$ gives the position of the enemy drone at time $t$, and we're allowed to pick a direction $(\theta, \phi)$ (in spherical coordinates) to shoot. The velocity of the shot and the acceleration from gravity are given by

$$\mathbf v(\theta,\phi) = 5(\sin\theta\cos\phi,\ \sin\theta\sin\phi,\ \cos\theta), \quad \mathbf g = (0, 0, -9.8)$$

The position of the shot at time $t$ is given by $\mathbf s(\theta,\phi,t) = t\mathbf v(\theta,\phi) + \frac{1}{2}t^2\mathbf g$.

We want to find $\theta$, $\phi$ and $t_\text{hit}$ such that $\mathbf p(t_\text{hit}) = \mathbf s(\theta,\phi,t_\text{hit})$. We can use multivariable Newton's Method for this! We need to find the Jacobian of

$$\mathbf f(\theta,\phi,t_\text{hit}) = \mathbf p(t_\text{hit}) - \mathbf s(\theta,\phi,t_\text{hit})$$

I'm lazy, so I'll make JAX do it for me

import jax.numpy as jnp
from jax import jacobian

# p(t) is made up for illustration
def p(t):
    return jnp.array([t, jnp.sin(t), 0.1*t**2])
def v(theta, phi):
    return 5*jnp.array([
        jnp.sin(theta)*jnp.cos(phi),
        jnp.sin(theta)*jnp.sin(phi),
        jnp.cos(theta)
    ])
g = jnp.array([0, 0, -9.8])
def s(theta, phi, t):
    return t*v(theta, phi) + 0.5*t**2*g
def f(x):
    # x = (theta, phi, t_hit); parameter renamed from v to avoid
    # shadowing the velocity function v(theta, phi) above
    return p(x[2]) - s(x[0], x[1], x[2])

Df = jacobian(f)

def step(x):
    Dfx = Df(x)
    Dfx_inv = jnp.linalg.inv(Dfx)
    return x - Dfx_inv @ f(x)


# In a real application you would not hardcode this,
# setting it to hit the target ignoring movement and
# gravity would be a good initial guess.
guess = jnp.array([1,2,0.5])

for i in range(5):
    print(f'missed by {jnp.linalg.norm(f(guess)):.5f}')
    guess = step(guess)

Prints:

missed by 1.98916
missed by 1.51225
missed by 0.26145
missed by 0.00024
missed by 0.00000

Resources

Here are other good explanations of Newton's Method

Here are some programming resources

Notation

$\mathbf R^n$ denotes the set of length $n$ vectors with real components; for example, $(\sqrt 2, \pi) \in \mathbf R^2$.

Bold variables $\mathbf x$ and $\mathbf y$ denote vectors; $\mathbf x_1$ denotes the first component of $\mathbf x$.

I write vectors in two ways, but they represent the same object

$$(\mathbf x_1,\dots,\mathbf x_n) = \begin{bmatrix} \mathbf x_1 \\ \vdots \\ \mathbf x_n \end{bmatrix}$$

Sometimes I get lazy and write $g(\mathbf x) = g(\mathbf x_1, \mathbf x_2)$ when $g$ takes in a vector in $\mathbf R^2$; technically I should write $g(\mathbf x) = g((\mathbf x_1, \mathbf x_2))$, since $g$ doesn't take two numbers, it takes one vector $\mathbf x = (\mathbf x_1, \mathbf x_2)$.