# Newton's Method

A gentle introduction to root finding and its applications

## Prerequisites

• Familiarity with derivatives
• Familiarity with vectors (adding them and multiplying them by numbers)
• Knowledge of matrix multiplication and matrix inverses (you can manage without, but you'll miss some stuff)

If you're ever confused by notation, see the Notation section.

## Root Finding

Newton's method is typically presented as a method for finding roots, values of $x$ where $f(x) = 0$. Many people don't realize how general this is:

• Solving $g(x) = h(x)$ is the same as solving $g(x) - h(x) = 0$. Root finding can be used to solve equations.
• Finding the minimum/maximum of some function $f$ requires solving $f'(x) = 0$. Root finding strikes again!

Newton's method also works in any number of dimensions. Have a system of equations to solve? A cost function you want to minimize with multiple parameters? Newton's method handles both.

To illustrate all the places Newton's method can be used, let's look at some examples.

### Example: Square roots

The (obligatory) standard demo of Newton's method is this magical formula:

$(x + s/x)/2$

Plug in a guess for $\sqrt s$, and it gives you a better guess. If you plug in $s=2$ and $x=1.4$ you get $1.41428$, while the true value starts $1.41421$. Four decimal places of accuracy!
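The formula is easy to check numerically. A quick sketch in plain Python, starting from the guess $1.4$ used above:

```python
def improve(x, s):
    # one step of the magical formula: average x with s/x
    return (x + s / x) / 2

guess = 1.4
for _ in range(3):
    guess = improve(guess, 2)
print(guess)  # converges to sqrt(2) = 1.41421356...
```

Each step roughly doubles the number of correct digits, so three steps already exhaust double precision.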

### Example: Roots of Polynomials

It's a pop-math fact that there is no general formula for the roots of polynomials of degree five or higher. Newton's method works just fine. In fact, finding roots of high-degree polynomials is a solved problem, with efficient algorithms existing even for polynomials of degree greater than a thousand!

Finding roots of polynomials comes up all the time in computer graphics, so this has a lot of applications. (Example: given a point, finding the closest point on a quadratic Bézier curve reduces to finding the roots of a third-degree polynomial.)

### Example: Projectile Aiming

At the end of this post we'll use these techniques to shoot down an enemy drone in a toy physics model.

### Example: Landing Rockets

Newton's Method is used to solve convex optimization problems, which in turn are used to land rockets with the G-FOLD algorithm. See NASA's G-FOLD Diversion Test or SpaceX literally cheating.

### Example: Robot Backflips

You've seen Boston Dynamics's Atlas robots, right? I wondered what they use to control the robot; according to this paper, lots of convex optimization and inverse kinematics. Both of which involve Newton's Method!

## Single variable Newton's Method

Hyped up? Let's start the math with Newton's Method in one variable. We have a function $f$ and some guess $x$ for a root which we want to improve. Newton's Method does this by finding a root of the linear approximation of $f$ around $x$.

Let's break that down. We want to find $y$ such that $f(y) = 0$. We can use the tangent line approximation from calculus:

$f(y) \approx f(x) + f'(x)(y-x)$

Solve for where the tangent/linear approximation is zero:

$f(x) + f'(x)(y-x) = 0 \implies y = x - \frac{f(x)}{f'(x)}$
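The update translates directly into a short loop. A minimal sketch in plain Python, solving the made-up example equation $\cos x = x$ (i.e. $f(x) = \cos x - x$):

```python
import math

def newton(f, fprime, x, steps=10):
    # iterate x <- x - f(x)/f'(x)
    for _ in range(steps):
        x = x - f(x) / fprime(x)
    return x

# solve cos(x) = x by finding a root of f(x) = cos(x) - x
root = newton(lambda x: math.cos(x) - x,
              lambda x: -math.sin(x) - 1,
              x=1.0)
print(root)  # ≈ 0.7390851332151607
```

In practice you'd also stop early once $|f(x)|$ is small enough, but a fixed step count keeps the sketch simple.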

Exercise: Verify the formula for Square Roots (hint: consider $f(x) = x^2 - s$)

Solution

Let $f(x) = x^2 - s$; we have $f'(x) = 2x$, so if our previous guess was $x$, our new guess is

$\begin{aligned} x - \frac{f(x)}{f'(x)} &= x - \frac{x^2 - s}{2x} \\ &= \frac{2x^2 - (x^2 - s)}{2x} \\ &= \frac{x^2 + s}{2x} \\ &= (x + s/x)/2 \end{aligned}$

Here's a visualization of how this works

Exercise: Suppose we're trying to find $f'(x) = 0$, what's the geometric interpretation of $x - f'(x)/f''(x)$?

Solution

It's the root of the tangent line to $f'$ at $x$: Newton's method works by iteratively finding the root of the tangent approximation, here applied to $f'$ instead of $f$.

Don't get too attached to the visuals! Because next up is...

## Multivariable Newton's Method

Before we do Newton's method in multiple variables, we need to talk about differentiation in multiple variables. Consider a function $f$ from $n$ to $m$ dimensions; fully written out in vector form, $\mathbf f$ would be

$\mathbf f(\mathbf x) = \begin{bmatrix} f_1(\mathbf x_1,\dots, \mathbf x_n) \\ \vdots \\ f_m(\mathbf x_1,\dots,\mathbf x_n) \end{bmatrix}$

Where $f_1(\mathbf x_1,\dots,\mathbf x_n)$ denotes the first output coordinate, $f_2$ the second, etc. We use a bold $\mathbf x$ to indicate that $\mathbf x$ is a vector.

If you want to use fancy symbols, you could say $\mathbf x \in \mathbf R^n$ to mean $\mathbf x$ is in the set of length $n$ vectors with real components, and say $f:\mathbf{R}^n\to \mathbf R^m$ to mean $f$ takes a vector in $\mathbf R^n$ to a vector in $\mathbf R^m$.

### Generalizing the derivative

Skip to Finding our new guess if you already know Jacobians; also see Khan Academy if my explanation doesn't make sense to you.

Consider tweaking $\mathbf x$ by adding a small vector $\mathrm d\mathbf x$, and consider the first coordinate of the difference

$(f(\mathbf x + \mathrm d\mathbf x) - f(\mathbf x))_1 \approx \frac{\mathrm df_1}{\mathrm d\mathbf x_1}\mathrm d\mathbf x_1 + \dots + \frac{\mathrm df_1}{\mathrm d\mathbf x_n}\mathrm d\mathbf x_n$

We're just adding up all the changes in $f_1$, as you can see if you imagine "cancelling" each $\mathrm d\mathbf x_i$.

The same logic applies to the other coordinates. In fact, this is a matrix multiplication!

$\begin{bmatrix} {\mathrm df_1}/{\mathrm d\mathbf x_1} & \dots & {\mathrm df_1}/{\mathrm d\mathbf x_n} \\ \vdots & \ddots & \vdots \\ {\mathrm df_m}/{\mathrm d\mathbf x_1} & \dots & {\mathrm df_m}/{\mathrm d\mathbf x_n} \end{bmatrix} \begin{bmatrix} \mathrm d\mathbf x_1 \\ \vdots \\ \mathrm d\mathbf x_n \end{bmatrix}$

I'm going to denote the matrix on the left as $D_f(\mathbf x)$; it's called the Jacobian of $f$.
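You rarely have to build this matrix by hand; automatic differentiation will do it for you. A small sketch with JAX (the function `f` here is made up for illustration):

```python
import jax.numpy as jnp
from jax import jacobian

# a made-up f : R^2 -> R^2
def f(x):
    return jnp.array([x[0]**2 + x[1],
                      jnp.sin(x[0]) * x[1]])

Df = jacobian(f)  # Df(x) is the m-by-n matrix of partial derivatives

# by hand: [[2*x1, 1], [x2*cos(x1), sin(x1)]], so at (0, 2):
print(Df(jnp.array([0.0, 2.0])))
# [[0. 1.]
#  [2. 0.]]
```

Here `jacobian` returns a new function that evaluates the full matrix at any point.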

Exercise: If you know about basis vectors, think about what $D_f(\mathbf x)$ does to each basis vector. Watch this after

Meaning our linear approximation for $f(\mathbf y)$ is

$f(\mathbf y) \approx f(\mathbf x) + D_f(\mathbf x)(\mathbf y - \mathbf x)$

Notice the similarity to the single variable case! The only difference is that $D_f(\mathbf x)$ is a matrix instead of a number.

### Finding our new guess

Like before, we solve for where the linear approximation is zero, in exactly the same way except we use the matrix inverse $D_f(\mathbf x)^{-1}$ instead of division.

$f(\mathbf x) + D_f(\mathbf x)(\mathbf y-\mathbf x) = 0 \implies \mathbf y = \mathbf x - D_f(\mathbf x)^{-1}f(\mathbf x)$

Which is our new guess!

Exercise: Suppose we're trying to minimize a loss function $g : \mathbf R^n \to \mathbf R$ by solving $\nabla g(\mathbf x) = 0$. What's the formula for Newton's Method in this case? (hint: the Jacobian of $\nabla g$ has a name!)

Solution

The Jacobian of $\nabla g$ is the Hessian $H_g$ of $g$, meaning we get

$\mathbf y = \mathbf x - H_g(\mathbf x)^{-1}\nabla g(\mathbf x)$

Notice how this is analogous to the single variable case; the Hessian is the multivariable analogue of the second derivative.

$x - \frac{f'(x)}{f''(x)}$
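As a sketch, here's that Hessian-based update applied to a made-up quadratic loss, using JAX's `grad` and `hessian`:

```python
import jax.numpy as jnp
from jax import grad, hessian

# made-up loss with minimum at (1, -2)
def g(x):
    return (x[0] - 1.0)**2 + 10.0 * (x[1] + 2.0)**2

x = jnp.array([5.0, 5.0])
for _ in range(3):
    # Newton step: solve H_g(x) d = grad g(x), then subtract d
    x = x - jnp.linalg.solve(hessian(g)(x), grad(g)(x))
print(x)  # a quadratic loss is minimized in a single Newton step: [ 1. -2.]
```

That one-step exactness is special to quadratics, since the "linear approximation" of the gradient is then exact.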

## Projectile Aiming

Let's put everything together in one semi-realistic example: shooting down drones. Suppose $\mathbf p(t)$ gives the position of the enemy drone at time $t$, and we're allowed to pick a direction $(\theta, \phi)$ (in spherical coordinates) to shoot. The launch velocity of the shot and the acceleration from gravity are given by

$\mathbf v(\theta,\phi) = 5(\sin\theta\cos\phi, \sin\theta\sin\phi, \cos \theta),\quad \mathbf g = (0, 0, -9.8)$

The position of the shot at time $t$ is given by $\mathbf s(\theta,\phi,t) = t\mathbf v(\theta,\phi) + \frac 12 t^2\mathbf g$.

We want to find $\theta$, $\phi$ and $t_\text{hit}$ such that $\mathbf p(t_\text{hit}) = \mathbf s(\theta, \phi, t_\text{hit})$. We can use multivariable Newton's Method for this! We need to find the Jacobian of

$\mathbf f(\theta,\phi,t_\text{hit}) = \mathbf p(t_\text{hit}) - \mathbf s(\theta,\phi,t_\text{hit})$

I'm lazy, so I'll make JAX do it for me:

```python
import jax.numpy as jnp
from jax import jacobian

# p(t) is made up for illustration
def p(t):
    return jnp.array([t, jnp.sin(t), 0.1*t**2])

def v(theta, phi):
    return 5*jnp.array([
        jnp.sin(theta)*jnp.cos(phi),
        jnp.sin(theta)*jnp.sin(phi),
        jnp.cos(theta)
    ])

g = jnp.array([0, 0, -9.8])

def s(theta, phi, t):
    return t*v(theta, phi) + 0.5*t**2*g

# x holds (theta, phi, t_hit)
def f(x):
    return p(x[2]) - s(x[0], x[1], x[2])

Df = jacobian(f)

def step(x):
    Dfx = Df(x)
    Dfx_inv = jnp.linalg.inv(Dfx)
    return x - Dfx_inv @ f(x)

# In a real application you would not hardcode this;
# setting it to hit the target ignoring movement and
# gravity would be a good initial guess.
guess = jnp.array([1, 2, 0.5])

for i in range(5):
    print(f'missed by {jnp.linalg.norm(f(guess)):.5f}')
    guess = step(guess)
```

Prints:

```
missed by 1.98916
missed by 1.51225
missed by 0.26145
missed by 0.00024
missed by 0.00000
```

## Resources

Here are other good explanations of Newton's Method

Here are some programming resources

## Notation

$\mathbf R^n$ denotes the set of length $n$ vectors with real components; for example, $(\sqrt 2,\pi) \in \mathbf R^2$.

Bold variables $\mathbf x$ and $\mathbf y$ denote vectors, $\mathbf x_1$ denotes the first component of $\mathbf x$.

I write vectors in two ways, but they represent the same object

$(\mathbf x_1,\dots,\mathbf x_n) = \begin{bmatrix} \mathbf x_1 \\ \vdots \\ \mathbf x_n \end{bmatrix}$

Sometimes I get lazy and write $g(\mathbf{x}) = g(\mathbf{x}_1, \mathbf{x}_2)$ when $g$ takes in a vector in $\mathbf{R}^2$; technically I should write $g(\mathbf{x}) = g((\mathbf{x}_1, \mathbf{x}_2))$, since $g$ doesn't take two numbers, it takes one vector $\mathbf x=(\mathbf x_1,\mathbf x_2)$.