Optimal Polynomial Approximation

Added 2022-03-13, Modified 2022-07-20

An introduction to linear algebra and least squares on functions


Prerequisites

Introduction

If you're confused, see Notation; if that doesn't clear it up, contact me so I can fix the post to be more understandable.

Demo

This polynomial

$$0.005643 x^{5} - 0.15527 x^{3} + 0.98786 x$$

is the best fifth degree polynomial approximation for $\sin x$ over $[-\pi,\pi]$ (within roundoff of course).

Do you see the two graphs? They're so close it's hard to tell there are two! If you compute the squared error (where $p(x)$ denotes our polynomial approximation)

$$\int_{-\pi}^\pi (\sin(x) - p(x))^2\, dx = 0.000116$$

we find it's a really good approximation for $\sin(x)$; Taylor polynomials don't even compare, with an error of $0.1187$.
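
If you want to verify these numbers yourself, here's a quick sympy sketch (it uses the rounded coefficients above, and assumes the quoted Taylor error refers to the degree-5 Taylor polynomial at $0$, so the results only match approximately):

import sympy as sp

x = sp.Symbol('x')
p = 0.005643*x**5 - 0.15527*x**3 + 0.98786*x   # our approximation (rounded)
t = x - x**3/6 + x**5/120                      # degree-5 Taylor polynomial of sin

def squared_error(q):
    # integral of (sin(x) - q)^2 over [-pi, pi]
    return sp.integrate((sp.sin(x) - q)**2, (x, -sp.pi, sp.pi)).evalf()

print(squared_error(p))  # roughly 0.000116
print(squared_error(t))  # roughly 0.1187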

The best part? These optimal polynomial approximations can be found efficiently for any function; in fact, for complicated functions they're easier to find than Taylor series!

Here's the code if you want to play with it before learning the math

Done? Okay, then get ready for...

Linear Algebra

Though a bit of an exaggeration, it can be said that a mathematical problem can be solved only if it can be reduced to a calculation in linear algebra. (Thomas Garrity)

Before you truly understand the demo we've got to learn some linear algebra! To be specific, you need to know the topics covered in the sections below.

I won't be covering linear transformations (the heart of linear algebra) at all, because they aren't needed here.

Important: Attempt the exercises and read the solutions. I've put a lot of content in the solutions so skipping them will make reading this impossible.

The standard space $\mathbf R^n$

Before we get to polynomials, let's consider "vanilla" linear algebra, where every vector is a list of numbers in $\mathbf R^n$.

Adding and scaling vectors works as you'd expect

$$\begin{aligned} (x_1,x_2) + (y_1,y_2) &= (x_1+y_1,x_2+y_2) \\ \lambda (x_1,x_2) &= (\lambda x_1, \lambda x_2) \end{aligned}$$

In abstract linear algebra vectors are any object you can add and scale like this.

Exercise: Functions and polynomials are also vectors; convince yourself of this

Solution

We can add functions and multiply them by scalars and the result will still be a function; the same goes for polynomials.

The set that vectors are contained in is called a vector space. A vector space has a few requirements:

  1. Closed under addition: If $v \in V$ and $w \in V$ then $w + v \in V$
  2. Closed under scaling: If $v \in V$ and $\lambda \in \mathbf R$ then $\lambda v \in V$

Without these properties, working in a vector space would be impossible; imagine adding two vectors and not knowing if the result is also in the space!

Exercise: Why must the zero vector be in every vector space?

Solution

Set $\lambda = 0$ and use (2).

Exercise: Show the space of polynomials of degree exactly $n$ is not a vector space

Solution

It doesn't contain the zero polynomial.

Exercise: Fix the problem shown in the previous exercise to get us a vector space of polynomials

Solution

Consider the space of polynomials with degree at most $n$ (denoted $\mathcal P_n$). This satisfies the vector space rules since

  1. Adding polynomials of degree at most $n$ results in a polynomial of degree at most $n$, so we're closed under addition
  2. Scaling a polynomial never increases its degree (scaling by zero can even decrease it), so we're closed under scaling

Subspaces

A vector space $U$ is said to be a subspace of $V$ if $U$ is contained in $V$, ie. $U \subseteq V$.

Exercise: Show the space of polynomials $\mathcal P$ (of any degree) is a subspace of all real functions (denoted $\mathbf R^{\mathbf R}$)

Solution

The set of polynomials $\mathcal P$ is a vector space, and every polynomial is also a function, in set notation $\mathcal P \subseteq \mathbf{R}^{\mathbf R}$. Hence $\mathcal P$ is a subspace of $\mathbf R^{\mathbf R}$.

Exercise: Interpret our original problem of finding $n$th degree polynomial approximations as a special case of: given a $v \in V$ find the closest $u \in U$ to $v$ (often called the projection of $v$ onto $U$)

Solution

Consider $V = \mathbf C^{[a,b]}$ (the space of functions continuous on $[a,b]$) and $U = \mathcal P_n$. Then $\sin \in V$ and we want the closest $n$th degree polynomial $p \in \mathcal P_n$ to $\sin(x)$.

The dot product

To compute the dot product between two vectors in $\mathbf R^n$ (lists of numbers) we multiply components then add up, ie.

$$\langle (1,2), (3,4)\rangle = 1\cdot 3 + 2\cdot 4 = 11$$

As a special case the dot product of a vector with itself gives the length squared: $\langle v,v\rangle = \|v\|^2$

The dot product has a very satisfying geometric interpretation: If $\theta$ is the angle between $v$ and $w$ then

$$\langle v, w\rangle = \|v\| \|w\|\cos\theta$$

Geometrically this means the dot product is the length of the projection of $v$ onto the line through $w$, scaled by $\|w\|$ (or the other way around). See this video for why.
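
Here's a quick numpy check of the geometric formula with a couple of arbitrary vectors:

import numpy as np

v = np.array([1.0, 2.0])
w = np.array([3.0, 4.0])

dot = np.inner(v, w)                  # <v, w> = 11
theta = np.arccos(dot / (np.linalg.norm(v) * np.linalg.norm(w)))

# the geometric formula gives the same value back
assert np.isclose(dot, np.linalg.norm(v) * np.linalg.norm(w) * np.cos(theta))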

Exercise: Convince yourself that $\langle u+v,w\rangle = \langle u,w\rangle+\langle v,w\rangle$ and $\lambda \langle v,w\rangle = \langle \lambda v, w\rangle$ when $\langle\cdot\rangle$ denotes the dot product.

Exercise: Using the projection interpretation, what is the projection of $v$ onto the line through $u$?

Solution

First normalize $u$ by dividing by its length, then take the dot product and scale in the direction of $u$ (normalized again, so we don't introduce unwanted scaling)

$$\langle v, u/\|u\|\rangle \cdot (u/\|u\|)$$

This is usually written as (after using linearity of $\langle\cdot\rangle$)

$$\frac{\langle v, u\rangle}{\langle u, u\rangle}u$$
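
In numpy the projection formula looks like this (a small sketch with made-up vectors):

import numpy as np

v = np.array([2.0, 3.0])
u = np.array([1.0, 1.0])

# projection of v onto the line through u
proj = np.inner(v, u) / np.inner(u, u) * u
print(proj)  # [2.5 2.5]

# the leftover part of v is perpendicular to u
assert np.isclose(np.inner(v - proj, u), 0)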

Exercise: Expand $\|v-w\|^2$ in terms of dot products, then use the geometric interpretation to prove the law of cosines.

Solution

Expand using linearity just like you would with polynomials!

$$\begin{aligned} \|v-w\|^2 &= \langle v-w,v-w\rangle \\ &= \langle v, v\rangle - 2\langle v, w \rangle + \langle w, w\rangle \\ &= \|v\|^2 + \|w\|^2 - 2\langle v, w\rangle \end{aligned}$$

Using the geometric interpretation this becomes

$$\|v-w\|^2 = \|v\|^2 + \|w\|^2 - 2\|v\|\|w\|\cos\theta$$

Which is the law of cosines!

Exercise: Compute the area of the parallelogram formed between $v$ and $w$ using dot products

Solution

Consider this diagram

The height is $h = \|v\|\sin\theta$ meaning the area is $\|v\|\|w\|\sin\theta$. Writing this in terms of dot products we get

$$\begin{aligned} \|v\|\|w\|\sin\theta &= \|v\|\|w\|\sqrt{1 - \cos^2\theta} \\ &= \|v\|\|w\|\sqrt{1 - \left(\frac{\langle v,w\rangle}{\|v\|\|w\|}\right)^2} \\ &= \sqrt{\|v\|^2\|w\|^2 - \langle v, w\rangle^2} \end{aligned}$$

Exercise (optional): Using the result from the previous exercise, derive the formula for the determinant of a 2x2 matrix (if you know what this means)

Solution

Suppose we have the matrix

$$A = \begin{bmatrix} a & b \\ c & d \end{bmatrix}$$

Consider the row vectors $v = (a,b)$ and $w = (c,d)$. The area of the parallelogram between them is the absolute value of the determinant (watch these if that doesn't make sense), meaning

$$\begin{aligned} |\det A| &= \sqrt{\|v\|^2 \|w\|^2 - \langle v, w\rangle^2} \\ &= \sqrt{(a^2 + b^2)(c^2 + d^2) - (ac + bd)^2} \\ &= \sqrt{(ac)^2 + (ad)^2 + (bc)^2 + (bd)^2 - ((ac)^2 + 2acbd + (bd)^2)} \\ &= \sqrt{(ad)^2 + (bc)^2 - 2acbd} \\ &= \sqrt{(ad - bc)^2} \\ &= |ad - bc| \end{aligned}$$

So the area recovers the familiar formula $\det A = ad - bc$ up to sign (the area can't see the orientation of the vectors).
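
A quick numeric sanity check (with arbitrarily chosen vectors) that the area formula agrees with numpy's determinant:

import numpy as np

v = np.array([3.0, 1.0])
w = np.array([1.0, 2.0])

area = np.sqrt(np.inner(v, v) * np.inner(w, w) - np.inner(v, w)**2)
det = np.linalg.det(np.array([v, w]))  # rows of the matrix are v and w

assert np.isclose(area, abs(det))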

General inner products

Just as there are many kinds of vectors, there are many kinds of inner products. For instance, on the space $\mathbf C^{[a,b]}$ of continuous functions on $[a,b]$ we can define an inner product

$$\langle f, g\rangle = \int_a^b f(x)g(x)\,dx$$

Intuitively this is the continuous version of the dot product, where we treat functions as infinite-dimensional vectors.
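
To make this concrete, here it is in sympy for a couple of functions on $[-\pi,\pi]$ (the same interval and inner product the demo uses):

import sympy as sp

x = sp.Symbol('x')

def inner(f, g):
    # <f, g> = integral of f*g over [-pi, pi]
    return sp.integrate(f*g, (x, -sp.pi, sp.pi))

print(inner(sp.sin(x), x))          # 2*pi
print(inner(sp.sin(x), sp.sin(x)))  # pi, so ||sin|| = sqrt(pi)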

Just as we abstracted the notion of a "vector", we also abstract the notion of an inner product to be anything that obeys a set of rules:

  1. Positivity: $\langle v, v\rangle \ge 0$
  2. Definiteness: $\langle v, v\rangle = 0$ if and only if $v = 0$
  3. Symmetry: $\langle v,w\rangle = \langle w, v\rangle$
  4. Additive: $\langle u+v, w\rangle = \langle u,w\rangle + \langle v, w\rangle$
  5. Multiplicative: $\lambda\langle v,w\rangle = \langle\lambda v,w\rangle$

Exercise: Convince yourself that both inner products I've given so far satisfy these properties.

To generalize length we define the norm of a vector $v$ as $\|v\| = \sqrt{\langle v,v\rangle}$

Exercise: Generalize the law of cosines to any inner product $\langle\cdot\rangle$.

Solution

Exactly the same steps as with the law of cosines

$$\begin{aligned} \|v - w\|^2 &= \langle v-w, v-w\rangle \\ &= \langle v,v\rangle - 2\langle v,w\rangle + \langle w,w\rangle \\ &= \|v\|^2 + \|w\|^2 - 2\langle v,w\rangle \end{aligned}$$

If we were in $\mathbf R^n$ we could use the geometric interpretation to write $\langle v,w\rangle = \|v\|\|w\|\cos\theta$ like we did for the law of cosines.

Exercise: In the geometric interpretation we showed the dot product can be used to compute the cosine of the angle between two vectors; ponder the infinite-dimensional analog of this in function space (the cosine similarity of two functions!)

Solution

With dot products we had

$$\cos\theta = \frac{\langle v, w\rangle}{\sqrt{\langle v,v\rangle\cdot \langle w,w\rangle}}$$

Since the integral inner product is just a limit of dot products as the vectors get bigger and the samples become finer, it makes sense to define the cosine similarity between functions this way. We could also extract the angle using the inverse cosine if we felt like it, and talk about the angle between functions!
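
As a small sketch, here's the cosine similarity between $\sin x$ and $x$ over $[-\pi,\pi]$ computed this way:

import sympy as sp

x = sp.Symbol('x')

def inner(f, g):
    return sp.integrate(f*g, (x, -sp.pi, sp.pi))

f, g = sp.sin(x), x
cos_theta = inner(f, g) / sp.sqrt(inner(f, f) * inner(g, g))
print(cos_theta.evalf())  # about 0.78, so sin(x) and x point in roughly the same "direction"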

Exercise (hard): Why do we need $f$ and $g$ to be continuous and not just integrable? (hint: think about property 2)

Solution

The function

$$f(x) = \begin{cases} 0 & x \ne 0 \\ 1 & x = 0 \end{cases}$$

is integrable on $[-1,1]$ with $\langle f,f\rangle = \int_{-1}^1 f(x)^2\,dx = 0$, but $f \ne 0$, breaking requirement 2.

Showing this can't happen with continuous functions is tricky, but it boils down to continuity: if $f(x)^2 > 0$ at some point $x$ then there's a whole neighborhood around $x$ with $f(y)^2 > 0$, which contributes nonzero area to the integral.

Orthogonality

We say two vectors $v$ and $w$ are orthogonal if $\langle v, w\rangle = 0$

Exercise: Prove the Pythagorean theorem for an arbitrary inner product, ie. prove that $\langle v,w\rangle = 0$ implies $\|v+w\|^2 = \|v\|^2 + \|w\|^2$.

Solution

We expand using the linearity of inner products and cancel the $2\langle v,w\rangle$ term

$$\begin{aligned} \|v+w\|^2 &= \langle v+w,v+w\rangle \\ &= \langle v,v\rangle + 2\langle v,w\rangle + \langle w,w\rangle \\ &= \|v\|^2 + \|w\|^2 \end{aligned}$$

Exercise: If $\langle \cdot\rangle$ is the dot product and $v,w \in \mathbf R^n$ are nonzero vectors, show $v$ and $w$ are orthogonal if and only if they are perpendicular (hint: use the geometric interpretation)

Solution

To see this directly notice the projection is zero exactly when $v,w$ are perpendicular. (see this video)

We can also use the geometric formula

$$\langle v, w\rangle = \|v\|\|w\|\cos\theta$$

Because $v,w$ are nonzero we have $\|v\| > 0$ and $\|w\| > 0$, meaning this is only zero when $\cos\theta = 0$, which happens exactly when $\theta$ is $90$ degrees, ie. when $v$ and $w$ are perpendicular.

Bases

A basis for a vector space $V$ is a list $v_1,\dots,v_n$ such that

  1. Spanning: Every $v \in V$ can be written $v = a_1v_1 + \dots + a_nv_n$ for some scalars $a_1,\dots,a_n$.
  2. Independent: The representation of $v$ as a combination of $v_1,\dots,v_n$ is unique.

Exercise: Convince yourself $(1,0)$ and $(0,1)$ form a basis for $\mathbf R^2$

Exercise: Show $(1, 2)$ and $(0, 2)$ form a basis of $\mathbf R^2$

Solution

Every $(x,y) \in \mathbf R^2$ can be written

$$(x,y) = x\cdot(1,2) + \frac{y - 2x}{2}\cdot(0, 2)$$

Hence $(1,2), (0,2)$ is spanning. To show independence, note that

$$a_1(1,2) + a_2(0,2) = b_1(1,2) + b_2(0,2)$$

implies (via rearrangement)

$$(a_1-b_1)(1,2) = (b_2-a_2)(0,2)$$

Since the first coordinates must be equal we get $a_1-b_1 = 0$, which makes the left hand side zero, so we must have $(b_2-a_2)(0, 2) = 0$ as well, implying $b_2 - a_2 = 0$. Thus $a_1 = b_1$ and $a_2 = b_2$, and we have shown independence.
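
As a side note, finding the coefficients for a given vector in a non-standard basis amounts to solving a small linear system; here's a numpy sketch using the basis above:

import numpy as np

B = np.array([[1.0, 0.0],
              [2.0, 2.0]])       # columns are the basis vectors (1,2) and (0,2)
target = np.array([3.0, 4.0])    # the vector we want to express

a = np.linalg.solve(B, target)   # coefficients a_1, a_2
print(a)                         # [ 3. -1.]

assert np.allclose(a[0]*np.array([1, 2]) + a[1]*np.array([0, 2]), target)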

We say a basis is orthogonal if $\langle v_j, v_k\rangle = 0$ when $j \ne k$. And we say a basis is orthonormal if it's orthogonal and $\langle v_j, v_j\rangle = 1$ (ie. the basis vectors have unit length).

Orthogonalization

Given a basis $v_1,\dots,v_n$ of $V$ it's possible to construct an orthonormal basis $e_1,\dots,e_n$ of $V$.

Start by setting $e_1 = v_1/\|v_1\|$ so $\langle e_1,e_1\rangle = 1$. Now we want $e_2$ in terms of $v_2$ such that $\langle e_1, e_2\rangle = 0$.

To make this inner product zero, we subtract from $v_2$ its projection onto $e_1$ (see the geometric interpretation for why I call it a projection)

$$e_2 = v_2 - \langle e_1, v_2\rangle e_1$$

Exercise: Verify $\langle e_1, e_2\rangle = 0$ using linearity (hint: $\langle e_1,v_2\rangle$ is just a number)

Solution
$$\begin{aligned} \langle e_1,e_2\rangle &= \langle e_1, v_2 - \langle e_1, v_2\rangle e_1\rangle \\ &= \langle e_1, v_2\rangle - \langle e_1, v_2\rangle \langle e_1, e_1\rangle \\ &= \langle e_1, v_2\rangle - \langle e_1, v_2\rangle \\ &= 0 \end{aligned}$$

I left out the step where we normalize $e_2$: after making it orthogonal we normalize it too, in programming terms $e_2 \leftarrow e_2/\|e_2\|$.

In general we subtract the projection onto each of the previous basis vectors

$$e_i = v_i - \sum_{j=1}^{i-1} \langle v_i, e_j\rangle e_j$$

Exercise: Show $\langle e_i, e_j\rangle = 0$ (hint: without loss of generality assume $j < i$)

Solution

Using the definition of $e_i$ and the orthonormality of the previously constructed vectors gives

$$\begin{aligned} \langle e_j, e_i\rangle &= \left\langle e_j, v_i - \sum_{k=1}^{i-1} \langle v_i,e_k\rangle e_k\right\rangle \\ &= \langle e_j, v_i\rangle - \sum_{k=1}^{i-1} \langle v_i, e_k\rangle \langle e_j, e_k\rangle \\ &= \langle e_j, v_i\rangle - \langle v_i, e_j\rangle \langle e_j, e_j\rangle \\ &= \langle e_j, v_i\rangle - \langle v_i, e_j\rangle \\ &= 0 \end{aligned}$$

The key difference from the $\langle e_1,e_2\rangle$ exercise above is using the orthogonality to cancel the other terms in the sum.

Exercise: Translate the mathematical algorithm I've described into python code using numpy and test it

Solution
import numpy as np

def normalize(v):
  # scale v to unit length
  return v / np.sqrt(np.inner(v,v))

# the basis we want to orthogonalize
v = [np.array(vi) for vi in [(1,4,2), (3,2,1), (5,1,2)]]
e = [normalize(v[0])]

for i in range(1, len(v)):
  # subtract the projection onto each previous e_j, then normalize
  e.append(normalize(
    v[i] - sum(
      np.inner(v[i], e[j])*e[j]
      for j in range(i)
    )
  ))

# test it: distinct basis vectors should be orthogonal
for i in range(len(v)):
  for j in range(i):
    assert np.allclose(np.inner(e[i], e[j]), 0)

This method is called the Gram–Schmidt process.

Least squares

We finally get to solve the approximation problem! We have a vector space $V$, a subspace $U$ and some vector $v \in V$. We want to find the closest vector to $v$ in $U$ (ie. the projection).

If you've been following along and doing the exercises you're close to understanding the method. The main ideas are:

  1. We can write $v = a_1u_1+\dots+a_nu_n + b_1v_1+\dots+b_mv_m$ where $u_1,\dots,u_n,v_1,\dots,v_m$ is an orthonormal basis of $V$, and $u_1,\dots,u_n$ is an orthonormal basis of $U$.
  2. We can extract $a_1,\dots,a_n$ using inner products and orthogonality, meaning we can extract the component of $v$ in $U$, giving us the projection.

Exercise: Find a simple formula for $a_j$ in terms of $v$ and $u_j$

Solution
$$a_j = \langle v, u_j\rangle$$

Since orthogonality kills off every other term in

$$v = a_1u_1 + \dots + a_nu_n + b_1v_1 + \dots + b_mv_m$$

The above exercise allows us to define the projection $P_U(v)$ of $v$ as

$$P_U(v) = \langle u_1, v\rangle u_1 + \dots + \langle u_n, v\rangle u_n$$
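
Here's a tiny numpy sketch of $P_U$ where $V = \mathbf R^3$ and $U$ is the $xy$-plane, with orthonormal basis $u_1 = (1,0,0)$ and $u_2 = (0,1,0)$:

import numpy as np

u1 = np.array([1.0, 0.0, 0.0])
u2 = np.array([0.0, 1.0, 0.0])
v = np.array([2.0, 3.0, 5.0])

# P_U(v) = <u1, v> u1 + <u2, v> u2
proj = np.inner(u1, v)*u1 + np.inner(u2, v)*u2
print(proj)  # [2. 3. 0.]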

Exercise (optional): Prove $P_U : V \to U$ is a linear map (if you know what that means)

Solution

Via substitution and linearity of $\langle\cdot\rangle$ we see

$$\begin{aligned} P_U(v+w) &= \langle u_1, v+w\rangle u_1 + \dots + \langle u_n, v+w\rangle u_n \\ &= \langle u_1, v\rangle u_1 + \dots + \langle u_n, v\rangle u_n \\ &\quad + \langle u_1, w\rangle u_1 + \dots + \langle u_n, w\rangle u_n \\ &= P_U(v) + P_U(w) \end{aligned}$$

The $P_U(\lambda v)$ case is the same.

Furthermore, since $P_U(u_j) = u_j$ and $P_U(v_j) = 0$, the matrix of $P_U$ in the $u_1,\dots,u_n,v_1,\dots,v_m$ basis is the identity (for the $u$'s) followed by zeros (for the $v$'s)

Notice $P_U(v) = a_1u_1+\dots+a_nu_n$: we extracted the $U$ component of $v$ using inner products and orthogonality!

Now, intuitively it makes sense for $P_U(v)$ to be the closest vector to $v$, and in $\mathbf R^n$ the intuition is correct. But for arbitrary vector spaces we need a more general argument.

Consider the remainder (error) vector $r = P_U(v) - v$. The key fact: the remainder is orthogonal to $U$

Exercise: Show $\langle r, u\rangle = 0$ for all $u \in U$ (hint: use $U$'s basis and linearity of $\langle\cdot\rangle$)

Solution

Let $u \in U$ and let $c_1,\dots,c_n$ be such that

$$u = c_1u_1+\dots+c_nu_n$$

Meaning

$$\langle r, u\rangle = c_1\langle r, u_1\rangle + \dots + c_n\langle r, u_n\rangle$$

Now we expand the $r = P_U(v) - v$ part of $\langle r, u_j\rangle$

$$\langle r, u_j\rangle = \langle P_U(v), u_j\rangle - \langle v, u_j\rangle$$

Now from the definition of $P_U(v)$ and orthonormality we see $\langle P_U(v), u_j\rangle = \langle u_j, v\rangle = \langle v, u_j\rangle$, meaning

$$\langle r, u_j\rangle = \langle P_U(v), u_j\rangle - \langle v, u_j\rangle = 0$$

Applying this for $j=1,\dots,n$ gives $\langle r, u\rangle = 0$ as desired.

Because of this, any tweak $du \in U$ to $P_U(v)$ can only move us away from $v$! We can prove this using the generalized Pythagorean theorem, since $r$ and $du$ are orthogonal.

$$\begin{aligned} \|(P_U(v) + du) - v\|^2 &= \|r + du\|^2 \\ &= \|r\|^2 + \|du\|^2 \end{aligned}$$

This has a minimum at $du = 0$, meaning we can't do better!
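
Continuing the little $\mathbf R^3$ sketch from above, here's a numeric check of this argument for one arbitrary tweak $du$ inside $U$:

import numpy as np

u1 = np.array([1.0, 0.0, 0.0])
u2 = np.array([0.0, 1.0, 0.0])
v = np.array([2.0, 3.0, 5.0])

proj = np.inner(u1, v)*u1 + np.inner(u2, v)*u2   # P_U(v)
r = proj - v                                     # remainder, orthogonal to U

du = 0.5*u1 - 2.0*u2                             # an arbitrary tweak inside U
# the Pythagorean split: the tweak can only add error
assert np.isclose(np.linalg.norm(r + du)**2,
                  np.linalg.norm(r)**2 + np.linalg.norm(du)**2)
assert np.linalg.norm(proj + du - v) >= np.linalg.norm(r)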

Polynomial approximation

We've covered all the linear algebra we need! It's time to walk through approximating $\sin(x)$ over $[-\pi, \pi]$ with a fifth degree polynomial.

Here's a simplified version of the code, which you should be able to understand!

import sympy as sp
x = sp.Symbol('x')

# function we're approximating
def f(x):
    return sp.sin(x)

# inner product <f, g> on [a, b]
a, b = -sp.pi, sp.pi
def inner(f, g):
    return sp.integrate(f*g, (x, a, b))

def normalize(v):
    return v / sp.sqrt(inner(v, v))

# basis of U (polynomials up to degree 5)
B = [1, x, x**2, x**3, x**4, x**5]

# orthogonalize B with Gram-Schmidt
O = [normalize(B[0])]
for i in range(1, len(B)):
    O.append(normalize(
        B[i] - sum(
            inner(B[i], O[j])*O[j]  # <v_i, e_j> e_j
            for j in range(0, i)
        )
    ))

# get coefficients <f, e_j>
coeffs = [inner(f(x), O[j]) for j in range(len(O))]

# turn the coefficients into a sympy polynomial (expand to collect like terms)
poly = sp.expand(sum(O[j]*coeffs[j] for j in range(len(coeffs))))

# print the polynomial as latex!
print(sp.latex(poly))
# print the polynomial in numeric form
print(sp.latex(poly.evalf()))
# => 0.00564312 x^{5} - 0.1552714 x^{3} + 0.9878622 x

Full code (with more features) is here. Have fun playing with it!

If you don't understand, tell me! I'll be happy to explain; I want this to be understandable to everyone.

Notation

We denote vectors in a vector space $V$ by $v,w,u \in V$. If we're dealing with a subspace $U$, then $u$ denotes a vector $u \in U$.

Scalars (real numbers) are denoted by $\lambda$ or $a,b,c$. Often when taking linear combinations we have a list of scalars $a_1,\dots,a_n$.

We denote the inner product between $v$ and $w$ as $\langle v, w\rangle$. The specific inner product is clear from context. In other sources the dot product is written $v^Tw$ or $v\cdot w$; these notations are specific to the dot product and don't extend to general inner products.