Least Squares / Curve Fitting
Given a set of data points in the plane, we ask: how do we find a “best fit line”?
Finding such a line would let us make generalizations and inferences from our data!
Least Squares Solutions
Suppose we have an inconsistent linear system $A\vec{x} = \vec{b}$. As it is inconsistent, we simply cannot find an $\vec{x}$ satisfying the system.
This makes sense for our purposes! It’s often impossible to find a line passing through all of the points (the points may not be collinear). However, it is possible to find the line that minimizes its total (squared) distance to the points! This is the idea behind least squares.
A least squares solution of $A\vec{x} = \vec{b}$ is a vector $\hat{x}$ that minimizes the distance $\|\vec{b} - A\hat{x}\|$ from $A\hat{x}$ to $\vec{b}$, known as the least squares error.
In the curve-fitting setting, the squared error $\|\vec{b} - A\hat{x}\|^2$ is the sum of the squares of the vertical distances between our line and the actual data points.
Note that with this minimizer, for any other $\vec{x}$ in $\mathbb{R}^n$,
$$\|\vec{b} - A\hat{x}\| \le \|\vec{b} - A\vec{x}\|.$$
We first build up some context below.
Finding the Least Squares Solution
Dot Products
Given $\vec{u}, \vec{v}$ in $\mathbb{R}^n$, their dot product is
$$\vec{u} \cdot \vec{v} = u_1 v_1 + u_2 v_2 + \cdots + u_n v_n = \vec{u}^T \vec{v}.$$
The dot product relates to a handful of geometric quantities like lengths and angles.
The length (norm) of a vector $\vec{v}$ is given as
$$\|\vec{v}\| = \sqrt{\vec{v} \cdot \vec{v}} = \sqrt{v_1^2 + v_2^2 + \cdots + v_n^2}.$$
And for any $\vec{u}, \vec{v}$ in $\mathbb{R}^n$, we have
$$\vec{u} \cdot \vec{v} = \|\vec{u}\| \, \|\vec{v}\| \cos(\theta),$$
where $\theta$ is the angle between $\vec{u}$ and $\vec{v}$.
We say $\vec{u}, \vec{v}$ are orthogonal if $\vec{u} \cdot \vec{v} = 0$.
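For example, $\vec{u} = (1, 2)$ and $\vec{v} = (2, -1)$ are orthogonal, since $\vec{u} \cdot \vec{v} = (1)(2) + (2)(-1) = 0$; geometrically, these two vectors are perpendicular.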
Note that this also means that $\vec{0}$ is the only vector orthogonal to every vector in $\mathbb{R}^n$: any nonzero $\vec{v}$ fails to be orthogonal to itself, since $\vec{v} \cdot \vec{v} = \|\vec{v}\|^2 > 0$.
Orthogonal Projections
Let $W$ be a subspace of $\mathbb{R}^n$ ($W$ is the span of some collection of vectors: a line through the origin, a plane through the origin, etc).
The orthogonal projection of $\vec{b}$ onto a subspace $W$, denoted $\operatorname{proj}_W \vec{b}$, is the closest vector in $W$ to $\vec{b}$. It satisfies the property that
$\vec{b} - \operatorname{proj}_W \vec{b}$ is orthogonal to every possible $\vec{w}$ in $W$.
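For instance, projecting $\vec{b} = (3, 4)$ onto $W = \operatorname{span}\{(1, 0)\}$ (the $x$-axis) gives $\operatorname{proj}_W \vec{b} = (3, 0)$, and the difference $\vec{b} - \operatorname{proj}_W \vec{b} = (0, 4)$ is indeed orthogonal to every vector on the $x$-axis.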
The column space $\operatorname{Col}(A)$ of an $m \times n$ matrix $A$ is the span of its columns (the set of all linear combinations of its columns). Note that $\operatorname{Col}(A)$ is a subspace of $\mathbb{R}^m$.
Every linear combination of the columns of $A$ has the form $A\vec{x}$ for some $\vec{x}$ in $\mathbb{R}^n$. So
$$\operatorname{Col}(A) = \{ A\vec{x} : \vec{x} \in \mathbb{R}^n \}.$$
Here, $\vec{b} \in \operatorname{Col}(A)$ if and only if there is some $\vec{x}$ such that $A\vec{x} = \vec{b}$, if and only if $A\vec{x} = \vec{b}$ is a consistent linear system.
Now suppose $A\vec{x} = \vec{b}$ is inconsistent. By the above, since there is no solution, $\vec{b} \notin \operatorname{Col}(A)$. Let $\hat{b} = \operatorname{proj}_{\operatorname{Col}(A)} \vec{b}$. Then, because $\hat{b} \in \operatorname{Col}(A)$, there exists an $\hat{x}$ such that
$$A\hat{x} = \hat{b} = \operatorname{proj}_{\operatorname{Col}(A)} \vec{b}.$$
Such an $\hat{x}$ is a least squares solution, because $A\hat{x} = \operatorname{proj}_{\operatorname{Col}(A)} \vec{b}$ is the closest vector in $A$’s column space to $\vec{b}$!
We know that the vector $\vec{b} - A\hat{x}$ is orthogonal to every vector in $\operatorname{Col}(A)$. In other words, it is orthogonal to $A\vec{x}$ for any $\vec{x}$ in $\mathbb{R}^n$. So,
$$(A\vec{x}) \cdot (\vec{b} - A\hat{x}) = \vec{x} \cdot \left( A^T (\vec{b} - A\hat{x}) \right) = 0$$
for every $\vec{x}$ in $\mathbb{R}^n$. This means that $A^T(\vec{b} - A\hat{x})$ must be the 0 vector, so
$$A^T A \hat{x} = A^T \vec{b}.$$
This is the equation that a least squares solution must satisfy! So, to find $\hat{x}$, we solve the linear system
$$A^T A \hat{x} = A^T \vec{b}.$$
Note the following:
- $A^T A \hat{x} = A^T \vec{b}$ is always consistent (has a solution)
- $A^T A \hat{x} = A^T \vec{b}$ has a unique solution if and only if $A^T A$ is invertible, if and only if $A$ has linearly independent columns.
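To make this concrete, here is a minimal NumPy sketch (the small example matrix below is made up purely for illustration): it solves the normal equations directly and compares the answer against NumPy’s built-in least squares routine.

```python
import numpy as np

# A small inconsistent system A x = b (values chosen only for illustration).
A = np.array([[1.0, 1.0],
              [1.0, 2.0],
              [1.0, 3.0]])
b = np.array([1.0, 2.0, 2.0])

# Solve the normal equations  A^T A x_hat = A^T b.
x_hat = np.linalg.solve(A.T @ A, A.T @ b)

# Compare with NumPy's built-in least squares routine.
x_lstsq, *_ = np.linalg.lstsq(A, b, rcond=None)

print(x_hat)                          # least squares solution via the normal equations
print(x_lstsq)                        # should match x_hat
print(np.linalg.norm(b - A @ x_hat))  # the least squares error
```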
Example: Least Squares
Find the least squares solutions to a given (inconsistent) system of equations.
We can represent this as the matrix system $A\vec{x} = \vec{b}$.
We now solve $A^T A \hat{x} = A^T \vec{b}$.
Example: Least Squares (2)
| X | Y   |
|---|-----|
| 1 | 1   |
| 2 | 2.4 |
| 3 | 3.6 |
| 4 | 4   |

Find a line best fitting the data.
What we can do is assume we have a line $y = c_0 + c_1 x$, and then sub in the points. Each data point $(x_i, y_i)$ gives one equation $c_0 + c_1 x_i = y_i$, so we get a linear system in $c_0, c_1$ that we can solve with least squares:
$$\begin{bmatrix} 1 & 1 \\ 1 & 2 \\ 1 & 3 \\ 1 & 4 \end{bmatrix} \begin{bmatrix} c_0 \\ c_1 \end{bmatrix} = \begin{bmatrix} 1 \\ 2.4 \\ 3.6 \\ 4 \end{bmatrix}.$$
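Here’s a small NumPy sketch of this fit (the code and variable names are just one way to set it up), solving the normal equations for the table above.

```python
import numpy as np

# Data from the table above.
x = np.array([1.0, 2.0, 3.0, 4.0])
y = np.array([1.0, 2.4, 3.6, 4.0])

# Design matrix for the line y = c0 + c1*x: one row [1, x_i] per data point.
A = np.column_stack([np.ones_like(x), x])

# Least squares solution via the normal equations A^T A c = A^T y.
c0, c1 = np.linalg.solve(A.T @ A, A.T @ y)

print(c0, c1)  # should work out to roughly c0 = 0.2, c1 = 1.02 for this data
```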
We can use least squares to solve a multitude of problems! As long as we can find a corresponding system of equations, we can solve the system with least squares.
Example: Best-Fit Parabola
| x | y     |
|---|-------|
| 1 | -0.15 |
| 2 | 0.68  |
| 3 | 1.21  |
| 4 | 0.86  |
| 5 | -0.08 |

Find a best fit parabola taking on the form $y = c_0 + c_1 x + c_2 x^2$.
We can form a linear system by plugging in our x and y values.
This is a linear system in the unknowns $c_0, c_1, c_2$! We can write it as $A\vec{c} = \vec{y}$, where
$$A = \begin{bmatrix} 1 & 1 & 1 \\ 1 & 2 & 4 \\ 1 & 3 & 9 \\ 1 & 4 & 16 \\ 1 & 5 & 25 \end{bmatrix}, \quad \vec{c} = \begin{bmatrix} c_0 \\ c_1 \\ c_2 \end{bmatrix}, \quad \vec{y} = \begin{bmatrix} -0.15 \\ 0.68 \\ 1.21 \\ 0.86 \\ -0.08 \end{bmatrix}.$$
We can use least squares to solve this!
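As before, here is a minimal NumPy sketch of this fit, using the same normal-equations recipe on the table above.

```python
import numpy as np

# Data from the table above.
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([-0.15, 0.68, 1.21, 0.86, -0.08])

# Design matrix for y = c0 + c1*x + c2*x^2: one row [1, x_i, x_i^2] per point.
A = np.column_stack([np.ones_like(x), x, x**2])

# Solve the normal equations A^T A c = A^T y.
c = np.linalg.solve(A.T @ A, A.T @ y)

print(c)  # roughly [-1.80, 1.93, -0.32]: a downward-opening parabola
```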
But not all models will work with least squares! Least squares requires that we have a linear system, so any model whose unknown parameters do not appear linearly after subbing in the x, y values will not directly work.
Example: Applying Least-Squares to Non-Linear Systems
Suppose we have a non-linear model, i.e., one in which some of the unknown parameters do not appear linearly.
For non-linear models, one workaround is to make an educated guess for the non-linear parameters and then solve the resulting linear system! So, we fix a guess for the non-linear parameters, and then solve for the remaining (linear) parameters with least squares!
We can then compare the least squares errors to find which combination of variables works best!
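As a sketch, suppose (hypothetically) the model were $y = c_0 + c_1 e^{kx}$, with $k$ the non-linear parameter; the data below is also made up purely for illustration. We grid over guesses for $k$, solve for $c_0, c_1$ by least squares for each guess, and keep the guess with the smallest least squares error.

```python
import numpy as np

# Hypothetical data and model y = c0 + c1 * exp(k*x); k is the non-linear parameter.
x = np.array([0.0, 1.0, 2.0, 3.0, 4.0])
y = np.array([1.1, 1.9, 3.2, 5.5, 9.8])

best = None
for k in np.linspace(0.1, 1.0, 50):               # educated guesses for k
    A = np.column_stack([np.ones_like(x), np.exp(k * x)])
    c, *_ = np.linalg.lstsq(A, y, rcond=None)     # linear least squares for c0, c1
    err = np.linalg.norm(y - A @ c)               # least squares error for this guess
    if best is None or err < best[0]:
        best = (err, k, c)

err, k, (c0, c1) = best
print(f"best k ~ {k:.2f}, c0 ~ {c0:.2f}, c1 ~ {c1:.2f}, error ~ {err:.3f}")
```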
Team Ranking
Suppose there is a league of sports teams who play each other. We want to be able to rank them from best to worst.
Rather than looking at their win/loss records, we would like to instead base our rankings on margins of victory, as this will give us less ambiguous results.
The general idea we will use was developed by Kenneth Massey, in his undergraduate thesis.
Example: Looking at Margins of Victory
Consider 3 teams T1, T2, T3
- T1 beats T2
- T2 beats T3
- T3 beats T1
In this case, we can’t really say which team is best, as the situation is completely symmetric! But if we instead had
- T1 beats T2 by 10 points
- T2 beats T3 by 2 points
- T3 beats T1 by 1 point
Then we have more information, and we can say that since T1 completely crushed T2 and only barely lost to T3, T1 seems to be the best!
The idea is to somehow assign each team a rating, say $r_i$ (for team $i$), such that if team $i$ plays team $j$, then the expected margin of victory (or defeat) for team $i$ is given by the difference $r_i - r_j$.
Example: From Massey's Thesis
Suppose we have 4 teams:
- (T1) The Beast Squares
- (T2) The Gaussian Eliminators
- (T3) The Likelihood Loggers
- (T4) The Linear Regressors
Now suppose that T1 beats T2 by 4, T1 beats T4 by 2, T2 beats T3 by 1, T2 loses to T4 by 7, and T3 ties with T4. The idea is that we want to find each team a rating $r_i$ where
$$\begin{aligned} r_1 - r_2 &= 4 \\ r_1 - r_4 &= 2 \\ r_2 - r_3 &= 1 \\ r_4 - r_2 &= 7 \\ r_3 - r_4 &= 0. \end{aligned}$$
This gives us a linear system $A\vec{r} = \vec{b}$, where we can solve for $\vec{r} = (r_1, r_2, r_3, r_4)$! Generally these systems are inconsistent, but we can still find the best $r_i$’s using least squares!
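Here’s a quick NumPy sketch of setting this system up (one row per game); it also checks the rank of $A^T A$, which previews the issue discussed next.

```python
import numpy as np

# One row per game: (winner column) - (loser column) = margin of victory.
A = np.array([
    [ 1, -1,  0,  0],   # T1 beats T2 by 4
    [ 1,  0,  0, -1],   # T1 beats T4 by 2
    [ 0,  1, -1,  0],   # T2 beats T3 by 1
    [ 0, -1,  0,  1],   # T4 beats T2 by 7
    [ 0,  0,  1, -1],   # T3 ties T4
], dtype=float)
b = np.array([4.0, 2.0, 1.0, 7.0, 0.0])

M = A.T @ A
print(np.linalg.matrix_rank(M))   # 3 < 4, so A^T A is not invertible
```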
One issue that happens here is that $A^T A$ is not an invertible matrix (there are infinitely many solutions)! So, we can’t simply compute $\hat{r} = (A^T A)^{-1} A^T \vec{b}$.
Instead, let’s just row reduce $A^T A \hat{r} = A^T \vec{b}$ and find all values the solutions can possibly take on! Here, treating $r_4$ as a free variable, we find our solutions as
$$r_1 = r_4 + 1.125, \quad r_2 = r_4 - 3.75, \quad r_3 = r_4 - 2.375, \quad r_4 \text{ free}.$$
We find that $r_1 > r_4 > r_3 > r_2$, no matter what value of $r_4$ we choose! This is our team order: T1, T4, T3, T2.
Note that we can also use this to predict matches, even if they didn’t occur! For example, $r_1 - r_3 = 1.125 - (-2.375) = 3.5$, so we can predict that team 1 would beat team 3 by 3.5 points!
To remove the (infinitely many) choices of free variable that we have in our solutions, we can impose the constraint that $r_1 + r_2 + r_3 + r_4 = 0$ (force the average rating to be 0) and add that as an additional equation to our initial linear system.
One can prove that with this new requirement, we get a unique least squares solution, and moreover, this solution satisfies the “normalization condition” exactly.
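Here is a NumPy sketch of the full recipe for the four-team example: append the normalization equation $r_1 + r_2 + r_3 + r_4 = 0$ as one extra row of the system, and then solve the (now unique) least squares problem.

```python
import numpy as np

# Same games as above: one row per game, (winner) - (loser) = margin.
A = np.array([
    [ 1, -1,  0,  0],   # T1 beats T2 by 4
    [ 1,  0,  0, -1],   # T1 beats T4 by 2
    [ 0,  1, -1,  0],   # T2 beats T3 by 1
    [ 0, -1,  0,  1],   # T4 beats T2 by 7
    [ 0,  0,  1, -1],   # T3 ties T4
], dtype=float)
b = np.array([4.0, 2.0, 1.0, 7.0, 0.0])

# Append the normalization condition r1 + r2 + r3 + r4 = 0 as one more equation.
A_norm = np.vstack([A, np.ones(4)])
b_norm = np.append(b, 0.0)

# A_norm now has linearly independent columns, so the least squares solution is unique.
r = np.linalg.solve(A_norm.T @ A_norm, A_norm.T @ b_norm)

print(r)            # ratings, ordered r1 > r4 > r3 > r2
print(r[0] - r[2])  # predicted margin of T1 over T3: 3.5, matching the prediction above
```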