Least Squares / Curve Fitting
Given a set of data points in the plane, we ask: how do we find a “best fit line”?
Finding such a line would let us make generalizations and inferences from our data!
Least Squares Solutions
Suppose we have an inconsistent linear system $A\vec{x} = \vec{b}$. As it is inconsistent, we simply cannot find an $\vec{x}$ satisfying the system.
This makes sense for our purposes! It’s often impossible to find a line passing through all of the points (the points may not be collinear). However, it is possible to find the line that minimizes its total (squared) distance to the points! This is the idea behind least squares.
A least squares solution of $A\vec{x} = \vec{b}$ is a vector $\hat{x}$ that minimizes the distance $\|\vec{b} - A\hat{x}\|$ from $A\hat{x}$ to $\vec{b}$, known as the least squares error.
In the curve-fitting setting, the squared error $\|\vec{b} - A\hat{x}\|^2$ is the sum of the squares of the vertical distances between our line and the actual data points.
Note that with this minimizer, for any other $\vec{x}$ in $\mathbb{R}^n$,
$$\|\vec{b} - A\hat{x}\| \le \|\vec{b} - A\vec{x}\|.$$
We first build up some context below.
Finding the Least Squares Solution
Dot Products
Given $\vec{u}, \vec{v}$ in $\mathbb{R}^n$, their dot product is
$$\vec{u} \cdot \vec{v} = u_1 v_1 + u_2 v_2 + \cdots + u_n v_n = \vec{u}^T \vec{v}.$$
The dot product relates to a handful of geometric quantities like lengths and angles.
The length (norm) of a vector $\vec{v}$ is given as
$$\|\vec{v}\| = \sqrt{\vec{v} \cdot \vec{v}} = \sqrt{v_1^2 + v_2^2 + \cdots + v_n^2}.$$
And for any $\vec{u}, \vec{v}$ in $\mathbb{R}^n$, we have
$$\vec{u} \cdot \vec{v} = \|\vec{u}\| \, \|\vec{v}\| \cos(\theta),$$
where $\theta$ is the angle between $\vec{u}$ and $\vec{v}$.
We say $\vec{u}, \vec{v}$ are orthogonal if $\vec{u} \cdot \vec{v} = 0$.
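For example, $\vec{u} = (1, 2)$ and $\vec{v} = (2, -1)$ are orthogonal, since $\vec{u} \cdot \vec{v} = (1)(2) + (2)(-1) = 0$; geometrically, these two vectors are perpendicular.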
Note that this also means that $\vec{0}$ is the only vector orthogonal to every vector in $\mathbb{R}^n$: any nonzero $\vec{v}$ fails to be orthogonal to itself, since $\vec{v} \cdot \vec{v} = \|\vec{v}\|^2 > 0$.
Orthogonal Projections
Let $W$ be a subspace of $\mathbb{R}^n$ ($W$ is the span of some collection of vectors: a line through the origin, a plane through the origin, etc).
The orthogonal projection of $\vec{b}$ onto a subspace $W$, denoted $\operatorname{proj}_W \vec{b}$, is the closest vector in $W$ to $\vec{b}$. It satisfies the property that
$\vec{b} - \operatorname{proj}_W \vec{b}$ is orthogonal to every possible $\vec{w}$ in $W$.
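For instance, projecting $\vec{b} = (3, 4)$ onto $W = \operatorname{span}\{(1, 0)\}$ (the $x$-axis) gives $\operatorname{proj}_W \vec{b} = (3, 0)$, and the difference $\vec{b} - \operatorname{proj}_W \vec{b} = (0, 4)$ is indeed orthogonal to every vector on the $x$-axis.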
The column space $\operatorname{Col}(A)$ of an $m \times n$ matrix $A$ is the span of its columns (the set of all linear combinations of its columns). Note that $\operatorname{Col}(A)$ is a subspace of $\mathbb{R}^m$.
Every linear combination of the columns of $A$ has the form $A\vec{x}$ for some $\vec{x}$ in $\mathbb{R}^n$. So
$$\operatorname{Col}(A) = \{ A\vec{x} : \vec{x} \in \mathbb{R}^n \}.$$
Here, $\vec{b} \in \operatorname{Col}(A)$ if and only if there is some $\vec{x}$ such that $A\vec{x} = \vec{b}$, if and only if $A\vec{x} = \vec{b}$ is a consistent linear system.
Now suppose $A\vec{x} = \vec{b}$ is inconsistent. By the above, since there is no solution, $\vec{b} \notin \operatorname{Col}(A)$. Let $\hat{b} = \operatorname{proj}_{\operatorname{Col}(A)} \vec{b}$. Then, because $\hat{b} \in \operatorname{Col}(A)$, there exists an $\hat{x}$ such that
$$A\hat{x} = \hat{b} = \operatorname{proj}_{\operatorname{Col}(A)} \vec{b}.$$
Such an $\hat{x}$ is a least squares solution, because $A\hat{x} = \operatorname{proj}_{\operatorname{Col}(A)} \vec{b}$ is the closest vector in $A$’s column space to $\vec{b}$!
We know that the vector $\vec{b} - A\hat{x}$ is orthogonal to every vector in $\operatorname{Col}(A)$. In other words, it is orthogonal to $A\vec{x}$ for any $\vec{x}$ in $\mathbb{R}^n$. So,
$$(A\vec{x}) \cdot (\vec{b} - A\hat{x}) = \vec{x} \cdot \left( A^T (\vec{b} - A\hat{x}) \right) = 0$$
for every $\vec{x}$ in $\mathbb{R}^n$. This means that $A^T(\vec{b} - A\hat{x})$ must be the 0 vector, so
$$A^T A \hat{x} = A^T \vec{b}.$$
This is the equation that a least squares solution must satisfy! So, to find $\hat{x}$, we solve the linear system
$$A^T A \hat{x} = A^T \vec{b}.$$
Note the following:
- $A^T A \hat{x} = A^T \vec{b}$ is always consistent (has a solution)
- $A^T A \hat{x} = A^T \vec{b}$ has a unique solution if and only if $A^T A$ is invertible, if and only if $A$ has linearly independent columns.
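To make this concrete, here is a minimal NumPy sketch (the small example matrix below is made up purely for illustration): it solves the normal equations directly and compares the answer against NumPy’s built-in least squares routine.

```python
import numpy as np

# A small inconsistent system A x = b (values chosen only for illustration).
A = np.array([[1.0, 1.0],
              [1.0, 2.0],
              [1.0, 3.0]])
b = np.array([1.0, 2.0, 2.0])

# Solve the normal equations  A^T A x_hat = A^T b.
x_hat = np.linalg.solve(A.T @ A, A.T @ b)

# Compare with NumPy's built-in least squares routine.
x_lstsq, *_ = np.linalg.lstsq(A, b, rcond=None)

print(x_hat)                          # least squares solution via the normal equations
print(x_lstsq)                        # should match x_hat
print(np.linalg.norm(b - A @ x_hat))  # the least squares error
```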
Example: Least Squares
Find the least squares solutions to a given (inconsistent) system of equations.
We can represent this as the matrix system $A\vec{x} = \vec{b}$.
We now solve $A^T A \hat{x} = A^T \vec{b}$.
Example: Least Squares (2)
| X | Y   |
|---|-----|
| 1 | 1   |
| 2 | 2.4 |
| 3 | 3.6 |
| 4 | 4   |

Find a line best fitting the data.
What we can do is assume we have a line $y = c_0 + c_1 x$, and then sub in the points. Each data point $(x_i, y_i)$ gives one equation $c_0 + c_1 x_i = y_i$, so we get a linear system in $c_0, c_1$ that we can solve with least squares:
$$\begin{bmatrix} 1 & 1 \\ 1 & 2 \\ 1 & 3 \\ 1 & 4 \end{bmatrix} \begin{bmatrix} c_0 \\ c_1 \end{bmatrix} = \begin{bmatrix} 1 \\ 2.4 \\ 3.6 \\ 4 \end{bmatrix}.$$
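Here’s a small NumPy sketch of this fit (the code and variable names are just one way to set it up), solving the normal equations for the table above.

```python
import numpy as np

# Data from the table above.
x = np.array([1.0, 2.0, 3.0, 4.0])
y = np.array([1.0, 2.4, 3.6, 4.0])

# Design matrix for the line y = c0 + c1*x: one row [1, x_i] per data point.
A = np.column_stack([np.ones_like(x), x])

# Least squares solution via the normal equations A^T A c = A^T y.
c0, c1 = np.linalg.solve(A.T @ A, A.T @ y)

print(c0, c1)  # should work out to roughly c0 = 0.2, c1 = 1.02 for this data
```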
We can use least squares to solve a multitude of problems! As long as we can find a corresponding system of equations, we can solve the system with least squares.
Example: Best-Fit Parabola
| x | y     |
|---|-------|
| 1 | -0.15 |
| 2 | 0.68  |
| 3 | 1.21  |
| 4 | 0.86  |
| 5 | -0.08 |

Find a best fit parabola taking on the form $y = c_0 + c_1 x + c_2 x^2$.
We can form a linear system by plugging in our x and y values.
This is a linear system in the unknowns $c_0, c_1, c_2$! We can write it as $A\vec{c} = \vec{y}$, where
$$A = \begin{bmatrix} 1 & 1 & 1 \\ 1 & 2 & 4 \\ 1 & 3 & 9 \\ 1 & 4 & 16 \\ 1 & 5 & 25 \end{bmatrix}, \quad \vec{c} = \begin{bmatrix} c_0 \\ c_1 \\ c_2 \end{bmatrix}, \quad \vec{y} = \begin{bmatrix} -0.15 \\ 0.68 \\ 1.21 \\ 0.86 \\ -0.08 \end{bmatrix}.$$
We can use least squares to solve this!
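As before, here is a minimal NumPy sketch of this fit, using the same normal-equations recipe on the table above.

```python
import numpy as np

# Data from the table above.
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([-0.15, 0.68, 1.21, 0.86, -0.08])

# Design matrix for y = c0 + c1*x + c2*x^2: one row [1, x_i, x_i^2] per point.
A = np.column_stack([np.ones_like(x), x, x**2])

# Solve the normal equations A^T A c = A^T y.
c = np.linalg.solve(A.T @ A, A.T @ y)

print(c)  # roughly [-1.80, 1.93, -0.32]: a downward-opening parabola
```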
But not all models will work with least squares! Least squares requires that we have a linear system, so any model whose unknown parameters do not appear linearly after subbing in the x, y values will not directly work.
Example: Applying Least-Squares to Non-Linear Systems
Suppose we have a non-linear model, i.e., one in which some of the unknown parameters do not appear linearly.
For non-linear models, one workaround is to make an educated guess for the non-linear parameters and then solve the resulting linear system! So, we fix a guess for the non-linear parameters, and then solve for the remaining (linear) parameters with least squares!
We can then compare the least squares errors to find which combination of variables works best!
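As a sketch, suppose (hypothetically) the model were $y = c_0 + c_1 e^{kx}$, with $k$ the non-linear parameter; the data below is also made up purely for illustration. We grid over guesses for $k$, solve for $c_0, c_1$ by least squares for each guess, and keep the guess with the smallest least squares error.

```python
import numpy as np

# Hypothetical data and model y = c0 + c1 * exp(k*x); k is the non-linear parameter.
x = np.array([0.0, 1.0, 2.0, 3.0, 4.0])
y = np.array([1.1, 1.9, 3.2, 5.5, 9.8])

best = None
for k in np.linspace(0.1, 1.0, 50):               # educated guesses for k
    A = np.column_stack([np.ones_like(x), np.exp(k * x)])
    c, *_ = np.linalg.lstsq(A, y, rcond=None)     # linear least squares for c0, c1
    err = np.linalg.norm(y - A @ c)               # least squares error for this guess
    if best is None or err < best[0]:
        best = (err, k, c)

err, k, (c0, c1) = best
print(f"best k ~ {k:.2f}, c0 ~ {c0:.2f}, c1 ~ {c1:.2f}, error ~ {err:.3f}")
```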
Team Ranking
Suppose there is a league of sports teams who play each other. We want to be able to rank them from best to worst.
Rather than looking at their win/loss records, we would like to instead base our rankings on margins of victory, as this will give us less ambiguous results.
The general idea we will use was developed by Kenneth Massey, in his undergraduate thesis.
Example: Looking at Margins of Victory
Consider 3 teams T1, T2, T3
- T1 beats T2
- T2 beats T3
- T3 beats T1
In this case, we can’t really say which team is best, as the situation is completely symmetric! But if we instead had
- T1 beats T2 by 10 points
- T2 beats T3 by 2 points
- T3 beats T1 by 1 point
Then we have more information, and we can say that since T1 completely crushed T2 and only barely lost to T3, T1 seems to be the best!
The idea is to somehow assign each team a rating, say $r_i$ (for team $i$), such that if team $i$ plays team $j$, then the expected margin of victory (or defeat) for team $i$ is given by the difference $r_i - r_j$.
Example: From Massey's Thesis
Suppose we have 4 teams:
- (T1) The Beast Squares
- (T2) The Gaussian Eliminators
- (T3) The Likelihood Loggers
- (T4) The Linear Regressors
Now suppose that T1 beats T2 by 4, T1 beats T4 by 2, T2 beats T3 by 1, T2 loses to T4 by 7, and T3 ties with T4. The idea is that we want to find each team a rating $r_i$ where
$$\begin{aligned} r_1 - r_2 &= 4 \\ r_1 - r_4 &= 2 \\ r_2 - r_3 &= 1 \\ r_4 - r_2 &= 7 \\ r_3 - r_4 &= 0. \end{aligned}$$
This gives us a linear system $A\vec{r} = \vec{b}$, where we can solve for $\vec{r} = (r_1, r_2, r_3, r_4)$! Generally these systems are inconsistent, but we can still find the best $r_i$’s using least squares!
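Here’s a quick NumPy sketch of setting this system up (one row per game); it also checks the rank of $A^T A$, which previews the issue discussed next.

```python
import numpy as np

# One row per game: (winner column) - (loser column) = margin of victory.
A = np.array([
    [ 1, -1,  0,  0],   # T1 beats T2 by 4
    [ 1,  0,  0, -1],   # T1 beats T4 by 2
    [ 0,  1, -1,  0],   # T2 beats T3 by 1
    [ 0, -1,  0,  1],   # T4 beats T2 by 7
    [ 0,  0,  1, -1],   # T3 ties T4
], dtype=float)
b = np.array([4.0, 2.0, 1.0, 7.0, 0.0])

M = A.T @ A
print(np.linalg.matrix_rank(M))   # 3 < 4, so A^T A is not invertible
```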
One issue that happens here is that $A^T A$ is not an invertible matrix (there are infinitely many solutions)! So, we can’t simply compute $\hat{r} = (A^T A)^{-1} A^T \vec{b}$.
Instead, let’s just row reduce $A^T A \hat{r} = A^T \vec{b}$ and find all values the solutions can possibly take on! Here, treating $r_4$ as a free variable, we find our solutions as
$$r_1 = r_4 + 1.125, \quad r_2 = r_4 - 3.75, \quad r_3 = r_4 - 2.375, \quad r_4 \text{ free}.$$
We find that $r_1 > r_4 > r_3 > r_2$, no matter what value of $r_4$ we choose! This is our team order: T1, T4, T3, T2.
Note that we can also use this to predict matches, even if they didn’t occur! For example, $r_1 - r_3 = 1.125 - (-2.375) = 3.5$, so we can predict that team 1 would beat team 3 by 3.5 points!
To remove the (infinitely many) choices of free variable that we have in our solutions, we can impose the constraint that $r_1 + r_2 + r_3 + r_4 = 0$ (force the average rating to be 0) and add that as an additional equation to our initial linear system.
One can prove that with this new requirement, we get a unique least squares solution, and moreover, this solution satisfies the “normalization condition” exactly.
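Here is a NumPy sketch of the full recipe for the four-team example: append the normalization equation $r_1 + r_2 + r_3 + r_4 = 0$ as one extra row of the system, and then solve the (now unique) least squares problem.

```python
import numpy as np

# Same games as above: one row per game, (winner) - (loser) = margin.
A = np.array([
    [ 1, -1,  0,  0],   # T1 beats T2 by 4
    [ 1,  0,  0, -1],   # T1 beats T4 by 2
    [ 0,  1, -1,  0],   # T2 beats T3 by 1
    [ 0, -1,  0,  1],   # T4 beats T2 by 7
    [ 0,  0,  1, -1],   # T3 ties T4
], dtype=float)
b = np.array([4.0, 2.0, 1.0, 7.0, 0.0])

# Append the normalization condition r1 + r2 + r3 + r4 = 0 as one more equation.
A_norm = np.vstack([A, np.ones(4)])
b_norm = np.append(b, 0.0)

# A_norm now has linearly independent columns, so the least squares solution is unique.
r = np.linalg.solve(A_norm.T @ A_norm, A_norm.T @ b_norm)

print(r)            # ratings, ordered r1 > r4 > r3 > r2
print(r[0] - r[2])  # predicted margin of T1 over T3: 3.5, matching the prediction above
```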