Least Squares

Recall our previous example, where we have a system of equations $\mathbb{A}\mathbb{X}=\mathbb{Y}$.
Let's take an example: suppose we have $3$ points of the form $(a_1,a_2)$, namely $(1,1),(2,2),(3,2)$.
Our objective is to find the best possible linear function for $a_2$; call this function $f$.
Our function might not give the exact $a_2$ that corresponds to each $a_1$, but it will give us the best possible approximation of $a_2$.
The simplest such linear function is $a_2 = f(a_1) = x_1 + a_1 x_2$.
Here $x_1, x_2$ are our (unknown) parameters.
Our observations say,
$f(1)=x_1 + 1 x_2 =1$,
$f(2)=x_1 + 2 x_2 =2$,
$f(3)=x_1 + 3 x_2 =2$.
We can also write this in matrix form as,

\underbrace{\begin{bmatrix} 1 & 1 \\ 1 & 2 \\ 1 & 3 \\ \end{bmatrix}}_{\mathbb{A}} \underbrace{\begin{bmatrix} x_1\\ x_2\\ \end{bmatrix}}_{\mathbb{X}} = \underbrace{\begin{bmatrix} 1\\ 2\\ 2\\ \end{bmatrix}}_{\mathbb{Y}}

\mathbb{A}\mathbb{X}=\mathbb{Y}

So we want to find a linear combination of the column vectors of $\mathbb{A}$ that gives us $\mathbb{Y}$, but $\mathbb{Y}$ does not live in the column space of $\mathbb{A}$.
Instead, we will find the vector $\widehat{\mathbb{Y}}$ in the column space of $\mathbb{A}$ that is closest to $\mathbb{Y}$, where closeness is measured by the Euclidean distance between $\mathbb{Y}$ and $\widehat{\mathbb{Y}}$.
So we will solve,

\mathbb{A}\widehat{\mathbb{X}}=\widehat{\mathbb{Y}}

(The hat on $\widehat{\mathbb{X}}$ indicates that our solution is an estimate, since the original system has no exact solution.)
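
The claim that $\mathbb{Y}$ lies outside the column space of $\mathbb{A}$ can be checked numerically. The following is a minimal sketch, assuming NumPy is available; the variable names are illustrative. It compares the rank of $\mathbb{A}$ with the rank of $\mathbb{A}$ augmented by $\mathbb{Y}$: if appending $\mathbb{Y}$ raises the rank, then $\mathbb{Y}$ is not a linear combination of the columns.

```python
import numpy as np

# Data from the example: each row of A is (1, a_1); Y holds the observed a_2 values.
A = np.array([[1.0, 1.0],
              [1.0, 2.0],
              [1.0, 3.0]])
Y = np.array([1.0, 2.0, 2.0])

# Y is in the column space of A exactly when appending Y as an extra
# column does not increase the rank.
rank_A = np.linalg.matrix_rank(A)
rank_augmented = np.linalg.matrix_rank(np.column_stack([A, Y]))

print(rank_A, rank_augmented)    # 2 3
print(rank_A == rank_augmented)  # False, so AX = Y has no exact solution
```
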
$\widehat{\mathbb{Y}}$ lives in the column space of $\mathbb{A}$ and is the point of that space closest to $\mathbb{Y}$, which lies outside it, so
the vector $\mathbb{Y} - \widehat{\mathbb{Y}}$ is perpendicular to the column space of $\mathbb{A}$.
$\Rightarrow \mathbb{Y} - \widehat{\mathbb{Y}}$ is in the null space of $\mathbb{A}^T$.
$\Rightarrow \mathbb{A}^T(\mathbb{Y} - \widehat{\mathbb{Y}})=0$ and we know that $\widehat{\mathbb{Y}}=\mathbb{A}\widehat{\mathbb{X}}$ .
$\Rightarrow \mathbb{A}^T(\mathbb{Y} - \mathbb{A}\widehat{\mathbb{X}})=0$
$\Rightarrow \mathbb{A}^T\mathbb{A}\widehat{\mathbb{X}}=\mathbb{A}^T \mathbb{Y}$

$\mathbb{A}^T\mathbb{A}= \begin{bmatrix} 3 & 6 \\ 6 & 14 \\ \end{bmatrix},\quad$ $\mathbb{A}^T\mathbb{Y}= \begin{bmatrix} 5\\ 11\\ \end{bmatrix}$
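
These two products are easy to reproduce. Here is a small sketch, assuming NumPy; `AtA` and `AtY` are just illustrative names for $\mathbb{A}^T\mathbb{A}$ and $\mathbb{A}^T\mathbb{Y}$.

```python
import numpy as np

# Integer arrays keep the products exact for this small example.
A = np.array([[1, 1],
              [1, 2],
              [1, 3]])
Y = np.array([1, 2, 2])

AtA = A.T @ A  # left-hand side matrix of the normal equations
AtY = A.T @ Y  # right-hand side vector of the normal equations

print(AtA.tolist())  # [[3, 6], [6, 14]]
print(AtY.tolist())  # [5, 11]
```
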

Now we have to solve,

\begin{bmatrix} 3 & 6 \\ 6 & 14 \\ \end{bmatrix} \begin{bmatrix} x_1\\ x_2\\ \end{bmatrix} = \begin{bmatrix} 5\\ 11\\ \end{bmatrix}

We can write it as,
$3x_1 + 6x_2 = 5$
$6x_1 + 14x_2 = 11$
Subtracting twice the first equation from the second gives $2x_2 = 1$, so $x_2=1/2$, and substituting back gives $x_1=2/3$.
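
The same solution can be obtained numerically. The sketch below assumes NumPy and solves the normal equations directly with `np.linalg.solve`; as a cross-check it also calls `np.linalg.lstsq`, which solves the least squares problem from $\mathbb{A}$ and $\mathbb{Y}$ without forming $\mathbb{A}^T\mathbb{A}$ explicitly. The variable names are illustrative.

```python
import numpy as np

A = np.array([[1.0, 1.0],
              [1.0, 2.0],
              [1.0, 3.0]])
Y = np.array([1.0, 2.0, 2.0])

# Solve the normal equations (A^T A) x = A^T Y.
x_hat = np.linalg.solve(A.T @ A, A.T @ Y)

# Cross-check with NumPy's built-in least squares solver.
x_lstsq, *_ = np.linalg.lstsq(A, Y, rcond=None)

print(np.allclose(x_hat, [2 / 3, 1 / 2]))  # True
print(np.allclose(x_hat, x_lstsq))         # True
```
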

So our function $f(a_1) = x_1 + a_1 x_2$ becomes,

f(a_1) = \frac{2}{3} + \frac{1}{2}a_1

Let's take a look at our estimate for each of our $3$ data points and its error (which is $a_2 - \hat{a_2}$ ).

For $(a_1,a_2)=(1,1)$
$\hat{a_2} = \frac{2}{3} + \frac{1}{2}(1)=\frac{7}{6}$
$e_1= 1 - \frac{7}{6} = -\frac{1}{6}$
For $(a_1,a_2)=(2,2)$
$\hat{a_2} = \frac{2}{3} + \frac{1}{2}(2)=\frac{5}{3}$
$e_2= 2 - \frac{5}{3}= \frac{1}{3}$
For $(a_1,a_2)=(3,2)$
$\hat{a_2} = \frac{2}{3} + \frac{1}{2}(3)=\frac{13}{6}$
$e_3= 2 - \frac{13}{6} = -\frac{1}{6}$

Now we represent our estimates and errors as vectors,

\widehat{\mathbb{Y}} = \begin{bmatrix} \frac{7}{6}\\ \\ \frac{5}{3} \\ \\ \frac{13}{6} \\ \end{bmatrix} ,\quad e=\mathbb{Y}-\widehat{\mathbb{Y}} = \begin{bmatrix} -\frac{1}{6}\\ \\ \frac{1}{3} \\ \\ -\frac{1}{6} \\ \end{bmatrix}
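
The estimates and errors above can be recomputed with exact arithmetic. This is a minimal sketch using Python's standard `fractions` module; the helper `f` and the list names are illustrative. It also prints the squared Euclidean distance between $\mathbb{Y}$ and $\widehat{\mathbb{Y}}$, which is the quantity that least squares minimizes.

```python
from fractions import Fraction

# Fitted parameters from the worked example.
x1, x2 = Fraction(2, 3), Fraction(1, 2)
points = [(1, 1), (2, 2), (3, 2)]


def f(a1):
    """Fitted line f(a_1) = x_1 + a_1 * x_2."""
    return x1 + a1 * x2


estimates = [f(a1) for a1, _ in points]     # entries of Y-hat
errors = [a2 - f(a1) for a1, a2 in points]  # entries of Y - Y-hat

print([str(v) for v in estimates])  # ['7/6', '5/3', '13/6']
print([str(v) for v in errors])     # ['-1/6', '1/3', '-1/6']

# Squared Euclidean distance between Y and Y-hat.
squared_distance = sum(e * e for e in errors)
print(squared_distance)  # 1/6
```
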

As discussed above, $\widehat{\mathbb{Y}}$ is in the column space of $\mathbb{A}$ and $\mathbb{Y}-\widehat{\mathbb{Y}}$ is perpendicular to that column space.
We can now see this in our example.
First notice that the dot product of $\widehat{\mathbb{Y}}$ and $\mathbb{Y}-\widehat{\mathbb{Y}}$ is $0$ .

\widehat{\mathbb{Y}}\cdot(\mathbb{Y}-\widehat{\mathbb{Y}})=\widehat{\mathbb{Y}}^T(\mathbb{Y}-\widehat{\mathbb{Y}})=-\frac{7}{36}+\frac{20}{36}-\frac{13}{36}=0

Since $\mathbb{Y}-\widehat{\mathbb{Y}}$ is perpendicular to the whole column space, you can take any linear combination of the columns of $\mathbb{A}$ and it will be perpendicular to $\mathbb{Y}-\widehat{\mathbb{Y}}$ .
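
The perpendicularity can be verified with exact arithmetic. The sketch below uses Python's standard `fractions` module; the `dot` helper and the coefficients `c1`, `c2` are hypothetical choices made for illustration, and any other coefficients would work equally well.

```python
from fractions import Fraction as F

# Columns of A, the estimate Y-hat, and the error vector e = Y - Y-hat.
col1 = [F(1), F(1), F(1)]
col2 = [F(1), F(2), F(3)]
Y_hat = [F(7, 6), F(5, 3), F(13, 6)]
e = [F(-1, 6), F(1, 3), F(-1, 6)]


def dot(u, v):
    """Dot product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


# e is perpendicular to each column of A, i.e. A^T e = 0.
print(dot(col1, e), dot(col2, e))  # 0 0

# Y-hat lies in the column space, so it is perpendicular to e as well.
print(dot(Y_hat, e))  # 0

# An arbitrary linear combination of the columns (coefficients chosen freely).
c1, c2 = F(5), F(-3)
combo = [c1 * u + c2 * v for u, v in zip(col1, col2)]
print(dot(combo, e))  # 0
```
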