- The answer is the vector $w \vec{x}$, where the scalar $w$ is chosen to minimize the length of the error vector: $$\vec{e} = \vec{y} - w\vec{x}$$
- **Key idea**: To minimize the length of the error vector, choose $w$ so that the error vector is orthogonal to $\vec{x}$.
- **Answer**: It is the vector $w^* \vec{x}$, where: $$w^* = \frac{\vec{x} \cdot \vec{y}}{\vec{x} \cdot \vec{x}}$$
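To make this concrete, here is a minimal NumPy sketch, using made-up vectors $\vec{x}$ and $\vec{y}$, that computes $w^*$ and checks that the resulting error vector is orthogonal to $\vec{x}$:

```python
import numpy as np

# Hypothetical vectors; any x and y of the same length work.
x = np.array([2.0, 1.0, 3.0])
y = np.array([1.0, 4.0, 2.0])

# w* = (x . y) / (x . x), the scalar that minimizes ||y - w x||.
w_star = np.dot(x, y) / np.dot(x, x)

# The closest vector in span(x) to y, and the error vector.
projection = w_star * x
error = y - projection

print(w_star)            # optimal scalar
print(np.dot(x, error))  # ~0: the error vector is orthogonal to x
```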
### Projecting onto the span of multiple vectors

- **Question**: What vector in $\text{span}(\vec{x}^{(1)}, \vec{x}^{(2)})$ is closest to $\vec{y}$?
- The answer is the vector $w_1 \vec{x}^{(1)} + w_2 \vec{x}^{(2)}$, where $w_1$ and $w_2$ are chosen to minimize the length of the error vector: $$\vec{e} = \vec{y} - w_1 \vec{x}^{(1)} - w_2 \vec{x}^{(2)}$$
- **Key idea**: To minimize the length of the error vector, choose $w_1$ and $w_2$ so that the error vector is orthogonal to both $\vec{x}^{(1)}$ and $\vec{x}^{(2)}$.
- If $\vec{x}^{(1)}$ and $\vec{x}^{(2)}$ are linearly independent, they span a plane.
### Matrix-vector products create linear combinations of columns!

- **Question**: What vector in $\text{span}(\vec{x}^{(1)}, \vec{x}^{(2)})$ is closest to $\vec{y}$?
- To help, we can create a matrix, $X$, by stacking $\vec{x}^{(1)}$ and $\vec{x}^{(2)}$ next to each other: $$X = \begin{bmatrix} \vert & \vert \\ \vec{x}^{(1)} & \vec{x}^{(2)} \\ \vert & \vert \end{bmatrix}$$
- Then, instead of writing vectors in $\text{span}(\vec{x}^{(1)}, \vec{x}^{(2)})$ as $w_1 \vec{x}^{(1)} + w_2 \vec{x}^{(2)}$, we can say: $$X\vec{w}, \qquad \text{where } \vec{w} = \begin{bmatrix} w_1 \\ w_2 \end{bmatrix}$$
- **Key idea**: Find $\vec{w}^*$ such that the error vector, $\vec{e} = \vec{y} - X\vec{w}^*$, is orthogonal to every column of $X$.
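As a quick sanity check, using the same made-up vector from the previous snippet plus a second hypothetical vector, NumPy confirms that a matrix-vector product is exactly a linear combination of the matrix's columns:

```python
import numpy as np

# Hypothetical vectors for illustration.
x1 = np.array([2.0, 1.0, 3.0])
x2 = np.array([0.0, 1.0, 1.0])
w = np.array([0.5, -2.0])

# Stack x1 and x2 side by side as the columns of X.
X = np.column_stack([x1, x2])

# X @ w is the same linear combination w1*x1 + w2*x2.
print(X @ w)
print(w[0] * x1 + w[1] * x2)  # identical output
```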
### Constructing an orthogonal error vector

- **Key idea**: Find $\vec{w}^*$ such that the error vector, $\vec{e} = \vec{y} - X\vec{w}^*$, is orthogonal to the columns of $X$.
- Why? Because this will make the error vector as short as possible.
- The $\vec{w}^*$ that accomplishes this satisfies: $$X^T(\vec{y} - X\vec{w}^*) = \vec{0}$$
- Why? Because $X^T(\vec{y} - X\vec{w}^*)$ contains the dot products of each column in $X$ with $\vec{y} - X\vec{w}^*$. If these are all 0, then $\vec{y} - X\vec{w}^*$ is orthogonal to every column of $X$!
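- To see why, note that the rows of $X^T$ are the columns of $X$, so for our two-column matrix the product expands entry by entry as: $$X^T(\vec{y} - X\vec{w}^*) = \begin{bmatrix} \vec{x}^{(1)} \cdot (\vec{y} - X\vec{w}^*) \\ \vec{x}^{(2)} \cdot (\vec{y} - X\vec{w}^*) \end{bmatrix}$$ Setting both entries to 0 is exactly the orthogonality condition from before.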
### The normal equations

- **Key idea**: Find $\vec{w}^*$ such that the error vector, $\vec{e} = \vec{y} - X\vec{w}^*$, is orthogonal to the columns of $X$.
- The $\vec{w}^*$ that accomplishes this satisfies: $$X^T(\vec{y} - X\vec{w}^*) = \vec{0} \implies X^TX\vec{w}^* = X^T\vec{y}$$
- The last statement is referred to as the **normal equations**.
- Assuming $X^TX$ is invertible, this is the vector: $$\vec{w}^* = (X^TX)^{-1}X^T\vec{y}$$
- This is a big assumption, because it requires $X^TX$ to be full rank.
- If $X^TX$ is not full rank, then there are infinitely many solutions to the normal equations, $X^TX\vec{w} = X^T\vec{y}$.
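In code, one hedged way to solve the normal equations (reusing the hypothetical $X$ and $\vec{y}$ from the earlier snippets) is to hand them to a linear solver rather than inverting $X^TX$ explicitly; `np.linalg.lstsq` is a common alternative that also handles the non-full-rank case:

```python
import numpy as np

# Hypothetical data, matching the earlier snippets.
x1 = np.array([2.0, 1.0, 3.0])
x2 = np.array([0.0, 1.0, 1.0])
y = np.array([1.0, 4.0, 2.0])
X = np.column_stack([x1, x2])

# Solve X^T X w* = X^T y directly (assumes X^T X is invertible).
w_star = np.linalg.solve(X.T @ X, X.T @ y)

# lstsq minimizes ||y - Xw|| even when X^T X is not invertible.
w_star_lstsq, *_ = np.linalg.lstsq(X, y, rcond=None)

print(w_star)
print(w_star_lstsq)  # same answer when X has full column rank
```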
### What does it mean?

- **Original question**: What vector in $\text{span}(\vec{x}^{(1)}, \vec{x}^{(2)})$ is closest to $\vec{y}$?
- **Final answer**: Assuming $X^TX$ is invertible, it is the vector $X\vec{w}^*$, where: $$\vec{w}^* = (X^TX)^{-1}X^T\vec{y}$$
- Revisiting our example:
    - Using a computer to evaluate $\vec{w}^* = (X^TX)^{-1}X^T\vec{y}$ gives us the optimal weights.
    - So, the vector in $\text{span}(\vec{x}^{(1)}, \vec{x}^{(2)})$ closest to $\vec{y}$ is $X\vec{w}^*$.
### An optimization problem, solved

- We just used linear algebra to solve an optimization problem.
- Specifically, the function we minimized is: $$f(\vec{w}) = \lVert \vec{y} - X\vec{w} \rVert$$
- This is a function whose input is a vector, $\vec{w}$, and whose output is a scalar!
- The input, $\vec{w}^*$, to $f$ that minimizes it is one that satisfies the normal equations: $$X^TX\vec{w}^* = X^T\vec{y}$$
- If $X^TX$ is invertible, then the unique solution is: $$\vec{w}^* = (X^TX)^{-1}X^T\vec{y}$$
- We're going to use this result frequently!
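A quick, informal numerical check of this claim, again with the hypothetical data from above: random perturbations of $\vec{w}^*$ should never achieve a smaller value of $f$.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical data, as in the earlier snippets.
X = np.column_stack([[2.0, 1.0, 3.0], [0.0, 1.0, 1.0]])
y = np.array([1.0, 4.0, 2.0])

w_star = np.linalg.solve(X.T @ X, X.T @ y)

def f(w):
    # Length of the error vector for a given parameter vector.
    return np.linalg.norm(y - X @ w)

# Random perturbations of w* never do better than w* itself.
perturbed = [f(w_star + rng.normal(scale=0.1, size=2)) for _ in range(1000)]
print(f(w_star) <= min(perturbed))  # True
```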
### Regression and linear algebra

- Wait… why do we need linear algebra?
- Soon, we'll want to make predictions using more than one feature.
    - Example: Predicting commute times using departure hour and temperature.
- Thinking about linear regression in terms of matrices and vectors will allow us to find hypothesis functions that:
    - Use multiple features (input variables).
    - Are non-linear in the features, e.g. $H(x) = w_0 + w_1 x + w_2 x^2$.
- Let's see if we can put what we've just learned to use.
### Simple linear regression, revisited

- Model: $H(x) = w_0 + w_1 x$.
- Loss function: squared loss, $(y_i - H(x_i))^2$.
- To find $w_0^*$ and $w_1^*$, we minimized empirical risk, i.e. average loss: $$R_\text{sq}(w_0, w_1) = \frac{1}{n} \sum_{i=1}^n \left( y_i - (w_0 + w_1 x_i) \right)^2$$
- **Observation**: $R_\text{sq}$ kind of looks like the formula for the norm of a vector, $\lVert \vec{v} \rVert = \sqrt{v_1^2 + v_2^2 + \dots + v_n^2}$.
### Regression and linear algebra

- Let's define a few new terms:
    - The **observation vector** is the vector $\vec{y} = \begin{bmatrix} y_1 \\ y_2 \\ \vdots \\ y_n \end{bmatrix}$. This is the vector of observed "actual values".
    - The **hypothesis vector** is the vector $\vec{h} \in \mathbb{R}^n$ with components $H(x_i)$. This is the vector of predicted values.
    - The **error vector** is the vector $\vec{e} \in \mathbb{R}^n$ with components: $$e_i = y_i - H(x_i)$$
### Example

- Consider a particular hypothesis function $H$ evaluated on a small dataset.
- Note: each component of the error vector is $e_i = y_i - H(x_i)$, so $\vec{e} = \vec{y} - \vec{h}$.
### Regression and linear algebra
- **Key idea**: We can rewrite the mean squared error of $H$ as: $$R_\text{sq}(H) = \frac{1}{n} \sum_{i=1}^n (y_i - H(x_i))^2 = \frac{1}{n} \lVert \vec{e} \rVert^2 = \frac{1}{n} \lVert \vec{y} - \vec{h} \rVert^2$$
### The hypothesis vector

- The **hypothesis vector** is the vector $\vec{h} \in \mathbb{R}^n$ with components $H(x_i)$. This is the vector of predicted values.
- For the linear hypothesis function $H(x) = w_0 + w_1 x$, the hypothesis vector can be written: $$\vec{h} = \begin{bmatrix} w_0 + w_1 x_1 \\ w_0 + w_1 x_2 \\ \vdots \\ w_0 + w_1 x_n \end{bmatrix} = \begin{bmatrix} 1 & x_1 \\ 1 & x_2 \\ \vdots & \vdots \\ 1 & x_n \end{bmatrix} \begin{bmatrix} w_0 \\ w_1 \end{bmatrix}$$
### Rewriting the mean squared error

- Define the **design matrix** $X \in \mathbb{R}^{n \times 2}$ as: $$X = \begin{bmatrix} 1 & x_1 \\ 1 & x_2 \\ \vdots & \vdots \\ 1 & x_n \end{bmatrix}$$
- Define the **parameter vector** $\vec{w} \in \mathbb{R}^2$ to be $\vec{w} = \begin{bmatrix} w_0 \\ w_1 \end{bmatrix}$.
- Then, $\vec{h} = X\vec{w}$, so the mean squared error becomes: $$R_\text{sq}(\vec{w}) = \frac{1}{n} \lVert \vec{y} - \vec{h} \rVert^2 = \frac{1}{n} \lVert \vec{y} - X\vec{w} \rVert^2$$
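Here is a minimal sketch, with a made-up dataset, of how the design matrix and this norm-based mean squared error look in NumPy:

```python
import numpy as np

# Hypothetical dataset: one feature x and observed values y.
x = np.array([1.0, 2.0, 3.0, 4.0])
y = np.array([2.1, 3.9, 6.2, 7.8])

# Design matrix: a column of ones (for the intercept) next to x.
X = np.column_stack([np.ones_like(x), x])

def mse(w):
    # R_sq(w) = (1/n) * ||y - Xw||^2
    return np.linalg.norm(y - X @ w) ** 2 / len(y)

print(mse(np.array([0.0, 2.0])))  # MSE of the hypothesis H(x) = 0 + 2x
```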
### Minimizing mean squared error, again

- To find the optimal model parameters for simple linear regression, $w_0^*$ and $w_1^*$, we previously minimized: $$R_\text{sq}(w_0, w_1) = \frac{1}{n} \sum_{i=1}^n \left( y_i - (w_0 + w_1 x_i) \right)^2$$
- Now that we've reframed the simple linear regression problem in terms of linear algebra, we can find $w_0^*$ and $w_1^*$ by finding the $\vec{w}^*$ that minimizes: $$R_\text{sq}(\vec{w}) = \frac{1}{n} \lVert \vec{y} - X\vec{w} \rVert^2$$
- Do we already know the $\vec{w}^*$ that minimizes $R_\text{sq}(\vec{w})$?
### An optimization problem we've seen before

- The optimal parameter vector, $\vec{w}^*$, is the one that minimizes: $$R_\text{sq}(\vec{w}) = \frac{1}{n} \lVert \vec{y} - X\vec{w} \rVert^2$$
- Previously, we found that $\vec{w}^* = (X^TX)^{-1}X^T\vec{y}$ minimizes the length of the error vector, $f(\vec{w}) = \lVert \vec{y} - X\vec{w} \rVert$.
- $R_\text{sq}(\vec{w})$ is closely related to $f(\vec{w})$: $$R_\text{sq}(\vec{w}) = \frac{1}{n} \left( f(\vec{w}) \right)^2$$
- Since $f(\vec{w}) \geq 0$, squaring it and dividing by $n$ doesn't change which $\vec{w}$ makes it smallest, so the minimizer of $f(\vec{w})$ is the same as the minimizer of $R_\text{sq}(\vec{w})$!
- **Key idea**: $\vec{w}^* = (X^TX)^{-1}X^T\vec{y}$ also minimizes $R_\text{sq}(\vec{w})$!
### The optimal parameter vector, $\vec{w}^*$

- To find the optimal model parameters for simple linear regression, $w_0^*$ and $w_1^*$, we previously minimized $R_\text{sq}(w_0, w_1) = \frac{1}{n} \sum_{i=1}^n (y_i - (w_0 + w_1 x_i))^2$.
- We found, using calculus, that:
    - $w_1^* = \dfrac{\displaystyle\sum_{i=1}^n (x_i - \bar{x})(y_i - \bar{y})}{\displaystyle\sum_{i=1}^n (x_i - \bar{x})^2}$.
    - $w_0^* = \bar{y} - w_1^* \bar{x}$.
- Another way of finding optimal model parameters for simple linear regression is to find the $\vec{w}^*$ that minimizes $R_\text{sq}(\vec{w}) = \frac{1}{n} \lVert \vec{y} - X\vec{w} \rVert^2$.
- The minimizer, if $X^TX$ is invertible, is the vector $\vec{w}^* = (X^TX)^{-1}X^T\vec{y}$.
- These formulas are equivalent!
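The notebook linked in the roadmap below does this comparison on real data; here is a self-contained version of the same check on a made-up dataset:

```python
import numpy as np

# Hypothetical dataset.
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.2, 3.8, 6.1, 8.3, 9.7])

# Calculus formulas for simple linear regression.
w1_calc = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)
w0_calc = y.mean() - w1_calc * x.mean()

# Linear algebra formula: w* = (X^T X)^{-1} X^T y.
X = np.column_stack([np.ones_like(x), x])
w_star = np.linalg.solve(X.T @ X, X.T @ y)

print(w0_calc, w1_calc)  # intercept and slope from calculus
print(w_star)            # same two numbers, from the normal equations
```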
### Roadmap

- To give us a break from math, we'll switch to a notebook, [linked here](http://datahub.ucsd.edu/user-redirect/git-sync?repo=https://github.com/dsc-courses/dsc40a-2024-sp&subPath=lectures/lec08/lec08-code.ipynb), showing that both formulas – that is, (1) the formulas for $w_0^*$ and $w_1^*$ we found using calculus, and (2) the formula for $\vec{w}^*$ we found using linear algebra – give the same results.
- Then, we'll use our new linear algebraic formulation of regression to incorporate multiple features in our prediction process.
### Summary: Regression and linear algebra

- Define the design matrix $X$, observation vector $\vec{y}$, and parameter vector $\vec{w}$ as: $$X = \begin{bmatrix} 1 & x_1 \\ 1 & x_2 \\ \vdots & \vdots \\ 1 & x_n \end{bmatrix} \qquad \vec{y} = \begin{bmatrix} y_1 \\ y_2 \\ \vdots \\ y_n \end{bmatrix} \qquad \vec{w} = \begin{bmatrix} w_0 \\ w_1 \end{bmatrix}$$
- How do we make the hypothesis vector, $\vec{h} = X\vec{w}$, as close to $\vec{y}$ as possible? Use the parameter vector $\vec{w}^*$: $$\vec{w}^* = (X^TX)^{-1}X^T\vec{y}$$
- We chose $\vec{w}^*$ so that $\vec{h}^* = X\vec{w}^*$ is the projection of $\vec{y}$ onto the span of the columns of the design matrix, $X$.
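As a quick numerical illustration of this projection property (reusing the made-up dataset from the earlier MSE sketch), the error vector $\vec{y} - X\vec{w}^*$ should be orthogonal to every column of $X$:

```python
import numpy as np

# Hypothetical dataset, as in the earlier MSE snippet.
x = np.array([1.0, 2.0, 3.0, 4.0])
y = np.array([2.1, 3.9, 6.2, 7.8])
X = np.column_stack([np.ones_like(x), x])

w_star = np.linalg.solve(X.T @ X, X.T @ y)
residual = y - X @ w_star

# Both entries are ~0: the error vector is orthogonal to the columns of X,
# so X w* is the projection of y onto the span of those columns.
print(X.T @ residual)
```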
### Multiple linear regression

- So far, we've fit **simple** linear regression models, which use only one feature (`'departure_hour'`) for making predictions.

### Incorporating multiple features

- In the context of the commute times dataset, the simple linear regression model we fit was of the form: $$H(\text{departure hour}) = w_0 + w_1 \cdot \text{departure hour}$$
- Now, we'll try to fit a multiple linear regression model of the form: $$H(\text{departure hour}, \text{day}) = w_0 + w_1 \cdot \text{departure hour} + w_2 \cdot \text{day}$$
- Linear regression with multiple features is called **multiple linear regression**.
- How do we find $w_0^*$, $w_1^*$, and $w_2^*$?
### Geometric interpretation

- The hypothesis function $H(\text{departure hour}) = w_0 + w_1 \cdot \text{departure hour}$ looks like a line in 2D.
- **Questions**:
    - How many dimensions do we need to graph the hypothesis function $H(\text{departure hour}, \text{day}) = w_0 + w_1 \cdot \text{departure hour} + w_2 \cdot \text{day}$?
    - What is the shape of the hypothesis function?
- Our new hypothesis function is a **plane** in 3D!
### The setup

- Suppose we have a dataset of $n$ commutes, where for each commute we record the departure hour, the day, and the commute time.
- We can represent each day with a feature vector, $\vec{x}_i$: $$\vec{x}_i = \begin{bmatrix} \text{departure hour}_i \\ \text{day}_i \end{bmatrix}$$
### The hypothesis vector

- When our hypothesis function is of the form: $$H(\text{departure hour}, \text{day}) = w_0 + w_1 \cdot \text{departure hour} + w_2 \cdot \text{day}$$ the hypothesis vector $\vec{h} \in \mathbb{R}^n$ can be written as:
$$\vec{h} = \begin{bmatrix} H(\text{departure hour}_1, \text{day}_1) \\ H(\text{departure hour}_2, \text{day}_2) \\ \vdots \\ H(\text{departure hour}_n, \text{day}_n) \end{bmatrix} = \begin{bmatrix} 1 & \text{departure hour}_1 & \text{day}_1 \\ 1 & \text{departure hour}_2 & \text{day}_2 \\ \vdots & \vdots & \vdots \\ 1 & \text{departure hour}_n & \text{day}_n \end{bmatrix} \begin{bmatrix} w_0 \\ w_1 \\ w_2 \end{bmatrix}$$

### Finding the optimal parameters

- To find the optimal parameter vector, $\vec{w}^*$, we can use the **design matrix** $\color{#007aff} X \in \mathbb{R}^{n \times 3}$ and **observation vector** $\color{orange} \vec y \in \mathbb{R}^n$: $${\color{#007aff} X = \begin{bmatrix} 1 & \text{departure hour}_1 & \text{day}_1 \\ 1 & \text{departure hour}_2 & \text{day}_2 \\ \vdots & \vdots & \vdots \\ 1 & \text{departure hour}_n & \text{day}_n \end{bmatrix}} \qquad {\color{orange} \vec{y} = \begin{bmatrix} \text{commute time}_1 \\ \text{commute time}_2 \\ \vdots \\ \text{commute time}_n \end{bmatrix}}$$
- Then, all we need to do is solve the **normal equations**: $${\color{#007aff} X^TX} \vec{w}^* = {\color{#007aff} X^T} {\color{orange} \vec y}$$
- If ${\color{#007aff} X^TX}$ is invertible, we know the solution is: $$\vec{w}^* = ({\color{#007aff}{X^TX}})^{-1} {\color{#007aff}{X^T}}\color{orange}\vec{y}$$

### Roadmap

- To wrap up today's lecture, we'll find the optimal parameter vector $\vec{w}^*$ for our new two-feature model in code. We'll switch back to our notebook, [linked here](http://datahub.ucsd.edu/user-redirect/git-sync?repo=https://github.com/dsc-courses/dsc40a-2024-sp&subPath=lectures/lec08/lec08-code.ipynb).
- Next class, we'll present a more general framing of the multiple linear regression model, one that uses $d$ features instead of just two.
- We'll also look at how we can **engineer** new features using existing features.
    - e.g. How can we fit a hypothesis function of the form $H(x) = w_0 + w_1 x + w_2 x^2$?
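As a preview of that computation, here is a minimal sketch with made-up departure hours, days, and commute times standing in for the real dataset; it builds the three-column design matrix above and solves the normal equations with NumPy:

```python
import numpy as np

# Hypothetical stand-ins for the real commute times data.
departure_hour = np.array([8.5, 9.0, 7.75, 10.0, 8.0])
day = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
commute_time = np.array([75.0, 72.0, 85.0, 60.0, 80.0])

# Design matrix: intercept column, departure hour, day.
X = np.column_stack([np.ones_like(departure_hour), departure_hour, day])
y = commute_time

# Solve the normal equations X^T X w* = X^T y.
w_star = np.linalg.solve(X.T @ X, X.T @ y)
w0, w1, w2 = w_star

print(w_star)                     # [w0*, w1*, w2*]
print(w0 + w1 * 8.25 + w2 * 3.0)  # predicted commute time for a new day
```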