- The answer is the vector $w \vec{x}$, where the scalar $w$ is chosen to minimize the length of the error vector: $$\vec{e} = \vec{y} - w\vec{x}$$
- **Key idea**: To minimize the length of the error vector, choose $w$ so that the error vector is orthogonal to $\vec{x}$.
- **Answer**: It is the vector $w^* \vec{x}$, where: $$w^* = \frac{\vec{x} \cdot \vec{y}}{\vec{x} \cdot \vec{x}}$$
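To make this concrete, here is a minimal NumPy sketch, using made-up vectors $\vec{x}$ and $\vec{y}$, that computes $w^*$ and checks that the resulting error vector is orthogonal to $\vec{x}$:

```python
import numpy as np

# Hypothetical vectors; any x and y of the same length work.
x = np.array([2.0, 1.0, 3.0])
y = np.array([1.0, 4.0, 2.0])

# w* = (x . y) / (x . x), the scalar that minimizes ||y - w x||.
w_star = np.dot(x, y) / np.dot(x, x)

# The closest vector in span(x) to y, and the error vector.
projection = w_star * x
error = y - projection

print(w_star)            # optimal scalar
print(np.dot(x, error))  # ~0: the error vector is orthogonal to x
```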
### Projecting onto the span of multiple vectors

- **Question**: What vector in $\text{span}(\vec{x}^{(1)}, \vec{x}^{(2)})$ is closest to $\vec{y}$?
- The answer is the vector $w_1 \vec{x}^{(1)} + w_2 \vec{x}^{(2)}$, where $w_1$ and $w_2$ are chosen to minimize the length of the error vector: $$\vec{e} = \vec{y} - w_1 \vec{x}^{(1)} - w_2 \vec{x}^{(2)}$$
- **Key idea**: To minimize the length of the error vector, choose $w_1$ and $w_2$ so that the error vector is orthogonal to both $\vec{x}^{(1)}$ and $\vec{x}^{(2)}$.
- If $\vec{x}^{(1)}$ and $\vec{x}^{(2)}$ are linearly independent, they span a plane.
### Matrix-vector products create linear combinations of columns!

- **Question**: What vector in $\text{span}(\vec{x}^{(1)}, \vec{x}^{(2)})$ is closest to $\vec{y}$?
- To help, we can create a matrix, $X$, by stacking $\vec{x}^{(1)}$ and $\vec{x}^{(2)}$ next to each other: $$X = \begin{bmatrix} \vert & \vert \\ \vec{x}^{(1)} & \vec{x}^{(2)} \\ \vert & \vert \end{bmatrix}$$
- Then, instead of writing vectors in $\text{span}(\vec{x}^{(1)}, \vec{x}^{(2)})$ as $w_1 \vec{x}^{(1)} + w_2 \vec{x}^{(2)}$, we can say: $$X\vec{w}, \qquad \text{where } \vec{w} = \begin{bmatrix} w_1 \\ w_2 \end{bmatrix}$$
- **Key idea**: Find $\vec{w}^*$ such that the error vector, $\vec{e} = \vec{y} - X\vec{w}^*$, is orthogonal to every column of $X$.
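As a quick sanity check, using the same made-up vector from the previous snippet plus a second hypothetical vector, NumPy confirms that a matrix-vector product is exactly a linear combination of the matrix's columns:

```python
import numpy as np

# Hypothetical vectors for illustration.
x1 = np.array([2.0, 1.0, 3.0])
x2 = np.array([0.0, 1.0, 1.0])
w = np.array([0.5, -2.0])

# Stack x1 and x2 side by side as the columns of X.
X = np.column_stack([x1, x2])

# X @ w is the same linear combination w1*x1 + w2*x2.
print(X @ w)
print(w[0] * x1 + w[1] * x2)  # identical output
```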
### Constructing an orthogonal error vector

- **Key idea**: Find $\vec{w}^*$ such that the error vector, $\vec{e} = \vec{y} - X\vec{w}^*$, is orthogonal to the columns of $X$.
- Why? Because this will make the error vector as short as possible.
- The $\vec{w}^*$ that accomplishes this satisfies: $$X^T(\vec{y} - X\vec{w}^*) = \vec{0}$$
- Why? Because $X^T(\vec{y} - X\vec{w}^*)$ contains the dot products of each column in $X$ with $\vec{y} - X\vec{w}^*$. If these are all 0, then $\vec{y} - X\vec{w}^*$ is orthogonal to every column of $X$!
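- To see why, note that the rows of $X^T$ are the columns of $X$, so for our two-column matrix the product expands entry by entry as: $$X^T(\vec{y} - X\vec{w}^*) = \begin{bmatrix} \vec{x}^{(1)} \cdot (\vec{y} - X\vec{w}^*) \\ \vec{x}^{(2)} \cdot (\vec{y} - X\vec{w}^*) \end{bmatrix}$$ Setting both entries to 0 is exactly the orthogonality condition from before.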
### The normal equations

- **Key idea**: Find $\vec{w}^*$ such that the error vector, $\vec{e} = \vec{y} - X\vec{w}^*$, is orthogonal to the columns of $X$.
- The $\vec{w}^*$ that accomplishes this satisfies: $$X^T(\vec{y} - X\vec{w}^*) = \vec{0} \implies X^TX\vec{w}^* = X^T\vec{y}$$
- The last statement is referred to as the **normal equations**.
- Assuming $X^TX$ is invertible, this is the vector: $$\vec{w}^* = (X^TX)^{-1}X^T\vec{y}$$
- This is a big assumption, because it requires $X^TX$ to be full rank.
- If $X^TX$ is not full rank, then there are infinitely many solutions to the normal equations, $X^TX\vec{w} = X^T\vec{y}$.
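In code, one hedged way to solve the normal equations (reusing the hypothetical $X$ and $\vec{y}$ from the earlier snippets) is to hand them to a linear solver rather than inverting $X^TX$ explicitly; `np.linalg.lstsq` is a common alternative that also handles the non-full-rank case:

```python
import numpy as np

# Hypothetical data, matching the earlier snippets.
x1 = np.array([2.0, 1.0, 3.0])
x2 = np.array([0.0, 1.0, 1.0])
y = np.array([1.0, 4.0, 2.0])
X = np.column_stack([x1, x2])

# Solve X^T X w* = X^T y directly (assumes X^T X is invertible).
w_star = np.linalg.solve(X.T @ X, X.T @ y)

# lstsq minimizes ||y - Xw|| even when X^T X is not invertible.
w_star_lstsq, *_ = np.linalg.lstsq(X, y, rcond=None)

print(w_star)
print(w_star_lstsq)  # same answer when X has full column rank
```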
### What does it mean?

- **Original question**: What vector in $\text{span}(\vec{x}^{(1)}, \vec{x}^{(2)})$ is closest to $\vec{y}$?
- **Final answer**: Assuming $X^TX$ is invertible, it is the vector $X\vec{w}^*$, where: $$\vec{w}^* = (X^TX)^{-1}X^T\vec{y}$$
- Revisiting our example:
    - Using a computer to evaluate $\vec{w}^* = (X^TX)^{-1}X^T\vec{y}$ gives us the optimal weights.
    - So, the vector in $\text{span}(\vec{x}^{(1)}, \vec{x}^{(2)})$ closest to $\vec{y}$ is $X\vec{w}^*$.
### An optimization problem, solved

- We just used linear algebra to solve an optimization problem.
- Specifically, the function we minimized is: $$f(\vec{w}) = \lVert \vec{y} - X\vec{w} \rVert$$
- This is a function whose input is a vector, $\vec{w}$, and whose output is a scalar!
- The input, $\vec{w}^*$, to $f$ that minimizes it is one that satisfies the normal equations: $$X^TX\vec{w}^* = X^T\vec{y}$$
- If $X^TX$ is invertible, then the unique solution is: $$\vec{w}^* = (X^TX)^{-1}X^T\vec{y}$$
- We're going to use this result frequently!
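A quick, informal numerical check of this claim, again with the hypothetical data from above: random perturbations of $\vec{w}^*$ should never achieve a smaller value of $f$.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical data, as in the earlier snippets.
X = np.column_stack([[2.0, 1.0, 3.0], [0.0, 1.0, 1.0]])
y = np.array([1.0, 4.0, 2.0])

w_star = np.linalg.solve(X.T @ X, X.T @ y)

def f(w):
    # Length of the error vector for a given parameter vector.
    return np.linalg.norm(y - X @ w)

# Random perturbations of w* never do better than w* itself.
perturbed = [f(w_star + rng.normal(scale=0.1, size=2)) for _ in range(1000)]
print(f(w_star) <= min(perturbed))  # True
```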
### Regression and linear algebra

- Wait… why do we need linear algebra?
- Soon, we'll want to make predictions using more than one feature.
    - Example: Predicting commute times using departure hour and temperature.
- Thinking about linear regression in terms of matrices and vectors will allow us to find hypothesis functions that:
    - Use multiple features (input variables).
    - Are non-linear in the features, e.g. $H(x) = w_0 + w_1 x + w_2 x^2$.
- Let's see if we can put what we've just learned to use.
### Simple linear regression, revisited

- Model: $H(x) = w_0 + w_1 x$.
- Loss function: squared loss, $(y_i - H(x_i))^2$.
- To find $w_0^*$ and $w_1^*$, we minimized empirical risk, i.e. average loss: $$R_\text{sq}(w_0, w_1) = \frac{1}{n} \sum_{i=1}^n \left( y_i - (w_0 + w_1 x_i) \right)^2$$
- **Observation**: $R_\text{sq}$ kind of looks like the formula for the norm of a vector, $\lVert \vec{v} \rVert = \sqrt{v_1^2 + v_2^2 + \dots + v_n^2}$.
### Regression and linear algebra

- Let's define a few new terms:
    - The **observation vector** is the vector $\vec{y} = \begin{bmatrix} y_1 \\ y_2 \\ \vdots \\ y_n \end{bmatrix}$. This is the vector of observed "actual values".
    - The **hypothesis vector** is the vector $\vec{h} \in \mathbb{R}^n$ with components $H(x_i)$. This is the vector of predicted values.
    - The **error vector** is the vector $\vec{e} \in \mathbb{R}^n$ with components: $$e_i = y_i - H(x_i)$$
### Example

- Consider a particular hypothesis function $H$ evaluated on a small dataset.
- Note: each component of the error vector is $e_i = y_i - H(x_i)$, so $\vec{e} = \vec{y} - \vec{h}$.
### Regression and linear algebra
- **Key idea**: We can rewrite the mean squared error of $H$ as: $$R_\text{sq}(H) = \frac{1}{n} \sum_{i=1}^n (y_i - H(x_i))^2 = \frac{1}{n} \lVert \vec{e} \rVert^2 = \frac{1}{n} \lVert \vec{y} - \vec{h} \rVert^2$$
### The hypothesis vector

- The **hypothesis vector** is the vector $\vec{h} \in \mathbb{R}^n$ with components $H(x_i)$. This is the vector of predicted values.
- For the linear hypothesis function $H(x) = w_0 + w_1 x$, the hypothesis vector can be written: $$\vec{h} = \begin{bmatrix} w_0 + w_1 x_1 \\ w_0 + w_1 x_2 \\ \vdots \\ w_0 + w_1 x_n \end{bmatrix} = \begin{bmatrix} 1 & x_1 \\ 1 & x_2 \\ \vdots & \vdots \\ 1 & x_n \end{bmatrix} \begin{bmatrix} w_0 \\ w_1 \end{bmatrix}$$
### Rewriting the mean squared error

- Define the **design matrix** $X \in \mathbb{R}^{n \times 2}$ as: $$X = \begin{bmatrix} 1 & x_1 \\ 1 & x_2 \\ \vdots & \vdots \\ 1 & x_n \end{bmatrix}$$
- Define the **parameter vector** $\vec{w} \in \mathbb{R}^2$ to be $\vec{w} = \begin{bmatrix} w_0 \\ w_1 \end{bmatrix}$.
- Then, $\vec{h} = X\vec{w}$, so the mean squared error becomes: $$R_\text{sq}(\vec{w}) = \frac{1}{n} \lVert \vec{y} - \vec{h} \rVert^2 = \frac{1}{n} \lVert \vec{y} - X\vec{w} \rVert^2$$
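Here is a minimal sketch, with a made-up dataset, of how the design matrix and this norm-based mean squared error look in NumPy:

```python
import numpy as np

# Hypothetical dataset: one feature x and observed values y.
x = np.array([1.0, 2.0, 3.0, 4.0])
y = np.array([2.1, 3.9, 6.2, 7.8])

# Design matrix: a column of ones (for the intercept) next to x.
X = np.column_stack([np.ones_like(x), x])

def mse(w):
    # R_sq(w) = (1/n) * ||y - Xw||^2
    return np.linalg.norm(y - X @ w) ** 2 / len(y)

print(mse(np.array([0.0, 2.0])))  # MSE of the hypothesis H(x) = 0 + 2x
```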
### Minimizing mean squared error, again

- To find the optimal model parameters for simple linear regression, $w_0^*$ and $w_1^*$, we previously minimized: $$R_\text{sq}(w_0, w_1) = \frac{1}{n} \sum_{i=1}^n \left( y_i - (w_0 + w_1 x_i) \right)^2$$
- Now that we've reframed the simple linear regression problem in terms of linear algebra, we can find $w_0^*$ and $w_1^*$ by finding the $\vec{w}^*$ that minimizes: $$R_\text{sq}(\vec{w}) = \frac{1}{n} \lVert \vec{y} - X\vec{w} \rVert^2$$
- Do we already know the $\vec{w}^*$ that minimizes $R_\text{sq}(\vec{w})$?
### An optimization problem we've seen before

- The optimal parameter vector, $\vec{w}^*$, is the one that minimizes: $$R_\text{sq}(\vec{w}) = \frac{1}{n} \lVert \vec{y} - X\vec{w} \rVert^2$$
- Previously, we found that $\vec{w}^* = (X^TX)^{-1}X^T\vec{y}$ minimizes the length of the error vector, $f(\vec{w}) = \lVert \vec{y} - X\vec{w} \rVert$.
- $R_\text{sq}(\vec{w})$ is closely related to $f(\vec{w})$: $$R_\text{sq}(\vec{w}) = \frac{1}{n} \left( f(\vec{w}) \right)^2$$
- Since $f(\vec{w}) \geq 0$, squaring it and dividing by $n$ doesn't change which $\vec{w}$ makes it smallest, so the minimizer of $f(\vec{w})$ is the same as the minimizer of $R_\text{sq}(\vec{w})$!
- **Key idea**: $\vec{w}^* = (X^TX)^{-1}X^T\vec{y}$ also minimizes $R_\text{sq}(\vec{w})$!
### The optimal parameter vector, $\vec{w}^*$

- To find the optimal model parameters for simple linear regression, $w_0^*$ and $w_1^*$, we previously minimized $R_\text{sq}(w_0, w_1) = \frac{1}{n} \sum_{i=1}^n (y_i - (w_0 + w_1 x_i))^2$.
- We found, using calculus, that:
    - $w_1^* = \dfrac{\displaystyle\sum_{i=1}^n (x_i - \bar{x})(y_i - \bar{y})}{\displaystyle\sum_{i=1}^n (x_i - \bar{x})^2}$.
    - $w_0^* = \bar{y} - w_1^* \bar{x}$.
- Another way of finding optimal model parameters for simple linear regression is to find the $\vec{w}^*$ that minimizes $R_\text{sq}(\vec{w}) = \frac{1}{n} \lVert \vec{y} - X\vec{w} \rVert^2$.
- The minimizer, if $X^TX$ is invertible, is the vector $\vec{w}^* = (X^TX)^{-1}X^T\vec{y}$.
- These formulas are equivalent!
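The notebook linked in the roadmap below does this comparison on real data; here is a self-contained version of the same check on a made-up dataset:

```python
import numpy as np

# Hypothetical dataset.
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.2, 3.8, 6.1, 8.3, 9.7])

# Calculus formulas for simple linear regression.
w1_calc = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)
w0_calc = y.mean() - w1_calc * x.mean()

# Linear algebra formula: w* = (X^T X)^{-1} X^T y.
X = np.column_stack([np.ones_like(x), x])
w_star = np.linalg.solve(X.T @ X, X.T @ y)

print(w0_calc, w1_calc)  # intercept and slope from calculus
print(w_star)            # same two numbers, from the normal equations
```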
### Roadmap

- To give us a break from math, we'll switch to a notebook, [linked here](http://datahub.ucsd.edu/user-redirect/git-sync?repo=https://github.com/dsc-courses/dsc40a-2024-sp&subPath=lectures/lec08/lec08-code.ipynb), showing that both formulas – that is, (1) the formulas for $w_0^*$ and $w_1^*$ we found using calculus, and (2) the formula for $\vec{w}^*$ we found using linear algebra – give the same results.
- Then, we'll use our new linear algebraic formulation of regression to incorporate multiple features in our prediction process.
### Summary: Regression and linear algebra

- Define the design matrix $X$, observation vector $\vec{y}$, and parameter vector $\vec{w}$ as: $$X = \begin{bmatrix} 1 & x_1 \\ 1 & x_2 \\ \vdots & \vdots \\ 1 & x_n \end{bmatrix} \qquad \vec{y} = \begin{bmatrix} y_1 \\ y_2 \\ \vdots \\ y_n \end{bmatrix} \qquad \vec{w} = \begin{bmatrix} w_0 \\ w_1 \end{bmatrix}$$
- How do we make the hypothesis vector, $\vec{h} = X\vec{w}$, as close to $\vec{y}$ as possible? Use the parameter vector $\vec{w}^*$: $$\vec{w}^* = (X^TX)^{-1}X^T\vec{y}$$
- We chose $\vec{w}^*$ so that $\vec{h}^* = X\vec{w}^*$ is the projection of $\vec{y}$ onto the span of the columns of the design matrix, $X$.
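As a quick numerical illustration of this projection property (reusing the made-up dataset from the earlier MSE sketch), the error vector $\vec{y} - X\vec{w}^*$ should be orthogonal to every column of $X$:

```python
import numpy as np

# Hypothetical dataset, as in the earlier MSE snippet.
x = np.array([1.0, 2.0, 3.0, 4.0])
y = np.array([2.1, 3.9, 6.2, 7.8])
X = np.column_stack([np.ones_like(x), x])

w_star = np.linalg.solve(X.T @ X, X.T @ y)
residual = y - X @ w_star

# Both entries are ~0: the error vector is orthogonal to the columns of X,
# so X w* is the projection of y onto the span of those columns.
print(X.T @ residual)
```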
### Multiple linear regression

- So far, we've fit **simple** linear regression models, which use only one feature (`'departure_hour'`) for making predictions.

### Incorporating multiple features

- In the context of the commute times dataset, the simple linear regression model we fit was of the form: $$H(\text{departure hour}) = w_0 + w_1 \cdot \text{departure hour}$$
- Now, we'll try to fit a multiple linear regression model of the form: $$H(\text{departure hour}, \text{day}) = w_0 + w_1 \cdot \text{departure hour} + w_2 \cdot \text{day}$$
- Linear regression with multiple features is called **multiple linear regression**.
- How do we find $w_0^*$, $w_1^*$, and $w_2^*$?
### Geometric interpretation

- The hypothesis function $H(\text{departure hour}) = w_0 + w_1 \cdot \text{departure hour}$ looks like a line in 2D.
- **Questions**:
    - How many dimensions do we need to graph the hypothesis function $H(\text{departure hour}, \text{day}) = w_0 + w_1 \cdot \text{departure hour} + w_2 \cdot \text{day}$?
    - What is the shape of the hypothesis function?
- Our new hypothesis function is a **plane** in 3D!
### The setup

- Suppose we have a dataset of $n$ commutes, where for each commute we record the departure hour, the day, and the commute time.
- We can represent each day with a feature vector, $\vec{x}_i$: $$\vec{x}_i = \begin{bmatrix} \text{departure hour}_i \\ \text{day}_i \end{bmatrix}$$
### The hypothesis vector

- When our hypothesis function is of the form: $$H(\text{departure hour}, \text{day}) = w_0 + w_1 \cdot \text{departure hour} + w_2 \cdot \text{day}$$ the hypothesis vector $\vec{h} \in \mathbb{R}^n$ can be written as:
$$\vec{h} = \begin{bmatrix} H(\text{departure hour}_1, \text{day}_1) \\ H(\text{departure hour}_2, \text{day}_2) \\ \vdots \\ H(\text{departure hour}_n, \text{day}_n) \end{bmatrix} = \begin{bmatrix} 1 & \text{departure hour}_1 & \text{day}_1 \\ 1 & \text{departure hour}_2 & \text{day}_2 \\ \vdots & \vdots & \vdots \\ 1 & \text{departure hour}_n & \text{day}_n \end{bmatrix} \begin{bmatrix} w_0 \\ w_1 \\ w_2 \end{bmatrix}$$

### Finding the optimal parameters

- To find the optimal parameter vector, $\vec{w}^*$, we can use the **design matrix** $\color{#007aff} X \in \mathbb{R}^{n \times 3}$ and **observation vector** $\color{orange} \vec y \in \mathbb{R}^n$: $${\color{#007aff} X = \begin{bmatrix} 1 & \text{departure hour}_1 & \text{day}_1 \\ 1 & \text{departure hour}_2 & \text{day}_2 \\ \vdots & \vdots & \vdots \\ 1 & \text{departure hour}_n & \text{day}_n \end{bmatrix}} \qquad {\color{orange} \vec{y} = \begin{bmatrix} \text{commute time}_1 \\ \text{commute time}_2 \\ \vdots \\ \text{commute time}_n \end{bmatrix}}$$
- Then, all we need to do is solve the **normal equations**: $${\color{#007aff} X^TX} \vec{w}^* = {\color{#007aff} X^T} {\color{orange} \vec y}$$
- If ${\color{#007aff} X^TX}$ is invertible, we know the solution is: $$\vec{w}^* = ({\color{#007aff}{X^TX}})^{-1} {\color{#007aff}{X^T}}\color{orange}\vec{y}$$

### Roadmap

- To wrap up today's lecture, we'll find the optimal parameter vector $\vec{w}^*$ for our new two-feature model in code. We'll switch back to our notebook, [linked here](http://datahub.ucsd.edu/user-redirect/git-sync?repo=https://github.com/dsc-courses/dsc40a-2024-sp&subPath=lectures/lec08/lec08-code.ipynb).
- Next class, we'll present a more general framing of the multiple linear regression model, one that uses $d$ features instead of just two.
- We'll also look at how we can **engineer** new features using existing features.
    - e.g. How can we fit a hypothesis function of the form $H(x) = w_0 + w_1 x + w_2 x^2$?
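As a preview of that computation, here is a minimal sketch with made-up departure hours, days, and commute times standing in for the real dataset; it builds the three-column design matrix above and solves the normal equations with NumPy:

```python
import numpy as np

# Hypothetical stand-ins for the real commute times data.
departure_hour = np.array([8.5, 9.0, 7.75, 10.0, 8.0])
day = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
commute_time = np.array([75.0, 72.0, 85.0, 60.0, 80.0])

# Design matrix: intercept column, departure hour, day.
X = np.column_stack([np.ones_like(departure_hour), departure_hour, day])
y = commute_time

# Solve the normal equations X^T X w* = X^T y.
w_star = np.linalg.solve(X.T @ X, X.T @ y)
w0, w1, w2 = w_star

print(w_star)                     # [w0*, w1*, w2*]
print(w0 + w1 * 8.25 + w2 * 3.0)  # predicted commute time for a new day
```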