Source: https://dsc40a.com/resources/lectures/lec08/lec08-filled.pdf

Regression and Linear Algebra

Overview: Spans and Projections


Projecting onto the span of a single vector

  • Question: What vector in $\text{span}(\vec{x})$ is closest to $\vec{y}$?
  • The answer is the vector $w\vec{x}$, where the scalar $w$ is chosen to minimize the length of the error vector: $\vec{e} = \vec{y} - w\vec{x}$
  • Key idea: To minimize the length of the error vector, choose $w$ so that the error vector is orthogonal to $\vec{x}$.
  • Answer: It is the vector $w^*\vec{x}$, where (see the short sketch below): $w^* = \dfrac{\vec{x} \cdot \vec{y}}{\vec{x} \cdot \vec{x}}$
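As a concrete check of this formula, here is a minimal NumPy sketch; the vectors x and y below are made-up examples, not values from the lecture.

```python
import numpy as np

# Made-up example vectors (not from the lecture).
x = np.array([2.0, 1.0])
y = np.array([3.0, 4.0])

# w* = (x . y) / (x . x) makes the error vector orthogonal to x.
w_star = np.dot(x, y) / np.dot(x, x)
projection = w_star * x          # closest vector in span(x) to y
error = y - projection

print(w_star)                    # 2.0
print(projection)                # [4. 2.]
print(np.dot(error, x))          # 0.0, confirming orthogonality
```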

Projecting onto the span of multiple vectors

  • Question: What vector in $\text{span}(\vec{x}^{(1)}, \vec{x}^{(2)})$ is closest to $\vec{y}$?
  • The answer is the vector $w_1\vec{x}^{(1)} + w_2\vec{x}^{(2)}$, where $w_1$ and $w_2$ are chosen to minimize the length of the error vector: $\vec{e} = \vec{y} - w_1\vec{x}^{(1)} - w_2\vec{x}^{(2)}$
  • Key idea: To minimize the length of the error vector, choose $w_1$ and $w_2$ so that the error vector is orthogonal to both $\vec{x}^{(1)}$ and $\vec{x}^{(2)}$.

If $\vec{x}^{(1)}$ and $\vec{x}^{(2)}$ are linearly independent, they span a plane.

Matrix-vector products create linear combinations of columns!

  • Question: What vector in $\text{span}(\vec{x}^{(1)}, \vec{x}^{(2)})$ is closest to $\vec{y}$?
  • To help, we can create a matrix, $X$, by stacking $\vec{x}^{(1)}$ and $\vec{x}^{(2)}$ next to each other:

$$X = \begin{bmatrix} \vert & \vert \\ \vec{x}^{(1)} & \vec{x}^{(2)} \\ \vert & \vert \end{bmatrix}$$

  • Then, instead of writing vectors in $\text{span}(\vec{x}^{(1)}, \vec{x}^{(2)})$ as $w_1\vec{x}^{(1)} + w_2\vec{x}^{(2)}$, we can say $X\vec{w}$, where $\vec{w} = \begin{bmatrix} w_1 \\ w_2 \end{bmatrix}$.
  • Key idea: Find $\vec{w}$ such that the error vector, $\vec{e} = \vec{y} - X\vec{w}$, is orthogonal to every column of $X$. (A small numerical illustration follows below.)
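To make the slide's title concrete, the sketch below verifies numerically that $X\vec{w}$ is the same linear combination $w_1\vec{x}^{(1)} + w_2\vec{x}^{(2)}$; the vectors and weights are made up.

```python
import numpy as np

# Made-up, linearly independent columns (not from the lecture).
x1 = np.array([1.0, 0.0, 2.0])
x2 = np.array([0.0, 1.0, 1.0])

# Stack x1 and x2 side by side to form X.
X = np.column_stack([x1, x2])
w = np.array([3.0, -2.0])

# A matrix-vector product is a linear combination of the matrix's columns.
print(X @ w)                     # [ 3. -2.  4.]
print(w[0] * x1 + w[1] * x2)     # same result
```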

Constructing an orthogonal error vector

  • Key idea: Find $\vec{w}$ such that the error vector, $\vec{e} = \vec{y} - X\vec{w}$, is orthogonal to the columns of $X$.
  • Why? Because this will make the error vector as short as possible.
  • The $\vec{w}^*$ that accomplishes this satisfies: $X^T(\vec{y} - X\vec{w}^*) = \vec{0}$
  • Why? Because $X^T(\vec{y} - X\vec{w}^*)$ contains the dot products of each column in $X$ with the error vector $\vec{y} - X\vec{w}^*$. If these are all 0, then $\vec{y} - X\vec{w}^*$ is orthogonal to every column of $X$!

The normal equations

  • Key idea: Find $\vec{w}$ such that the error vector, $\vec{e} = \vec{y} - X\vec{w}$, is orthogonal to the columns of $X$.

  • The $\vec{w}^*$ that accomplishes this satisfies:

$$X^T(\vec{y} - X\vec{w}^*) = \vec{0} \quad \Longrightarrow \quad X^TX\vec{w}^* = X^T\vec{y}$$

  • The last statement is referred to as the normal equations. (A short code sketch of solving them appears after this list.)

  • Assuming $X^TX$ is invertible, this is the vector:

$$\vec{w}^* = (X^TX)^{-1}X^T\vec{y}$$

  • This is a big assumption, because it requires $X^TX$ to be full rank.

  • If $X^TX$ is not full rank, then there are infinitely many solutions to the normal equations, $X^TX\vec{w}^* = X^T\vec{y}$.
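As mentioned above, here is a minimal NumPy sketch of solving the normal equations, assuming $X^TX$ is invertible; the data are made up. Using np.linalg.solve avoids explicitly forming the inverse of $X^TX$.

```python
import numpy as np

# Made-up data (not from the lecture): X has two linearly independent columns.
X = np.array([[1.0, 2.0],
              [1.0, 5.0],
              [1.0, 7.0]])
y = np.array([3.0, 4.0, 8.0])

# Solve the normal equations X^T X w* = X^T y.
w_star = np.linalg.solve(X.T @ X, X.T @ y)

# The resulting error vector is orthogonal to every column of X.
error = y - X @ w_star
print(w_star)
print(X.T @ error)   # approximately [0, 0]
```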

What does it mean?

  • Original question: What vector in $\text{span}(\vec{x}^{(1)}, \vec{x}^{(2)})$ is closest to $\vec{y}$?
  • Final answer: Assuming $X^TX$ is invertible, it is the vector $X\vec{w}^*$, where: $\vec{w}^* = (X^TX)^{-1}X^T\vec{y}$
  • Revisiting our example: using a computer to solve the normal equations gives us $\vec{w}^*$.
  • So, the vector in $\text{span}(\vec{x}^{(1)}, \vec{x}^{(2)})$ closest to $\vec{y}$ is $X\vec{w}^*$.

An optimization problem, solved

  • We just used linear algebra to solve an optimization problem.
  • Specifically, the function we minimized is: $\text{error}(\vec{w}) = \lVert \vec{y} - X\vec{w} \rVert$
  • This is a function whose input is a vector, $\vec{w}$, and whose output is a scalar!
  • The input, $\vec{w}^*$, to $\text{error}(\vec{w})$ that minimizes it is one that satisfies the normal equations: $X^TX\vec{w}^* = X^T\vec{y}$

    If $X^TX$ is invertible, then the unique solution is: $\vec{w}^* = (X^TX)^{-1}X^T\vec{y}$
  • We’re going to use this frequently!

Regression and linear algebra

Wait… why do we need linear algebra?

  • Soon, we’ll want to make predictions using more than one feature.
  • Example: Predicting commute times using departure hour and temperature.
  • Thinking about linear regression in terms of matrices and vectors will allow us to find hypothesis functions that:
  • Use multiple features (input variables).
  • Are non-linear in the features, e.g. $H(x) = w_0 + w_1 x + w_2 x^2$.
  • Let’s see if we can put what we’ve just learned to use.

Simple linear regression, revisited

  • Model: $H(x) = w_0 + w_1 x$.
  • Loss function: $(y_i - H(x_i))^2$.
  • To find $w_0^*$ and $w_1^*$, we minimized empirical risk, i.e. average loss: $R_{sq}(w_0, w_1) = \dfrac{1}{n} \sum_{i=1}^n \big(y_i - (w_0 + w_1 x_i)\big)^2$
  • Observation: $R_{sq}$ kind of looks like the formula for the norm of a vector, $\lVert \vec{v} \rVert = \sqrt{v_1^2 + v_2^2 + \dots + v_n^2}$.

Regression and linear algebra

Let’s define a few new terms:

  • The observation vector is the vector $\vec{y} = \begin{bmatrix} y_1 \\ y_2 \\ \vdots \\ y_n \end{bmatrix}$. This is the vector of observed “actual values”.
  • The hypothesis vector is the vector $\vec{h} \in \mathbb{R}^n$ with components $H(x_i)$. This is the vector of predicted values.
  • The error vector is the vector $\vec{e} \in \mathbb{R}^n$ with components: $e_i = y_i - H(x_i)$


  • Key idea: We can rewrite the mean squared error of $H$ as (verified numerically in the sketch below):

$$R_{sq}(H) = \frac{1}{n} \sum_{i=1}^n \big(y_i - H(x_i)\big)^2 = \frac{1}{n} \lVert \vec{e} \rVert^2 = \frac{1}{n} \lVert \vec{y} - \vec{h} \rVert^2$$
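A quick numerical sanity check of this identity, using made-up actual and predicted values (not from the lecture):

```python
import numpy as np

# Made-up actual values y and predictions h.
y = np.array([10.0, 14.0, 9.0, 12.0])
h = np.array([11.0, 13.0, 10.0, 12.5])
n = len(y)

e = y - h   # error vector

mse_as_average = np.mean((y - h) ** 2)      # (1/n) * sum of squared errors
mse_as_norm = np.linalg.norm(e) ** 2 / n    # (1/n) * ||e||^2

print(mse_as_average, mse_as_norm)          # identical values
```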

The hypothesis vector

  • The hypothesis vector is the vector $\vec{h} \in \mathbb{R}^n$ with components $H(x_i)$. This is the vector of predicted values.
  • For the linear hypothesis function $H(x) = w_0 + w_1 x$, the hypothesis vector can be written:

$$\vec{h} = \begin{bmatrix} w_0 + w_1 x_1 \\ w_0 + w_1 x_2 \\ \vdots \\ w_0 + w_1 x_n \end{bmatrix} = \begin{bmatrix} 1 & x_1 \\ 1 & x_2 \\ \vdots & \vdots \\ 1 & x_n \end{bmatrix} \begin{bmatrix} w_0 \\ w_1 \end{bmatrix}$$

Rewriting the mean squared error

  • Define the design matrix as:
  • Define the parameter vector to be .
  • Then, , so the mean squared error becomes:
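A minimal sketch of building this design matrix and evaluating $R_{sq}(\vec{w})$ in NumPy; the feature values, observations, and parameter choices below are made up.

```python
import numpy as np

# Made-up feature values x and observations y (not from the lecture).
x = np.array([8.5, 9.0, 10.25, 7.75])
y = np.array([70.0, 65.0, 55.0, 80.0])
w = np.array([120.0, -6.0])   # an arbitrary (w0, w1) to evaluate

# Design matrix: a column of 1s next to the feature column.
X = np.column_stack([np.ones(len(x)), x])

# R_sq(w) = (1/n) * ||y - Xw||^2
R_sq = np.linalg.norm(y - X @ w) ** 2 / len(y)
print(R_sq)
```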

Minimizing mean squared error, again

  • To find the optimal model parameters for simple linear regression, $w_0^*$ and $w_1^*$, we previously minimized: $R_{sq}(w_0, w_1) = \dfrac{1}{n}\sum_{i=1}^n \big(y_i - (w_0 + w_1 x_i)\big)^2$
  • Now that we’ve reframed the simple linear regression problem in terms of linear algebra, we can find $w_0^*$ and $w_1^*$ by finding the $\vec{w}^*$ that minimizes: $R_{sq}(\vec{w}) = \dfrac{1}{n}\lVert \vec{y} - X\vec{w} \rVert^2$
  • Do we already know the $\vec{w}^*$ that minimizes $R_{sq}(\vec{w})$?

An optimization problem we’ve seen before

  • The optimal parameter vector, $\vec{w}^*$, is the one that minimizes: $R_{sq}(\vec{w}) = \dfrac{1}{n}\lVert \vec{y} - X\vec{w} \rVert^2$
  • Previously, we found that $\vec{w}^* = (X^TX)^{-1}X^T\vec{y}$ minimizes the length of the error vector, $\text{error}(\vec{w}) = \lVert \vec{y} - X\vec{w} \rVert$.
  • $R_{sq}(\vec{w})$ is closely related to $\text{error}(\vec{w})$: $R_{sq}(\vec{w}) = \dfrac{1}{n}\big(\text{error}(\vec{w})\big)^2$
  • The minimizer of $\text{error}(\vec{w})$ is the same as the minimizer of $\big(\text{error}(\vec{w})\big)^2$, and dividing by $n$ doesn’t change the minimizer either!
  • Key idea: $\vec{w}^* = (X^TX)^{-1}X^T\vec{y}$ also minimizes $R_{sq}(\vec{w})$!

The optimal parameter vector, $\vec{w}^*$

  • To find the optimal model parameters for simple linear regression, $w_0^*$ and $w_1^*$, we previously minimized $R_{sq}(w_0, w_1) = \dfrac{1}{n}\sum_{i=1}^n \big(y_i - (w_0 + w_1 x_i)\big)^2$.
  • We found, using calculus, that:
  • $w_1^* = \dfrac{\sum_{i=1}^n (x_i - \bar{x})(y_i - \bar{y})}{\sum_{i=1}^n (x_i - \bar{x})^2}$.
  • $w_0^* = \bar{y} - w_1^* \bar{x}$.
  • Another way of finding optimal model parameters for simple linear regression is to find the $\vec{w}^*$ that minimizes $R_{sq}(\vec{w}) = \dfrac{1}{n}\lVert \vec{y} - X\vec{w} \rVert^2$.
  • The minimizer, if $X^TX$ is invertible, is the vector $\vec{w}^* = (X^TX)^{-1}X^T\vec{y}$.
  • These formulas are equivalent! (A small numerical comparison follows below.)
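The lecture defers this comparison to its notebook; the sketch below is an independent check with made-up data, showing the calculus formulas and the normal-equations solution produce the same parameters.

```python
import numpy as np

# Made-up (x, y) data, standing in for departure hours and commute times.
x = np.array([8.5, 9.0, 10.25, 7.75, 8.0])
y = np.array([70.0, 65.0, 55.0, 80.0, 75.0])

# (1) Calculus formulas for simple linear regression.
w1_calc = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)
w0_calc = y.mean() - w1_calc * x.mean()

# (2) Linear algebra: solve the normal equations for w* = [w0*, w1*].
X = np.column_stack([np.ones(len(x)), x])
w_star = np.linalg.solve(X.T @ X, X.T @ y)

print(w0_calc, w1_calc)
print(w_star)   # the same two numbers, as [w0*, w1*]
```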

Roadmap

  • To give us a break from math, we’ll switch to a notebook, linked here, showing that both formulas – that is, (1) the formulas for $w_0^*$ and $w_1^*$ we found using calculus, and (2) the formula for $\vec{w}^*$ we found using linear algebra – give the same results.
  • Then, we’ll use our new linear algebraic formulation of regression to incorporate multiple features in our prediction process.

Summary: Regression and linear algebra

  • Define the design matrix $X$, observation vector $\vec{y}$, and parameter vector $\vec{w}$ as:

$$X = \begin{bmatrix} 1 & x_1 \\ 1 & x_2 \\ \vdots & \vdots \\ 1 & x_n \end{bmatrix} \qquad \vec{y} = \begin{bmatrix} y_1 \\ y_2 \\ \vdots \\ y_n \end{bmatrix} \qquad \vec{w} = \begin{bmatrix} w_0 \\ w_1 \end{bmatrix}$$

  • How do we make the hypothesis vector, $\vec{h} = X\vec{w}$, as close to $\vec{y}$ as possible? Use the parameter vector $\vec{w}^*$:

$$\vec{w}^* = (X^TX)^{-1}X^T\vec{y}$$

  • We chose $\vec{w}^*$ so that $X\vec{w}^*$ is the projection of $\vec{y}$ onto the span of the columns of the design matrix, $X$.

Multiple linear regression

So far, we’ve fit simple linear regression models, which use only one feature ('departure_hour') for making predictions.

Incorporating multiple features

  • In the context of the commute times dataset, the simple linear regression model we fit was of the form:

$$\text{predicted commute time} = H(\text{departure hour}) = w_0 + w_1 \cdot \text{departure hour}$$

  • Now, we’ll try and fit a multiple linear regression model of the form:

$$\text{predicted commute time} = H(\text{departure hour}, \text{day}) = w_0 + w_1 \cdot \text{departure hour} + w_2 \cdot \text{day}$$

  • Linear regression with multiple features is called multiple linear regression.
  • How do we find $w_0^*$, $w_1^*$, and $w_2^*$?

Geometric interpretation

  • The hypothesis function:

$$H(\text{departure hour}) = w_0 + w_1 \cdot \text{departure hour}$$

    looks like a line in 2D.
  • Questions:
  • How many dimensions do we need to graph the hypothesis function:

$$H(\text{departure hour}, \text{day}) = w_0 + w_1 \cdot \text{departure hour} + w_2 \cdot \text{day}?$$

  • What is the shape of the hypothesis function?
    Our new hypothesis function is a plane in 3D!

The setup

  • Suppose we have a dataset with one row per day, containing each day’s departure hour, the day value, and the commute time.
  • We can represent each day with a feature vector, $\vec{x}_i$, containing that day’s departure hour and day value.

The hypothesis vector

  • When our hypothesis function is of the form:

$$H(\text{departure hour}, \text{day}) = w_0 + w_1 \cdot \text{departure hour} + w_2 \cdot \text{day}$$

    the hypothesis vector $\vec{h} \in \mathbb{R}^n$ can be written as:

$$\vec{h} = \begin{bmatrix} H(\text{departure hour}_1, \text{day}_1) \\ H(\text{departure hour}_2, \text{day}_2) \\ \vdots \\ H(\text{departure hour}_n, \text{day}_n) \end{bmatrix} = \begin{bmatrix} 1 & \text{departure hour}_1 & \text{day}_1 \\ 1 & \text{departure hour}_2 & \text{day}_2 \\ \vdots & \vdots & \vdots \\ 1 & \text{departure hour}_n & \text{day}_n \end{bmatrix} \begin{bmatrix} w_0 \\ w_1 \\ w_2 \end{bmatrix}$$

Finding the optimal parameters

  • To find the optimal parameter vector, $\vec{w}^*$, we can use the design matrix $X \in \mathbb{R}^{n \times 3}$ and observation vector $\vec{y} \in \mathbb{R}^n$:

$$X = \begin{bmatrix} 1 & \text{departure hour}_1 & \text{day}_1 \\ 1 & \text{departure hour}_2 & \text{day}_2 \\ \vdots & \vdots & \vdots \\ 1 & \text{departure hour}_n & \text{day}_n \end{bmatrix} \qquad \vec{y} = \begin{bmatrix} \text{commute time}_1 \\ \text{commute time}_2 \\ \vdots \\ \text{commute time}_n \end{bmatrix}$$

  • Then, all we need to do is solve the normal equations:

$$X^TX\vec{w}^* = X^T\vec{y}$$

    If $X^TX$ is invertible, we know the solution is:

$$\vec{w}^* = (X^TX)^{-1}X^T\vec{y}$$

Roadmap

  • To wrap up today’s lecture, we’ll find the optimal parameter vector $\vec{w}^*$ for our new two-feature model in code. We’ll switch back to our notebook, [linked here](http://datahub.ucsd.edu/user-redirect/git-sync?repo=https://github.com/dsc-courses/dsc40a-2024-sp&subPath=lectures/lec08/lec08-code.ipynb).
  • Next class, we’ll present a more general framing of the multiple linear regression model that uses $d$ features instead of just two.
  • We’ll also look at how we can engineer new features using existing features.
  • e.g. How can we fit a hypothesis function of the form $H(x) = w_0 + w_1 x + w_2 x^2$?
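Following up on the roadmap, here is a hedged NumPy sketch of the two-feature fit; the departure hours, days, and commute times below are invented placeholders, not the course’s dataset (the actual fit happens in the linked notebook).

```python
import numpy as np

# Invented placeholder data; the real commute-times dataset lives in the notebook.
departure_hour = np.array([8.5, 9.0, 10.25, 7.75, 8.0, 9.5])
day = np.array([2.0, 5.0, 9.0, 12.0, 16.0, 23.0])
commute_time = np.array([70.0, 65.0, 55.0, 80.0, 75.0, 62.0])

# Design matrix: an intercept column, a departure-hour column, and a day column.
X = np.column_stack([np.ones(len(departure_hour)), departure_hour, day])

# Solve the normal equations X^T X w* = X^T y for w* = [w0*, w1*, w2*].
w_star = np.linalg.solve(X.T @ X, X.T @ commute_time)
print(w_star)

# Predictions from the fitted plane in 3D.
predictions = X @ w_star
```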