So far, we’ve fit simple linear regression models, which use only one feature ('departure_hour') for making predictions.
Incorporating multiple features
In the context of the commute times dataset, the simple linear regression model we fit was of the form:
$$H(\text{departure hour}) = w_0 + w_1 \cdot \text{departure hour}$$
Now, we'll try to fit a multiple linear regression model of the form:
$$H(\vec{x}) = w_0 + w_1 x^{(1)} + w_2 x^{(2)},$$
where $x^{(1)}$ is the departure hour and $x^{(2)}$ is a second feature from the dataset.
Linear regression with multiple features is called multiple linear regression.
How do we find $w_0$, $w_1$, and $w_2$?
Geometric interpretation
The hypothesis function $H(x) = w_0 + w_1 x$ looks like a line in 2D.
Questions:
How many dimensions do we need to graph the hypothesis function $H(\vec{x}) = w_0 + w_1 x^{(1)} + w_2 x^{(2)}$?
What is the shape of the hypothesis function?
Our new hypothesis function is a plane in 3D! Our goal is to find the plane of best fit that pierces through the cloud of points.
The setup
Suppose we have the following dataset.
We can represent each day with a feature vector, $\vec{x}_i$, containing the values of both features for that day:
$$\vec{x}_i = \begin{bmatrix} x_i^{(1)} \\ x_i^{(2)} \end{bmatrix}$$
The hypothesis vector
When our hypothesis function is of the form
$$H(\vec{x}_i) = w_0 + w_1 x_i^{(1)} + w_2 x_i^{(2)},$$
the hypothesis vector $\vec{h} \in \mathbb{R}^n$ can be written as:
$$\vec{h} = \begin{bmatrix} H(\vec{x}_1) \\ H(\vec{x}_2) \\ \vdots \\ H(\vec{x}_n) \end{bmatrix} = \begin{bmatrix} 1 & x_1^{(1)} & x_1^{(2)} \\ 1 & x_2^{(1)} & x_2^{(2)} \\ \vdots & \vdots & \vdots \\ 1 & x_n^{(1)} & x_n^{(2)} \end{bmatrix} \begin{bmatrix} w_0 \\ w_1 \\ w_2 \end{bmatrix}$$
Finding the optimal parameters
To find the optimal parameter vector, $\vec{w}^*$, we can use the design matrix $X$ and observation vector $\vec{y}$:
$$X = \begin{bmatrix} 1 & x_1^{(1)} & x_1^{(2)} \\ 1 & x_2^{(1)} & x_2^{(2)} \\ \vdots & \vdots & \vdots \\ 1 & x_n^{(1)} & x_n^{(2)} \end{bmatrix}, \qquad \vec{y} = \begin{bmatrix} y_1 \\ y_2 \\ \vdots \\ y_n \end{bmatrix}$$
Then, all we need to do is solve the normal equations $X^T X \vec{w} = X^T \vec{y}$. If $X^T X$ is invertible, we know the solution is $\vec{w}^* = (X^T X)^{-1} X^T \vec{y}$.
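To make this concrete, here's a minimal sketch (not from the course notebook) of solving the normal equations with NumPy for a small two-feature model; the feature and commute-time values below are made up purely for illustration.

```python
import numpy as np

# Made-up data: each row is one day, columns are the two features
# (e.g. departure hour and a second feature); y holds commute times.
X_features = np.array([[8.5, 1.0],
                       [9.0, 3.0],
                       [7.5, 5.0],
                       [8.0, 2.0]])
y = np.array([70.0, 85.0, 60.0, 68.0])

# Design matrix: prepend a column of 1s for the intercept term w_0.
X = np.column_stack([np.ones(len(y)), X_features])

# Solve the normal equations X^T X w = X^T y.
w_star = np.linalg.solve(X.T @ X, X.T @ y)
print(w_star)  # [w_0*, w_1*, w_2*]
```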
Notation for multiple linear regression
We will need to keep track of multiple features for every individual in our dataset.
In practice, we could have hundreds or thousands of features!
As before, subscripts distinguish between individuals in our dataset. We have $n$ individuals, also called training examples.
Superscripts distinguish between features. We have $d$ features.
Think of $x^{(1)}$, $x^{(2)}$, … as new variable names, like new letters.
Augmented feature vectors
The augmented feature vector $\text{Aug}(\vec{x})$ is the vector obtained by adding a 1 to the front of feature vector $\vec{x}$:
$$\text{Aug}(\vec{x}) = \begin{bmatrix} 1 \\ x^{(1)} \\ x^{(2)} \\ \vdots \\ x^{(d)} \end{bmatrix}$$
Then, with $\vec{w} = \begin{bmatrix} w_0 & w_1 & \cdots & w_d \end{bmatrix}^T$, our hypothesis function is:
$$H(\vec{x}) = w_0 + w_1 x^{(1)} + w_2 x^{(2)} + \dots + w_d x^{(d)} = \vec{w} \cdot \text{Aug}(\vec{x})$$
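As a quick illustration (a sketch with made-up numbers, not part of the original notes), augmenting a feature vector and evaluating the hypothesis as a dot product might look like:

```python
import numpy as np

def augment(x):
    """Prepend a 1 to the feature vector x."""
    return np.concatenate(([1.0], x))

w = np.array([5.0, 2.0, -0.5])   # [w_0, w_1, w_2], made-up parameters
x = np.array([8.5, 3.0])         # one feature vector with d = 2 features

prediction = w @ augment(x)      # w_0 + w_1 * x^(1) + w_2 * x^(2)
print(prediction)                # 5 + 2*8.5 - 0.5*3 = 20.5
```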
The general problem
We have $n$ data points (training examples), $(\vec{x}_1, y_1), (\vec{x}_2, y_2), \dots, (\vec{x}_n, y_n)$,
where each $\vec{x}_i$ is a feature vector of $d$ features:
$$\vec{x}_i = \begin{bmatrix} x_i^{(1)} \\ x_i^{(2)} \\ \vdots \\ x_i^{(d)} \end{bmatrix}$$
We want to find a good linear hypothesis function:
$$H(\vec{x}) = w_0 + w_1 x^{(1)} + w_2 x^{(2)} + \dots + w_d x^{(d)} = \vec{w} \cdot \text{Aug}(\vec{x})$$
The general solution
Define the design matrix $X \in \mathbb{R}^{n \times (d+1)}$ and observation vector $\vec{y} \in \mathbb{R}^n$:
$$X = \begin{bmatrix} \text{Aug}(\vec{x}_1)^T \\ \text{Aug}(\vec{x}_2)^T \\ \vdots \\ \text{Aug}(\vec{x}_n)^T \end{bmatrix} = \begin{bmatrix} 1 & x_1^{(1)} & \cdots & x_1^{(d)} \\ 1 & x_2^{(1)} & \cdots & x_2^{(d)} \\ \vdots & \vdots & \ddots & \vdots \\ 1 & x_n^{(1)} & \cdots & x_n^{(d)} \end{bmatrix}, \qquad \vec{y} = \begin{bmatrix} y_1 \\ y_2 \\ \vdots \\ y_n \end{bmatrix}$$
Then, solve the normal equations to find the optimal parameter vector, $\vec{w}^*$:
$$X^T X \vec{w}^* = X^T \vec{y}$$
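In practice, rather than inverting $X^T X$ explicitly, it is more numerically stable to let a least-squares solver handle this. A minimal sketch (assuming a raw feature matrix and an observation vector; the random data is just for demonstration):

```python
import numpy as np

def fit_linear_regression(features, y):
    """Return the optimal parameter vector w* = [w_0, w_1, ..., w_d]."""
    n = len(y)
    X = np.column_stack([np.ones(n), features])   # design matrix
    w_star, *_ = np.linalg.lstsq(X, y, rcond=None)
    return w_star

# Example usage with made-up data (n = 5 points, d = 3 features).
rng = np.random.default_rng(42)
features = rng.normal(size=(5, 3))
y = rng.normal(size=5)
print(fit_linear_regression(features, y))
```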
Terminology for parameters
With $d$ features, $\vec{w}^*$ has $d + 1$ entries.
$w_0^*$ is the bias, also known as the intercept.
$w_1^*, w_2^*, \dots, w_d^*$ each give the weight, or coefficient, or slope, of a feature.
Interpreting parameters
Example: Predicting sales
For each of 26 stores, we have:
net sales,
square feet,
inventory,
advertising expenditure,
district size, and
number of competing stores.
Goal: Predict net sales given the other five features.
To begin, we'll try fitting a hypothesis function that uses all five features to predict sales:
$$H(\vec{x}_i) = w_0 + w_1 x_i^{(1)} + w_2 x_i^{(2)} + w_3 x_i^{(3)} + w_4 x_i^{(4)} + w_5 x_i^{(5)}$$
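A hedged sketch of what this fit might look like with pandas and NumPy. The column names below are hypothetical placeholders (the actual names in the stores dataset may differ), and the numbers are made up.

```python
import numpy as np
import pandas as pd

# Hypothetical column names; the real dataset's names may differ.
feature_cols = ['sq_ft', 'inventory', 'advertising', 'district_size', 'competitors']

def fit_sales_model(stores: pd.DataFrame):
    """Fit net_sales ~ w_0 + w_1*sq_ft + ... + w_5*competitors."""
    X = np.column_stack([np.ones(len(stores)), stores[feature_cols].to_numpy()])
    y = stores['net_sales'].to_numpy()
    w_star, *_ = np.linalg.lstsq(X, y, rcond=None)
    return w_star

# Example usage with made-up numbers (the real data lives in the course notebook).
rng = np.random.default_rng(0)
stores = pd.DataFrame(rng.uniform(1, 10, size=(8, 6)),
                      columns=feature_cols + ['net_sales'])
print(fit_sales_model(stores))
```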
Which features are most “important”?
The most important feature is not necessarily the feature with largest magnitude weight.
Features are measured in different units, i.e. different scales.
Suppose I fit one hypothesis function, $H_1$, with sales in US dollars, and another hypothesis function, $H_2$, with sales in Japanese yen (1 USD ≈ 157 yen).
Sales is just as important in both hypothesis functions.
But the weight of sales in $H_1$ will be 157 times larger than the weight of sales in $H_2$, since the values measured in dollars are 157 times smaller than the same values measured in yen.
Solution: If you care about the interpretability of the resulting weights, standardize each feature before performing regression, i.e. convert each feature to standard units.
Standard units
Recall: to convert a feature $x$ to standard units, we use the formula:
$$x_{\text{(su)}} = \frac{x - \bar{x}}{\sigma_x}$$
Example: 1, 7, 7, 9.
Mean: $\bar{x} = \frac{1 + 7 + 7 + 9}{4} = 6$.
Standard deviation: $\sigma_x = \sqrt{\frac{(1-6)^2 + (7-6)^2 + (7-6)^2 + (9-6)^2}{4}} = \sqrt{9} = 3$.
Standardized data: $-\frac{5}{3}, \ \frac{1}{3}, \ \frac{1}{3}, \ 1$.
Standard units for multiple linear regression
The result of standardizing each feature (separately!) is that the units of each feature are on the same scale.
There’s no need to standardize the outcome (net sales), since it’s not being compared to anything.
Also, we can’t standardize the column of all 1s.
Then, solve the normal equations. The resulting $w_1^*, w_2^*, \dots, w_d^*$ are called the standardized regression coefficients.
Standardized regression coefficients can be directly compared to one another.
Note that standardizing each feature does not change the MSE of the resulting hypothesis function!
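Here's a minimal sketch of this workflow with made-up data (not the stores dataset): standardize each feature separately, fit, and check that the MSE matches the fit on the raw features.

```python
import numpy as np

rng = np.random.default_rng(1)
features = rng.uniform(0, 100, size=(20, 3))   # made-up raw features
y = rng.uniform(0, 50, size=20)                # made-up outcome

def fit(feats, y):
    X = np.column_stack([np.ones(len(y)), feats])
    w, *_ = np.linalg.lstsq(X, y, rcond=None)
    return X, w

def mse(X, y, w):
    return np.mean((y - X @ w) ** 2)

# Standardize each feature separately: (value - mean) / standard deviation.
standardized = (features - features.mean(axis=0)) / features.std(axis=0)

X_raw, w_raw = fit(features, y)
X_su, w_su = fit(standardized, y)

print(w_su[1:])                                   # standardized regression coefficients
print(mse(X_raw, y, w_raw), mse(X_su, y, w_su))   # same MSE either way
```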
Once again, let’s try it out! Follow along in this notebook.
Feature engineering and transformations
Question: Would a linear hypothesis function work well on this dataset of cars, where the goal is to predict fuel efficiency (MPG) from horsepower?
A quadratic hypothesis function
It looks like there's some sort of quadratic relationship between horsepower and MPG in the scatter plot of the data. We want to try to fit a hypothesis function of the form:
$$H(\text{horsepower}_i) = w_0 + w_1 \cdot \text{horsepower}_i + w_2 \cdot \text{horsepower}_i^2$$
Note that while this is quadratic in horsepower, it is linear in the parameters!
That is, it is a linear combination of features.
We can do that by choosing our two "features" to be $\text{horsepower}_i$ and $\text{horsepower}_i^2$, respectively.
In other words, $x_i^{(1)} = \text{horsepower}_i$ and $x_i^{(2)} = \text{horsepower}_i^2$.
More generally, we can create new features out of existing features.
A quadratic hypothesis function
Desired hypothesis function: $H(\text{horsepower}_i) = w_0 + w_1 \cdot \text{horsepower}_i + w_2 \cdot \text{horsepower}_i^2$.
The resulting design matrix looks like:
$$X = \begin{bmatrix} 1 & \text{horsepower}_1 & \text{horsepower}_1^2 \\ 1 & \text{horsepower}_2 & \text{horsepower}_2^2 \\ \vdots & \vdots & \vdots \\ 1 & \text{horsepower}_n & \text{horsepower}_n^2 \end{bmatrix}$$
To find the optimal parameter vector $\vec{w}^*$, we need to solve the normal equations!
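A sketch of this fit with NumPy, using made-up horsepower and MPG values (the real data comes from the cars dataset in the notebook); the design matrix's columns are 1, horsepower, and horsepower squared.

```python
import numpy as np

# Made-up data purely for illustration.
horsepower = np.array([60.0, 90.0, 130.0, 180.0, 230.0])
mpg = np.array([38.0, 30.0, 22.0, 17.0, 15.0])

# Design matrix for H(hp) = w_0 + w_1*hp + w_2*hp^2.
X = np.column_stack([np.ones_like(horsepower), horsepower, horsepower ** 2])

w_star, *_ = np.linalg.lstsq(X, mpg, rcond=None)
print(w_star)  # [w_0*, w_1*, w_2*]
```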
More examples
What if we want to use hypothesis functions with other non-linear terms, such as higher-degree polynomials or other transformations of horsepower? The same recipe applies: each term becomes a new feature, and each new feature becomes a column of the design matrix.
Feature engineering
The process of creating new features out of existing information in our dataset is called feature engineering.
In this class, feature engineering will mostly be restricted to creating non-linear functions of existing features (as in the previous example).
In the future you’ll learn how to do other things, like encode categorical information.
You’ll be exposed to this in Homework 4, Problem 5!
Non-linear functions of multiple features
Recall our earlier example of predicting sales from square footage and number of competitors. What if we want a hypothesis function that includes a non-linear combination of these features?
The solution is to choose the design matrix accordingly: include one column per term in the hypothesis function, containing that (possibly non-linear) feature evaluated for every data point, as in the sketch below.
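The specific form used in the original example isn't reproduced here; as an illustration, suppose the hypothesis includes the product of square footage and number of competitors alongside the raw features. The values below are hypothetical.

```python
import numpy as np

# Hypothetical values for square footage, number of competing stores, and net sales.
sqft = np.array([2.0, 3.5, 1.8, 4.2, 2.9])
competitors = np.array([10.0, 4.0, 12.0, 3.0, 7.0])
net_sales = np.array([220.0, 400.0, 180.0, 470.0, 300.0])

# One column per term: 1, sqft, competitors, and the interaction sqft * competitors.
X = np.column_stack([np.ones_like(sqft), sqft, competitors, sqft * competitors])

w_star, *_ = np.linalg.lstsq(X, net_sales, rcond=None)
print(w_star)
```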
Finding the optimal parameter vector, $\vec{w}^*$
As long as the form of the hypothesis function permits us to write $\vec{h} = X\vec{w}$ for some $X$ and $\vec{w}$, the mean squared error is:
$$\frac{1}{n} \lVert \vec{y} - X\vec{w} \rVert^2$$
Regardless of the values of $X$ and $\vec{y}$, the value of $\vec{w}^*$ that minimizes mean squared error is the solution to the normal equations:
$$X^T X \vec{w}^* = X^T \vec{y}$$
Linear in the parameters
We can fit rules like, for example:
$$H(x) = w_0 + w_1 x + w_2 x^2, \qquad H(\vec{x}) = w_1 e^{x^{(1)}} + w_2 \cos\left(x^{(2)}\right) + w_3 \log\left(x^{(3)}\right)$$
This includes arbitrary polynomials.
These are all linear combinations of (possibly non-linear functions of) features.
We can't fit rules like:
$$H(x) = w_0 + e^{w_1 x}, \qquad H(x) = \sin(w_0 + w_1 x)$$
These are not linear combinations of features, because the parameters appear inside non-linear functions!
We can have any number of parameters, as long as our hypothesis function is linear in the parameters, or linear when we think of it as a function of the parameters.
Determining function form
How do we know what form our hypothesis function should take?
Sometimes, we know from theory, using knowledge about what the variables represent and how they should be related.
Other times, we make a guess based on the data.
Generally, start with simpler functions first.
Remember, the goal is to find a hypothesis function that will generalize well to unseen data.
Example: Amdahl’s Law
Amdahl's Law relates the runtime of a program on $p$ processors to the time it takes to run the sequential and non-sequential parts of the program on one processor:
$$T(p) = t_{\text{S}} + \frac{t_{\text{NS}}}{p}$$
Here, $t_{\text{S}}$ is the time for the sequential part and $t_{\text{NS}}$ is the time for the non-sequential (parallelizable) part, each measured on one processor.
Collect data by timing a program with varying numbers of processors:
| Processors | Time (Hours) |
| --- | --- |
| 1 | 8 |
| 2 | 4 |
| 4 | 3 |
Example: Fitting Amdahl's Law
How do we fit $T(p) = t_{\text{S}} + t_{\text{NS}} \cdot \frac{1}{p}$ to the timing data above? The model is linear in the parameters $t_{\text{S}}$ and $t_{\text{NS}}$, so we can use $\frac{1}{p}$ as a feature in the design matrix and solve the normal equations, as in the sketch below.
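A minimal sketch of this fit: build the design matrix with columns 1 and $\frac{1}{p}$ and solve. It uses only the three data points from the table, so the resulting numbers are illustrative.

```python
import numpy as np

processors = np.array([1.0, 2.0, 4.0])
time_hours = np.array([8.0, 4.0, 3.0])

# Amdahl's Law is linear in the parameters once we use 1/p as the feature:
# T(p) ≈ w_0 + w_1 * (1/p).
X = np.column_stack([np.ones_like(processors), 1.0 / processors])
w_star, *_ = np.linalg.lstsq(X, time_hours, rcond=None)
print(w_star)  # [estimated sequential time, estimated non-sequential time]
```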
How do we fit hypothesis functions that aren’t linear in the parameters?
Suppose we want to fit the hypothesis function:
$$H(x) = w_0 e^{w_1 x}$$
This is not linear in terms of $w_0$ and $w_1$, so our results for linear regression don't apply.
Possible solution: Try to apply a transformation.
Transformations
Question: Can we re-write $H(x) = w_0 e^{w_1 x}$ as a hypothesis function that is linear in the parameters?
Solution: Create a new hypothesis function, $T(x)$, with parameters $b_0$ and $b_1$, where $T(x) = b_0 + b_1 x$.
This hypothesis function is related to $H(x)$ by the relationship $T(x) = \log H(x)$.
$\vec{b}$ is related to $\vec{w}$ by $b_0 = \log w_0$ and $b_1 = w_1$.
Our new observation vector, $\vec{z}$, contains the logs of the original observations: $z_i = \log y_i$.
$T(x)$ is linear in its parameters, $b_0$ and $b_1$.
Use the solution to the normal equations (with $\vec{z}$ in place of $\vec{y}$) to find $\vec{b}^*$, and the relationship between $\vec{b}$ and $\vec{w}$ to find $\vec{w}^*$: $w_0^* = e^{b_0^*}$ and $w_1^* = b_1^*$.
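A sketch of this procedure with NumPy, using made-up $(x_i, y_i)$ values: fit a line to $(x_i, \log y_i)$, then undo the transformation to recover $w_0^*$ and $w_1^*$.

```python
import numpy as np

# Made-up data that roughly follows y ≈ w_0 * e^(w_1 * x).
x = np.array([0.0, 1.0, 2.0, 3.0, 4.0])
y = np.array([2.1, 5.8, 16.0, 44.5, 120.0])

# Transformed problem: log(y) ≈ b_0 + b_1 * x, which is linear in b_0 and b_1.
X = np.column_stack([np.ones_like(x), x])
b_star, *_ = np.linalg.lstsq(X, np.log(y), rcond=None)

# Undo the transformation: w_0 = e^(b_0), w_1 = b_1.
w0_star, w1_star = np.exp(b_star[0]), b_star[1]
print(w0_star, w1_star)  # roughly 2 and 1 for this made-up data
```

Note that this minimizes mean squared error on the log scale, which is not identical to minimizing the MSE of $H$ itself, but it lets us fit the curve using only the normal equations.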
Once again, let’s try it out! Follow along in this notebook.
Non-linear hypothesis functions in general
Sometimes, it’s just not possible to transform a hypothesis function to be linear in terms of some parameters.
In those cases, you’d have to resort to other methods of finding the optimal parameters.
For example, a hypothesis function like $H(x) = w_0 + \sin(w_1 x)$ can't be transformed to be linear in the parameters.
But, there are other methods of minimizing mean squared error:
One method: gradient descent, the topic of the next lecture!
Hypothesis functions that are linear in the parameters are much easier to work with.
Roadmap
This is the end of the content that’s in scope for the Midterm Exam.
On Thursday, we’ll introduce gradient descent, a technique for minimizing functions that can’t be minimized directly using calculus or linear algebra.
After the Midterm Exam, we’ll:
Look at a technique for identifying patterns in data when there is no "right answer", called clustering.