So far, we’ve fit simple linear regression models, which use only one feature ('departure_hour') for making predictions.
Incorporating multiple features
In the context of the commute times dataset, the simple linear regression model we fit was of the form:
$$H(\text{departure hour}) = w_0 + w_1 \cdot \text{departure hour}$$
Now, we'll try to fit a multiple linear regression model of the form:
$$H(\vec{x}) = w_0 + w_1 x^{(1)} + w_2 x^{(2)},$$
where $x^{(1)}$ is the departure hour and $x^{(2)}$ is a second feature from the dataset.
Linear regression with multiple features is called multiple linear regression.
How do we find $w_0$, $w_1$, and $w_2$?
Geometric interpretation
The hypothesis function $H(x) = w_0 + w_1 x$ looks like a line in 2D.
Questions:
How many dimensions do we need to graph the hypothesis function $H(\vec{x}) = w_0 + w_1 x^{(1)} + w_2 x^{(2)}$?
What is the shape of the hypothesis function?
Our new hypothesis function is a plane in 3D! Our goal is to find the plane of best fit that pierces through the cloud of points.
The setup
Suppose we have the following dataset.
We can represent each day with a feature vector, $\vec{x}_i$, containing the values of both features for that day:
$$\vec{x}_i = \begin{bmatrix} x_i^{(1)} \\ x_i^{(2)} \end{bmatrix}$$
The hypothesis vector
When our hypothesis function is of the form
$$H(\vec{x}_i) = w_0 + w_1 x_i^{(1)} + w_2 x_i^{(2)},$$
the hypothesis vector $\vec{h} \in \mathbb{R}^n$ can be written as:
$$\vec{h} = \begin{bmatrix} H(\vec{x}_1) \\ H(\vec{x}_2) \\ \vdots \\ H(\vec{x}_n) \end{bmatrix} = \begin{bmatrix} 1 & x_1^{(1)} & x_1^{(2)} \\ 1 & x_2^{(1)} & x_2^{(2)} \\ \vdots & \vdots & \vdots \\ 1 & x_n^{(1)} & x_n^{(2)} \end{bmatrix} \begin{bmatrix} w_0 \\ w_1 \\ w_2 \end{bmatrix}$$
Finding the optimal parameters
To find the optimal parameter vector, $\vec{w}^*$, we can use the design matrix $X$ and observation vector $\vec{y}$:
$$X = \begin{bmatrix} 1 & x_1^{(1)} & x_1^{(2)} \\ 1 & x_2^{(1)} & x_2^{(2)} \\ \vdots & \vdots & \vdots \\ 1 & x_n^{(1)} & x_n^{(2)} \end{bmatrix}, \qquad \vec{y} = \begin{bmatrix} y_1 \\ y_2 \\ \vdots \\ y_n \end{bmatrix}$$
Then, all we need to do is solve the normal equations $X^T X \vec{w} = X^T \vec{y}$. If $X^T X$ is invertible, we know the solution is $\vec{w}^* = (X^T X)^{-1} X^T \vec{y}$.
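To make this concrete, here's a minimal sketch (not from the course notebook) of solving the normal equations with NumPy for a small two-feature model; the feature and commute-time values below are made up purely for illustration.

```python
import numpy as np

# Made-up data: each row is one day, columns are the two features
# (e.g. departure hour and a second feature); y holds commute times.
X_features = np.array([[8.5, 1.0],
                       [9.0, 3.0],
                       [7.5, 5.0],
                       [8.0, 2.0]])
y = np.array([70.0, 85.0, 60.0, 68.0])

# Design matrix: prepend a column of 1s for the intercept term w_0.
X = np.column_stack([np.ones(len(y)), X_features])

# Solve the normal equations X^T X w = X^T y.
w_star = np.linalg.solve(X.T @ X, X.T @ y)
print(w_star)  # [w_0*, w_1*, w_2*]
```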
Notation for multiple linear regression
We will need to keep track of multiple features for every individual in our dataset.
In practice, we could have hundreds or thousands of features!
As before, subscripts distinguish between individuals in our dataset. We have $n$ individuals, also called training examples.
Superscripts distinguish between features. We have $d$ features.
Think of $x^{(1)}$, $x^{(2)}$, … as new variable names, like new letters.
Augmented feature vectors
The augmented feature vector $\text{Aug}(\vec{x})$ is the vector obtained by adding a 1 to the front of feature vector $\vec{x}$:
$$\text{Aug}(\vec{x}) = \begin{bmatrix} 1 \\ x^{(1)} \\ x^{(2)} \\ \vdots \\ x^{(d)} \end{bmatrix}$$
Then, with $\vec{w} = \begin{bmatrix} w_0 & w_1 & \cdots & w_d \end{bmatrix}^T$, our hypothesis function is:
$$H(\vec{x}) = w_0 + w_1 x^{(1)} + w_2 x^{(2)} + \dots + w_d x^{(d)} = \vec{w} \cdot \text{Aug}(\vec{x})$$
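As a quick illustration (a sketch with made-up numbers, not part of the original notes), augmenting a feature vector and evaluating the hypothesis as a dot product might look like:

```python
import numpy as np

def augment(x):
    """Prepend a 1 to the feature vector x."""
    return np.concatenate(([1.0], x))

w = np.array([5.0, 2.0, -0.5])   # [w_0, w_1, w_2], made-up parameters
x = np.array([8.5, 3.0])         # one feature vector with d = 2 features

prediction = w @ augment(x)      # w_0 + w_1 * x^(1) + w_2 * x^(2)
print(prediction)                # 5 + 2*8.5 - 0.5*3 = 20.5
```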
The general problem
We have $n$ data points (training examples), $(\vec{x}_1, y_1), (\vec{x}_2, y_2), \dots, (\vec{x}_n, y_n)$,
where each $\vec{x}_i$ is a feature vector of $d$ features:
$$\vec{x}_i = \begin{bmatrix} x_i^{(1)} \\ x_i^{(2)} \\ \vdots \\ x_i^{(d)} \end{bmatrix}$$
We want to find a good linear hypothesis function:
$$H(\vec{x}) = w_0 + w_1 x^{(1)} + w_2 x^{(2)} + \dots + w_d x^{(d)} = \vec{w} \cdot \text{Aug}(\vec{x})$$
The general solution
Define the design matrix $X \in \mathbb{R}^{n \times (d+1)}$ and observation vector $\vec{y} \in \mathbb{R}^n$:
$$X = \begin{bmatrix} \text{Aug}(\vec{x}_1)^T \\ \text{Aug}(\vec{x}_2)^T \\ \vdots \\ \text{Aug}(\vec{x}_n)^T \end{bmatrix} = \begin{bmatrix} 1 & x_1^{(1)} & \cdots & x_1^{(d)} \\ 1 & x_2^{(1)} & \cdots & x_2^{(d)} \\ \vdots & \vdots & \ddots & \vdots \\ 1 & x_n^{(1)} & \cdots & x_n^{(d)} \end{bmatrix}, \qquad \vec{y} = \begin{bmatrix} y_1 \\ y_2 \\ \vdots \\ y_n \end{bmatrix}$$
Then, solve the normal equations to find the optimal parameter vector, $\vec{w}^*$:
$$X^T X \vec{w}^* = X^T \vec{y}$$
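In practice, rather than inverting $X^T X$ explicitly, it is more numerically stable to let a least-squares solver handle this. A minimal sketch (assuming a raw feature matrix and an observation vector; the random data is just for demonstration):

```python
import numpy as np

def fit_linear_regression(features, y):
    """Return the optimal parameter vector w* = [w_0, w_1, ..., w_d]."""
    n = len(y)
    X = np.column_stack([np.ones(n), features])   # design matrix
    w_star, *_ = np.linalg.lstsq(X, y, rcond=None)
    return w_star

# Example usage with made-up data (n = 5 points, d = 3 features).
rng = np.random.default_rng(42)
features = rng.normal(size=(5, 3))
y = rng.normal(size=5)
print(fit_linear_regression(features, y))
```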
Terminology for parameters
With $d$ features, $\vec{w}^*$ has $d + 1$ entries.
$w_0^*$ is the bias, also known as the intercept.
$w_1^*, w_2^*, \dots, w_d^*$ each give the weight, or coefficient, or slope, of a feature.
Interpreting parameters
Example: Predicting sales
For each of 26 stores, we have:
net sales,
square feet,
inventory,
advertising expenditure,
district size, and
number of competing stores.
Goal: Predict net sales given the other five features.
To begin, we'll try fitting a hypothesis function that uses all five features to predict sales:
$$H(\vec{x}_i) = w_0 + w_1 x_i^{(1)} + w_2 x_i^{(2)} + w_3 x_i^{(3)} + w_4 x_i^{(4)} + w_5 x_i^{(5)}$$
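A hedged sketch of what this fit might look like with pandas and NumPy. The column names below are hypothetical placeholders (the actual names in the stores dataset may differ), and the numbers are made up.

```python
import numpy as np
import pandas as pd

# Hypothetical column names; the real dataset's names may differ.
feature_cols = ['sq_ft', 'inventory', 'advertising', 'district_size', 'competitors']

def fit_sales_model(stores: pd.DataFrame):
    """Fit net_sales ~ w_0 + w_1*sq_ft + ... + w_5*competitors."""
    X = np.column_stack([np.ones(len(stores)), stores[feature_cols].to_numpy()])
    y = stores['net_sales'].to_numpy()
    w_star, *_ = np.linalg.lstsq(X, y, rcond=None)
    return w_star

# Example usage with made-up numbers (the real data lives in the course notebook).
rng = np.random.default_rng(0)
stores = pd.DataFrame(rng.uniform(1, 10, size=(8, 6)),
                      columns=feature_cols + ['net_sales'])
print(fit_sales_model(stores))
```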
Which features are most “important”?
The most important feature is not necessarily the feature with largest magnitude weight.
Features are measured in different units, i.e. different scales.
Suppose I fit one hypothesis function, $H_1$, with sales in US dollars, and another hypothesis function, $H_2$, with sales in Japanese yen (1 USD ≈ 157 yen).
Sales is just as important in both hypothesis functions.
But the weight of sales in $H_1$ will be 157 times larger than the weight of sales in $H_2$, since the values measured in dollars are 157 times smaller than the same values measured in yen.
Solution: If you care about the interpretability of the resulting weights, standardize each feature before performing regression, i.e. convert each feature to standard units.
Standard units
Recall: to convert a feature $x$ to standard units, we use the formula:
$$x_{\text{(su)}} = \frac{x - \bar{x}}{\sigma_x}$$
Example: 1, 7, 7, 9.
Mean: $\bar{x} = \frac{1 + 7 + 7 + 9}{4} = 6$.
Standard deviation: $\sigma_x = \sqrt{\frac{(1-6)^2 + (7-6)^2 + (7-6)^2 + (9-6)^2}{4}} = \sqrt{9} = 3$.
Standardized data: $-\frac{5}{3}, \ \frac{1}{3}, \ \frac{1}{3}, \ 1$.
Standard units for multiple linear regression
The result of standardizing each feature (separately!) is that the units of each feature are on the same scale.
There’s no need to standardize the outcome (net sales), since it’s not being compared to anything.
Also, we can’t standardize the column of all 1s.
Then, solve the normal equations. The resulting $w_1^*, w_2^*, \dots, w_d^*$ are called the standardized regression coefficients.
Standardized regression coefficients can be directly compared to one another.
Note that standardizing each feature does not change the MSE of the resulting hypothesis function!
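Here's a minimal sketch of this workflow with made-up data (not the stores dataset): standardize each feature separately, fit, and check that the MSE matches the fit on the raw features.

```python
import numpy as np

rng = np.random.default_rng(1)
features = rng.uniform(0, 100, size=(20, 3))   # made-up raw features
y = rng.uniform(0, 50, size=20)                # made-up outcome

def fit(feats, y):
    X = np.column_stack([np.ones(len(y)), feats])
    w, *_ = np.linalg.lstsq(X, y, rcond=None)
    return X, w

def mse(X, y, w):
    return np.mean((y - X @ w) ** 2)

# Standardize each feature separately: (value - mean) / standard deviation.
standardized = (features - features.mean(axis=0)) / features.std(axis=0)

X_raw, w_raw = fit(features, y)
X_su, w_su = fit(standardized, y)

print(w_su[1:])                                   # standardized regression coefficients
print(mse(X_raw, y, w_raw), mse(X_su, y, w_su))   # same MSE either way
```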
Once again, let’s try it out! Follow along in this notebook.
Feature engineering and transformations
Question: Would a linear hypothesis function work well on this dataset of cars, where the goal is to predict fuel efficiency (MPG) from horsepower?
A quadratic hypothesis function
It looks like there's some sort of quadratic relationship between horsepower and MPG in the scatter plot of the data. We want to try to fit a hypothesis function of the form:
$$H(\text{horsepower}_i) = w_0 + w_1 \cdot \text{horsepower}_i + w_2 \cdot \text{horsepower}_i^2$$
Note that while this is quadratic in horsepower, it is linear in the parameters!
That is, it is a linear combination of features.
We can do that by choosing our two "features" to be $\text{horsepower}_i$ and $\text{horsepower}_i^2$, respectively.
In other words, $x_i^{(1)} = \text{horsepower}_i$ and $x_i^{(2)} = \text{horsepower}_i^2$.
More generally, we can create new features out of existing features.
A quadratic hypothesis function
Desired hypothesis function: $H(\text{horsepower}_i) = w_0 + w_1 \cdot \text{horsepower}_i + w_2 \cdot \text{horsepower}_i^2$.
The resulting design matrix looks like:
$$X = \begin{bmatrix} 1 & \text{horsepower}_1 & \text{horsepower}_1^2 \\ 1 & \text{horsepower}_2 & \text{horsepower}_2^2 \\ \vdots & \vdots & \vdots \\ 1 & \text{horsepower}_n & \text{horsepower}_n^2 \end{bmatrix}$$
To find the optimal parameter vector $\vec{w}^*$, we need to solve the normal equations!
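A sketch of this fit with NumPy, using made-up horsepower and MPG values (the real data comes from the cars dataset in the notebook); the design matrix's columns are 1, horsepower, and horsepower squared.

```python
import numpy as np

# Made-up data purely for illustration.
horsepower = np.array([60.0, 90.0, 130.0, 180.0, 230.0])
mpg = np.array([38.0, 30.0, 22.0, 17.0, 15.0])

# Design matrix for H(hp) = w_0 + w_1*hp + w_2*hp^2.
X = np.column_stack([np.ones_like(horsepower), horsepower, horsepower ** 2])

w_star, *_ = np.linalg.lstsq(X, mpg, rcond=None)
print(w_star)  # [w_0*, w_1*, w_2*]
```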
More examples
What if we want to use hypothesis functions with other non-linear terms, such as higher-degree polynomials or other transformations of horsepower? The same recipe applies: each term becomes a new feature, and each new feature becomes a column of the design matrix.
Feature engineering
The process of creating new features out of existing information in our dataset is called feature engineering.
In this class, feature engineering will mostly be restricted to creating non-linear functions of existing features (as in the previous example).
In the future you’ll learn how to do other things, like encode categorical information.
You’ll be exposed to this in Homework 4, Problem 5!
Non-linear functions of multiple features
Recall our earlier example of predicting sales from square footage and number of competitors. What if we want a hypothesis function that includes a non-linear combination of these features?
The solution is to choose the design matrix accordingly: include one column per term in the hypothesis function, containing that (possibly non-linear) feature evaluated for every data point, as in the sketch below.
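The specific form used in the original example isn't reproduced here; as an illustration, suppose the hypothesis includes the product of square footage and number of competitors alongside the raw features. The values below are hypothetical.

```python
import numpy as np

# Hypothetical values for square footage, number of competing stores, and net sales.
sqft = np.array([2.0, 3.5, 1.8, 4.2, 2.9])
competitors = np.array([10.0, 4.0, 12.0, 3.0, 7.0])
net_sales = np.array([220.0, 400.0, 180.0, 470.0, 300.0])

# One column per term: 1, sqft, competitors, and the interaction sqft * competitors.
X = np.column_stack([np.ones_like(sqft), sqft, competitors, sqft * competitors])

w_star, *_ = np.linalg.lstsq(X, net_sales, rcond=None)
print(w_star)
```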
Finding the optimal parameter vector, $\vec{w}^*$
As long as the form of the hypothesis function permits us to write $\vec{h} = X\vec{w}$ for some $X$ and $\vec{w}$, the mean squared error is:
$$\frac{1}{n} \lVert \vec{y} - X\vec{w} \rVert^2$$
Regardless of the values of $X$ and $\vec{y}$, the value of $\vec{w}^*$ that minimizes mean squared error is the solution to the normal equations:
$$X^T X \vec{w}^* = X^T \vec{y}$$
Linear in the parameters
We can fit rules like, for example:
$$H(x) = w_0 + w_1 x + w_2 x^2, \qquad H(\vec{x}) = w_1 e^{x^{(1)}} + w_2 \cos\left(x^{(2)}\right) + w_3 \log\left(x^{(3)}\right)$$
This includes arbitrary polynomials.
These are all linear combinations of (possibly non-linear functions of) features.
We can't fit rules like:
$$H(x) = w_0 + e^{w_1 x}, \qquad H(x) = \sin(w_0 + w_1 x)$$
These are not linear combinations of features, because the parameters appear inside non-linear functions!
We can have any number of parameters, as long as our hypothesis function is linear in the parameters, or linear when we think of it as a function of the parameters.
Determining function form
How do we know what form our hypothesis function should take?
Sometimes, we know from theory, using knowledge about what the variables represent and how they should be related.
Other times, we make a guess based on the data.
Generally, start with simpler functions first.
Remember, the goal is to find a hypothesis function that will generalize well to unseen data.
Example: Amdahl’s Law
Amdahl's Law relates the runtime of a program on $p$ processors to the time it takes to run the sequential and non-sequential parts of the program on one processor:
$$T(p) = t_{\text{S}} + \frac{t_{\text{NS}}}{p}$$
Here, $t_{\text{S}}$ is the time for the sequential part and $t_{\text{NS}}$ is the time for the non-sequential (parallelizable) part, each measured on one processor.
Collect data by timing a program with varying numbers of processors:
| Processors | Time (Hours) |
| --- | --- |
| 1 | 8 |
| 2 | 4 |
| 4 | 3 |
Example: Fitting Amdahl's Law
How do we fit $T(p) = t_{\text{S}} + t_{\text{NS}} \cdot \frac{1}{p}$ to the timing data above? The model is linear in the parameters $t_{\text{S}}$ and $t_{\text{NS}}$, so we can use $\frac{1}{p}$ as a feature in the design matrix and solve the normal equations, as in the sketch below.
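A minimal sketch of this fit: build the design matrix with columns 1 and $\frac{1}{p}$ and solve. It uses only the three data points from the table, so the resulting numbers are illustrative.

```python
import numpy as np

processors = np.array([1.0, 2.0, 4.0])
time_hours = np.array([8.0, 4.0, 3.0])

# Amdahl's Law is linear in the parameters once we use 1/p as the feature:
# T(p) ≈ w_0 + w_1 * (1/p).
X = np.column_stack([np.ones_like(processors), 1.0 / processors])
w_star, *_ = np.linalg.lstsq(X, time_hours, rcond=None)
print(w_star)  # [estimated sequential time, estimated non-sequential time]
```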
How do we fit hypothesis functions that aren’t linear in the parameters?
Suppose we want to fit the hypothesis function:
$$H(x) = w_0 e^{w_1 x}$$
This is not linear in terms of $w_0$ and $w_1$, so our results for linear regression don't apply.
Possible solution: Try to apply a transformation.
Transformations
Question: Can we re-write $H(x) = w_0 e^{w_1 x}$ as a hypothesis function that is linear in the parameters?
Solution: Create a new hypothesis function, $T(x)$, with parameters $b_0$ and $b_1$, where $T(x) = b_0 + b_1 x$.
This hypothesis function is related to $H(x)$ by the relationship $T(x) = \log H(x)$.
$\vec{b}$ is related to $\vec{w}$ by $b_0 = \log w_0$ and $b_1 = w_1$.
Our new observation vector, $\vec{z}$, contains the logs of the original observations: $z_i = \log y_i$.
$T(x)$ is linear in its parameters, $b_0$ and $b_1$.
Use the solution to the normal equations (with $\vec{z}$ in place of $\vec{y}$) to find $\vec{b}^*$, and the relationship between $\vec{b}$ and $\vec{w}$ to find $\vec{w}^*$: $w_0^* = e^{b_0^*}$ and $w_1^* = b_1^*$.
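A sketch of this procedure with NumPy, using made-up $(x_i, y_i)$ values: fit a line to $(x_i, \log y_i)$, then undo the transformation to recover $w_0^*$ and $w_1^*$.

```python
import numpy as np

# Made-up data that roughly follows y ≈ w_0 * e^(w_1 * x).
x = np.array([0.0, 1.0, 2.0, 3.0, 4.0])
y = np.array([2.1, 5.8, 16.0, 44.5, 120.0])

# Transformed problem: log(y) ≈ b_0 + b_1 * x, which is linear in b_0 and b_1.
X = np.column_stack([np.ones_like(x), x])
b_star, *_ = np.linalg.lstsq(X, np.log(y), rcond=None)

# Undo the transformation: w_0 = e^(b_0), w_1 = b_1.
w0_star, w1_star = np.exp(b_star[0]), b_star[1]
print(w0_star, w1_star)  # roughly 2 and 1 for this made-up data
```

Note that this minimizes mean squared error on the log scale, which is not identical to minimizing the MSE of $H$ itself, but it lets us fit the curve using only the normal equations.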
Once again, let’s try it out! Follow along in this notebook.
Non-linear hypothesis functions in general
Sometimes, it’s just not possible to transform a hypothesis function to be linear in terms of some parameters.
In those cases, you’d have to resort to other methods of finding the optimal parameters.
For example, a hypothesis function like $H(x) = w_0 + \sin(w_1 x)$ can't be transformed to be linear in the parameters.
But, there are other methods of minimizing mean squared error:
One method: gradient descent, the topic of the next lecture!
Hypothesis functions that are linear in the parameters are much easier to work with.
Roadmap
This is the end of the content that’s in scope for the Midterm Exam.
On Thursday, we’ll introduce gradient descent, a technique for minimizing functions that can’t be minimized directly using calculus or linear algebra.
After the Midterm Exam, we’ll:
Look at a technique for identifying patterns in data when there is no "right answer", called clustering.