Source: https://dsc40a.com/resources/lectures/lec09/lec09-filled.pdf

Multiple Linear Regression

Multiple linear regression

So far, we’ve fit simple linear regression models, which use only one feature ('departure_hour') for making predictions.

Incorporating multiple features

  • In the context of the commute times dataset, the simple linear regression model we fit was of the form: $H(\text{departure hour}) = w_0 + w_1 \cdot \text{departure hour}$.
  • Now, we’ll try to fit a multiple linear regression model that uses two features, of the form: $H(\vec x) = w_0 + w_1 x^{(1)} + w_2 x^{(2)}$.
  • Linear regression with multiple features is called multiple linear regression.
  • How do we find $w_0$, $w_1$, and $w_2$?

Geometric interpretation

  • The hypothesis function:

    $H(x) = w_0 + w_1 x$

    looks like a line in 2D.
  • Questions:
  • How many dimensions do we need to graph the hypothesis function $H(\vec x) = w_0 + w_1 x^{(1)} + w_2 x^{(2)}$?
  • What is the shape of the hypothesis function?

Our new hypothesis function is a plane in 3D! Our goal is to find the plane of best fit that pierces through the cloud of points.

The setup

  • Suppose we have the following dataset.
  • We can represent each day with a feature vector, $\vec x_i = \begin{bmatrix} x_i^{(1)} \\ x_i^{(2)} \end{bmatrix}$.

The hypothesis vector

  • When our hypothesis function is of the form:

    $H(\vec x_i) = w_0 + w_1 x_i^{(1)} + w_2 x_i^{(2)}$

    the hypothesis vector $\vec h \in \mathbb{R}^n$, whose $i$th entry is $H(\vec x_i)$, can be written as:

    $\vec h = \begin{bmatrix} H(\vec x_1) \\ H(\vec x_2) \\ \vdots \\ H(\vec x_n) \end{bmatrix} = \begin{bmatrix} w_0 + w_1 x_1^{(1)} + w_2 x_1^{(2)} \\ w_0 + w_1 x_2^{(1)} + w_2 x_2^{(2)} \\ \vdots \\ w_0 + w_1 x_n^{(1)} + w_2 x_n^{(2)} \end{bmatrix}$

Finding the optimal parameters

  • To find the optimal parameter vector, $\vec w^*$, we can use the design matrix $X$ and observation vector $\vec y$:

    $X = \begin{bmatrix} 1 & x_1^{(1)} & x_1^{(2)} \\ 1 & x_2^{(1)} & x_2^{(2)} \\ \vdots & \vdots & \vdots \\ 1 & x_n^{(1)} & x_n^{(2)} \end{bmatrix} \qquad \vec y = \begin{bmatrix} y_1 \\ y_2 \\ \vdots \\ y_n \end{bmatrix}$

  • Then, all we need to do is solve the normal equations: $X^T X \vec w^* = X^T \vec y$. If $X^T X$ is invertible, we know the solution is: $\vec w^* = (X^T X)^{-1} X^T \vec y$.

Notation for multiple linear regression

  • We will need to keep track of multiple features for every individual in our dataset.
  • In practice, we could have hundreds or thousands of features!
  • As before, subscripts distinguish between individuals in our dataset. We have $n$ individuals, also called training examples.
  • Superscripts distinguish between features. We have $d$ features.


    Think of $x^{(1)}$, $x^{(2)}$, … as new variable names, like new letters.

Augmented feature vectors

  • The augmented feature vector $\text{Aug}(\vec x)$ is the vector obtained by adding a 1 to the front of feature vector $\vec x$:

    $\vec x = \begin{bmatrix} x^{(1)} \\ x^{(2)} \\ \vdots \\ x^{(d)} \end{bmatrix} \qquad \text{Aug}(\vec x) = \begin{bmatrix} 1 \\ x^{(1)} \\ x^{(2)} \\ \vdots \\ x^{(d)} \end{bmatrix} \qquad \vec w = \begin{bmatrix} w_0 \\ w_1 \\ \vdots \\ w_d \end{bmatrix}$

  • Then, our hypothesis function is:

    $H(\vec x) = w_0 + w_1 x^{(1)} + \cdots + w_d x^{(d)} = \vec w \cdot \text{Aug}(\vec x)$
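As a quick check of the dot product form, here is a minimal numpy sketch; the parameter and feature values are made up for illustration, not taken from the lecture:

```python
import numpy as np

def augment(x):
    # Aug(x): prepend a 1 to the front of the feature vector.
    return np.concatenate([[1.0], x])

w = np.array([2.0, 0.5, -1.0])   # hypothetical parameters [w0, w1, w2]
x = np.array([3.0, 4.0])         # a feature vector with d = 2 features

# H(x) = w . Aug(x) = w0 + w1 * x^(1) + w2 * x^(2)
prediction = np.dot(w, augment(x))
print(prediction)                # 2 + 0.5 * 3 + (-1) * 4 = -0.5
```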

The general problem

  • We have $n$ data points (training examples), $(\vec x_1, y_1), (\vec x_2, y_2), \ldots, (\vec x_n, y_n)$,
    where each $\vec x_i$ is a feature vector of $d$ features:

    $\vec x_i = \begin{bmatrix} x_i^{(1)} \\ x_i^{(2)} \\ \vdots \\ x_i^{(d)} \end{bmatrix}$

  • We want to find a good linear hypothesis function:

    $H(\vec x) = w_0 + w_1 x^{(1)} + w_2 x^{(2)} + \cdots + w_d x^{(d)} = \vec w \cdot \text{Aug}(\vec x)$

The general solution

  • Define the design matrix $X \in \mathbb{R}^{n \times (d+1)}$ and observation vector $\vec y \in \mathbb{R}^n$:

    $X = \begin{bmatrix} \text{Aug}(\vec x_1)^T \\ \text{Aug}(\vec x_2)^T \\ \vdots \\ \text{Aug}(\vec x_n)^T \end{bmatrix} = \begin{bmatrix} 1 & x_1^{(1)} & \cdots & x_1^{(d)} \\ 1 & x_2^{(1)} & \cdots & x_2^{(d)} \\ \vdots & \vdots & \ddots & \vdots \\ 1 & x_n^{(1)} & \cdots & x_n^{(d)} \end{bmatrix} \qquad \vec y = \begin{bmatrix} y_1 \\ y_2 \\ \vdots \\ y_n \end{bmatrix}$

  • Then, solve the normal equations to find the optimal parameter vector, $\vec w^*$:

    $X^T X \vec w^* = X^T \vec y$
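Here is a minimal numpy sketch of the whole recipe on a small made-up dataset (the real data lives in the accompanying notebook): build the design matrix by prepending a column of 1s, then solve the normal equations.

```python
import numpy as np

# Made-up dataset: n = 5 training examples, d = 2 features each.
features = np.array([
    [8.5, 1.0],
    [9.0, 2.0],
    [7.5, 3.0],
    [10.0, 4.0],
    [8.0, 5.0],
])
y = np.array([70.0, 85.0, 60.0, 100.0, 68.0])

# Design matrix X: a column of 1s, then one column per feature (rows are Aug(x_i)^T).
X = np.column_stack([np.ones(len(features)), features])

# Solve the normal equations X^T X w* = X^T y.
# (np.linalg.solve avoids explicitly computing the inverse of X^T X.)
w_star = np.linalg.solve(X.T @ X, X.T @ y)

print(w_star)        # [w0*, w1*, w2*]
print(X @ w_star)    # the hypothesis vector h = X w*
```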

Terminology for parameters

  • With $d$ features, $\vec w$ has $d + 1$ entries.
  • $w_0$ is the bias, also known as the intercept.
  • $w_1, w_2, \ldots, w_d$ each give the weight, or coefficient, or slope, of a feature.

Interpreting parameters

Example: Predicting sales

  • For each of 26 stores, we have:
  • net sales,
  • square feet,
  • inventory,
  • advertising expenditure,
  • district size, and
  • number of competing stores.
  • Goal: Predict net sales given the other five features.
  • To begin, we’ll try to fit a hypothesis function that predicts sales from square footage and the number of competing stores: $H(\vec x) = w_0 + w_1 \cdot \text{square feet} + w_2 \cdot \text{competitors}$.

Which features are most “important”?

  • The most important feature is not necessarily the feature with largest magnitude weight.
  • Features are measured in different units, i.e. different scales.
  • Suppose I fit one hypothesis function, $H_1$, with sales in US dollars, and another hypothesis function, $H_2$, with sales in Japanese yen (1 USD ≈ 157 yen).
  • Sales is just as important in both hypothesis functions.
  • But the weight attached to sales in $H_1$ will be 157 times larger than the weight attached to sales in $H_2$, simply because the yen values are 157 times larger numerically.
  • Solution: If you care about the interpretability of the resulting weights, standardize each feature before performing regression, i.e. convert each feature to standard units.

Standard units

  • Recall: to convert a feature to standard units, we use the formula:

    $x_i \rightarrow \dfrac{x_i - \bar{x}}{\sigma_x}$

    where $\bar{x}$ is the mean of the feature and $\sigma_x$ is its standard deviation.
  • Example: 1, 7, 7, 9.
  • Mean: $\bar{x} = \dfrac{1 + 7 + 7 + 9}{4} = 6$.
  • Standard deviation: $\sigma_x = \sqrt{\dfrac{(1-6)^2 + (7-6)^2 + (7-6)^2 + (9-6)^2}{4}} = \sqrt{9} = 3$.
  • Standardized data: $-\dfrac{5}{3}, \ \dfrac{1}{3}, \ \dfrac{1}{3}, \ 1$.
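The same arithmetic in numpy (note that np.std computes the population standard deviation by default, matching the formula above):

```python
import numpy as np

data = np.array([1, 7, 7, 9])
standardized = (data - data.mean()) / data.std()
print(data.mean(), data.std())   # 6.0 3.0
print(standardized)              # [-1.667  0.333  0.333  1.   ]
```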

Standard units for multiple linear regression

  • The result of standardizing each feature (separately!) is that the units of each feature are on the same scale.
  • There’s no need to standardize the outcome (net sales), since it’s not being compared to anything.
  • Also, we can’t standardize the column of all 1s.
  • Then, solve the normal equations. The resulting $w_1^*, w_2^*, \ldots, w_d^*$ are called the standardized regression coefficients.
  • Standardized regression coefficients can be directly compared to one another.
  • Note that standardizing each feature does not change the MSE of the resulting hypothesis function!
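Putting it together, here is a sketch of standardized multiple regression with a made-up feature matrix (not the stores dataset): standardize each feature column separately, keep the column of 1s and the outcome unstandardized, and solve the same normal equations.

```python
import numpy as np

def standard_units(col):
    # Convert one feature column to standard units.
    return (col - col.mean()) / col.std()

# Made-up features on very different scales, plus an outcome.
raw_features = np.array([
    [1200.0, 3.0],
    [2500.0, 1.0],
    [1800.0, 4.0],
    [3000.0, 2.0],
    [2200.0, 5.0],
])
y = np.array([30.0, 55.0, 38.0, 62.0, 41.0])

# Standardize each feature separately; don't touch y or the 1s column.
standardized = np.column_stack(
    [standard_units(raw_features[:, j]) for j in range(raw_features.shape[1])]
)
X = np.column_stack([np.ones(len(y)), standardized])

# The resulting w1*, ..., wd* are the standardized regression coefficients.
w_star = np.linalg.solve(X.T @ X, X.T @ y)
print(w_star[1:])   # directly comparable across features
```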

Once again, let’s try it out! Follow along in this notebook.


Feature engineering and transformations

Question: Would a linear hypothesis function work well on this dataset?

A quadratic hypothesis function

  • It looks like there’s some sort of quadratic relationship between horsepower and MPG in the last scatter plot. We’ll try to fit a hypothesis function of the form:

    $H(x) = w_0 + w_1 x + w_2 x^2$

  • Note that while this is quadratic in horsepower, it is linear in the parameters!
  • That is, it is a linear combination of features.
  • We can do that by choosing our two “features” to be $x_i$ and $x_i^2$, respectively.
  • In other words, $x_i^{(1)} = x_i$ and $x_i^{(2)} = x_i^2$.
  • More generally, we can create new features out of existing features.

A quadratic hypothesis function

  • Desired hypothesis function: $H(x) = w_0 + w_1 x + w_2 x^2$.
  • The resulting design matrix looks like:

    $X = \begin{bmatrix} 1 & x_1 & x_1^2 \\ 1 & x_2 & x_2^2 \\ \vdots & \vdots & \vdots \\ 1 & x_n & x_n^2 \end{bmatrix}$

  • To find the optimal parameter vector $\vec w^*$, we need to solve the normal equations!
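A sketch of the quadratic fit with made-up horsepower and MPG values (the real data is in the notebook): the only change from before is that the design matrix gets an $x^2$ column.

```python
import numpy as np

# Made-up horsepower values and fuel economies.
hp = np.array([95.0, 130.0, 150.0, 200.0, 225.0])
mpg = np.array([31.0, 24.0, 22.0, 16.0, 15.0])

# Design matrix for H(x) = w0 + w1*x + w2*x^2: columns 1, x, x^2.
X = np.column_stack([np.ones(len(hp)), hp, hp ** 2])

# Same normal equations as always; only the features changed.
w_star = np.linalg.solve(X.T @ X, X.T @ mpg)
print(w_star)   # [w0*, w1*, w2*]
```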

More examples

  • What if we want to use a hypothesis function of the form: ?



  • What if we want to use a hypothesis function of the form: ?

Feature engineering

  • The process of creating new features out of existing information in our dataset is called feature engineering.
  • In this class, feature engineering will mostly be restricted to creating non-linear functions of existing features (as in the previous example).
  • In the future you’ll learn how to do other things, like encode categorical information.
  • You’ll be exposed to this in Homework 4, Problem 5!

Non-linear functions of multiple features

  • Recall our earlier example of predicting sales from square footage and number of competitors. What if we want a hypothesis function that includes a non-linear combination of those two features?
  • The solution is to choose the columns of the design matrix accordingly: a column of 1s, plus one column per term in the hypothesis function.
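The specific hypothesis function from this slide isn’t reproduced here, but the pattern is the same for any non-linear combination of features. As one illustrative choice (not necessarily the one from lecture), suppose $H(\vec x) = w_0 + w_1 \cdot \text{sqft} + w_2 \cdot (\text{sqft} \times \text{competitors})$; then the design matrix just needs a column for each term:

```python
import numpy as np

# Made-up store data: square footage, number of competitors, and net sales.
sqft = np.array([1.7, 3.2, 2.5, 4.1, 2.0])
competitors = np.array([12.0, 8.0, 15.0, 5.0, 10.0])
sales = np.array([140.0, 260.0, 170.0, 330.0, 160.0])

# Columns: 1, sqft, and the product sqft * competitors.
# The product is just another engineered feature.
X = np.column_stack([np.ones(len(sqft)), sqft, sqft * competitors])

w_star = np.linalg.solve(X.T @ X, X.T @ sales)
print(w_star)
```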

Finding the optimal parameter vector, $\vec w^*$

  • As long as the form of the hypothesis function permits us to write $\vec h = X \vec w$ for some matrix $X$ and parameter vector $\vec w$, the mean squared error is:

    $R_{\text{sq}}(\vec w) = \dfrac{1}{n} \lVert \vec y - X \vec w \rVert^2$

  • Regardless of what $X$ and $\vec y$ contain, the value of $\vec w$ that minimizes $R_{\text{sq}}(\vec w)$ is the solution to the normal equations:

    $X^T X \vec w^* = X^T \vec y$

Linear in the parameters

  • We can fit rules like:

    $H(x) = w_0 + w_1 x + w_2 x^2 \qquad H(x) = w_1 e^{x} + w_2 \cos(x) \qquad H(x) = w_0 + w_1 \log x$

  • This includes arbitrary polynomials.
  • These are all linear combinations of (just) features.
  • We can’t fit rules like:

    $H(x) = w_0 + e^{w_1 x} \qquad H(x) = w_0 \cos(w_1 x + w_2)$

  • These are not linear combinations of just features!
  • We can have any number of parameters, as long as our hypothesis function is linear in the parameters, i.e., linear when we think of it as a function of the parameters.

Determining function form

  • How do we know what form our hypothesis function should take?
  • Sometimes, we know from theory, using knowledge about what the variables represent and how they should be related.
  • Other times, we make a guess based on the data.
  • Generally, start with simpler functions first.
  • Remember, the goal is to find a hypothesis function that will generalize well to unseen data.

Example: Amdahl’s Law

  • Amdahl’s Law relates the runtime of a program on $p$ processors to the time needed to perform the sequential and non-sequential (parallelizable) parts of the program on one processor:

    $H(p) = t_{\text{sequential}} + \dfrac{t_{\text{non-sequential}}}{p}$

  • Collect data by timing a program with varying numbers of processors:

    Processors    Time (Hours)
    1             8
    2             4
    4             3

Example: Fitting

Processors    Time (Hours)
1             8
2             4
4             3
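Here is a sketch of the fit using the three timings above. Assuming the hypothesis function takes the form suggested by Amdahl’s Law, $H(p) = w_0 + w_1 \cdot \frac{1}{p}$, the engineered feature is $\frac{1}{p}$:

```python
import numpy as np

# Data from the table above: processors and runtime in hours.
p = np.array([1.0, 2.0, 4.0])
t = np.array([8.0, 4.0, 3.0])

# H(p) = w0 + w1 * (1/p): linear in the parameters, with 1/p as the feature.
X = np.column_stack([np.ones(len(p)), 1.0 / p])

w_star = np.linalg.solve(X.T @ X, X.T @ t)
print(w_star)   # w0* ~ sequential time, w1* ~ parallelizable time
```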

How do we fit hypothesis functions that aren’t linear in the parameters?

  • Suppose we want to fit the hypothesis function:

    $H(x) = w_0 e^{w_1 x}$

  • This is not linear in terms of $w_0$ and $w_1$, so our results for linear regression don’t apply.
  • Possible solution: Try to apply a transformation.

Transformations

  • Question: Can we re-write $H(x) = w_0 e^{w_1 x}$ as a hypothesis function that is linear in the parameters?

Transformations

  • Solution: Create a new hypothesis function, $T(x)$, with parameters $b_0$ and $b_1$, where $T(x) = b_0 + b_1 x$.
  • This hypothesis function is related to $H(x)$ by the relationship $T(x) = \log H(x)$.
  • $\vec b$ is related to $\vec w$ by $b_0 = \log w_0$ and $b_1 = w_1$.
  • Our new observation vector, $\vec z$, is $z_i = \log y_i$.
  • $T(x)$ is linear in its parameters, $b_0$ and $b_1$.
  • Use the solution to the normal equations to find $\vec b^*$, and the relationship between $\vec b$ and $\vec w$ to find $\vec w^*$.
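A sketch of the transformation on made-up $(x, y)$ data: fit $T(x) = b_0 + b_1 x$ to $\log y$, then undo the transformation to recover $w_0$ and $w_1$. Note that this minimizes mean squared error on the logged data, which isn’t necessarily the same as minimizing it on the original data.

```python
import numpy as np

# Made-up data that roughly follows y = w0 * e^(w1 * x).
x = np.array([0.0, 1.0, 2.0, 3.0, 4.0])
y = np.array([2.1, 5.8, 16.0, 44.0, 120.0])

# Transformed problem: log(y) = log(w0) + w1 * x, i.e. T(x) = b0 + b1 * x
# with b0 = log(w0), b1 = w1, and observation vector z_i = log(y_i).
X = np.column_stack([np.ones(len(x)), x])
z = np.log(y)

b_star = np.linalg.solve(X.T @ X, X.T @ z)

# Undo the transformation to recover the parameters of H(x) = w0 * e^(w1 * x).
w0_star = np.exp(b_star[0])
w1_star = b_star[1]
print(w0_star, w1_star)
```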

Once again, let’s try it out! Follow along in this notebook.

Non-linear hypothesis functions in general

  • Sometimes, it’s just not possible to transform a hypothesis function to be linear in terms of some parameters.
  • In those cases, you’d have to resort to other methods of finding the optimal parameters.
  • For example, a hypothesis function like $H(x) = w_0 + \sin(w_1 x)$ can’t be transformed to be linear in the parameters.
  • But, there are other methods of minimizing mean squared error:
  • One method: gradient descent, the topic of the next lecture!
  • Hypothesis functions that are linear in the parameters are much easier to work with.

Roadmap

  • This is the end of the content that’s in scope for the Midterm Exam.
  • On Thursday, we’ll introduce gradient descent, a technique for minimizing functions that can’t be minimized directly using calculus or linear algebra.
  • After the Midterm Exam, we’ll:
  • Look at a technique for identifying patterns in data when there is no “right answer,” called clustering.
  • Switch gears to probability.