Source: https://dsc40a.com/resources/lectures/lec04/lec04-filled.pdf
Simple Linear Regression
Recap - Center and Spread
The relationship between $h^*$ and $R(h^*)$

- Recall, for a general loss function $L(y_i, h)$ and the constant model $H(x) = h$, empirical risk is of the form:

$$R(h) = \frac{1}{n} \sum_{i=1}^n L(y_i, h)$$

- $h^*$, the value of $h$ that minimizes empirical risk, represents the center of the dataset in some way.
- $R(h^*)$, the smallest possible value of empirical risk, represents the spread of the dataset in some way.
- The specific center and spread depend on the choice of loss function.
Examples
When using squared loss:

- $h^* = \text{Mean}(y_1, \dots, y_n)$
- $R_{sq}(h^*) = \text{Variance}(y_1, \dots, y_n)$

When using absolute loss:

- $h^* = \text{Median}(y_1, \dots, y_n)$
- $R_{abs}(h^*)$ = the mean absolute deviation from the median
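As a quick numerical sanity check of these facts, here is a minimal numpy sketch (the dataset and the brute-force grid search are illustrative choices):

```python
import numpy as np

# Illustrative dataset (made up for this sketch).
y = np.array([2.0, 3.0, 3.0, 4.0, 9.0])

# Empirical risk of a constant prediction h under each loss.
def risk_sq(h):
    return np.mean((y - h) ** 2)   # mean squared loss

def risk_abs(h):
    return np.mean(np.abs(y - h))  # mean absolute loss

# Brute-force search over candidate constants h.
hs = np.linspace(y.min(), y.max(), 10_001)
h_sq = hs[np.argmin([risk_sq(h) for h in hs])]
h_abs = hs[np.argmin([risk_abs(h) for h in hs])]

print(h_sq, np.mean(y))                # both ~= 4.2: the mean minimizes squared loss
print(h_abs, np.median(y))             # both ~= 3.0: the median minimizes absolute loss
print(risk_sq(np.mean(y)), np.var(y))  # minimum risk equals the variance: 6.16
```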
0-1 Loss

- The empirical risk for the 0-1 loss is:

$$R_{0,1}(h) = \frac{1}{n} \sum_{i=1}^n \begin{cases} 0 & y_i = h \\ 1 & y_i \neq h \end{cases}$$

- This is the proportion (between 0 and 1) of data points not equal to $h$.
- $R_{0,1}(h)$ is minimized when $h^* = \text{Mode}(y_1, \dots, y_n)$.
- Therefore, $R_{0,1}(h^*)$ is the proportion of data points not equal to the mode.
- Example: What's the proportion of values not equal to the mode in a dataset? One worked illustration appears in the sketch below.
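A minimal sketch of this 0-1 loss computation (the dataset is an illustrative choice):

```python
import numpy as np
from collections import Counter

# Illustrative dataset (made up for this sketch).
y = np.array([1, 2, 2, 2, 3, 5])

# h* for 0-1 loss is the mode: the most common value.
mode = Counter(y.tolist()).most_common(1)[0][0]

# R_{0,1}(h*): the proportion of data points not equal to the mode.
spread = np.mean(y != mode)
print(mode, spread)  # 2 0.5 -- half of the data is not clustered at the mode
```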
A poor way to measure spread

- The minimum value of $R_{0,1}(h)$ is the proportion of data points not equal to the mode.
- A higher value means less of the data is clustered at the mode.
- Just as the mode is a very basic way of measuring the center of the data, $R_{0,1}(h^*)$ is a very basic and uninformative way of measuring spread.
Summary of center and spread
- Different loss functions $L(y_i, h)$ lead to different empirical risk functions $R(h)$, which are minimized at various measures of center.
- The minimum values of empirical risk, $R(h^*)$, are various measures of spread.
- Larger values of $R(h^*)$ mean the data is more spread out.
- There are many different ways to measure both center and spread; these are sometimes called descriptive statistics.
Simple linear regression
What’s next?
- In Lecture 1, we introduced the idea of a hypothesis function, $H(x)$.
- We've focused on finding the best constant model, $H(x) = h$.
- Now that we understand the modeling recipe, we can apply it to find the best simple linear regression model, $H(x) = w_0 + w_1 x$.
- This will allow us to make predictions that aren't all the same for every data point.
Recap: Hypothesis functions and parameters
- A hypothesis function, $H(x)$, takes in an input $x$ and returns a predicted output $y$.
- Parameters define the relationship between the input and output of a hypothesis function.
- The simple linear regression model, $H(x) = w_0 + w_1 x$, has two parameters: the intercept $w_0$ and the slope $w_1$.
The modeling recipe
- Choose a model.
  - Before: $H(x) = h$
  - Now: $H(x) = w_0 + w_1 x$
- Choose a loss function.
- Minimize average loss to find optimal model parameters.
Minimizing mean squared error for the simple linear model
- We’ll choose squared loss, since it’s the easiest to minimize.
- Our goal, then, is to find the linear hypothesis function $H^*(x)$ that minimizes empirical risk:

$$R_{sq}(H) = \frac{1}{n} \sum_{i=1}^n \left(y_i - H(x_i)\right)^2$$

- Since linear hypothesis functions are of the form $H(x) = w_0 + w_1 x$, we can re-write $R_{sq}$ as a function of $w_0$ and $w_1$:

$$R_{sq}(w_0, w_1) = \frac{1}{n} \sum_{i=1}^n \left(y_i - (w_0 + w_1 x_i)\right)^2$$

- How do we find the parameters $w_0^*$ and $w_1^*$ that minimize $R_{sq}(w_0, w_1)$?
Loss surface
- For the constant model, the graph of $R_{sq}(h)$ was a parabola in two dimensions: one axis for $h$, one for empirical risk.
- What does the graph of $R_{sq}(w_0, w_1)$ look like? Since there are now two parameters, the graph is a bowl-shaped surface in three dimensions, called the loss surface.
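One way to see this is to evaluate $R_{sq}$ on a grid of parameter values and plot the result. A minimal matplotlib sketch (the dataset and grid ranges are illustrative):

```python
import numpy as np
import matplotlib.pyplot as plt

# Illustrative dataset.
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.0, 3.5, 4.0, 5.5, 6.0])

# Grid of (w0, w1) candidate parameters.
w0s, w1s = np.meshgrid(np.linspace(-2, 5, 100), np.linspace(-1, 3, 100))

# R_sq(w0, w1) evaluated at every grid point (broadcast over the data).
risk = np.mean(
    (y[:, None, None] - (w0s + w1s * x[:, None, None])) ** 2, axis=0
)

# The graph is a bowl-shaped surface in three dimensions: the loss surface.
ax = plt.figure().add_subplot(projection="3d")
ax.plot_surface(w0s, w1s, risk, cmap="viridis")
ax.set_xlabel("$w_0$")
ax.set_ylabel("$w_1$")
ax.set_zlabel("$R_{sq}(w_0, w_1)$")
plt.show()
```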
Minimizing multivariate functions
- Our goal is to find the parameters $w_0^*$ and $w_1^*$ that minimize mean squared error:

$$R_{sq}(w_0, w_1) = \frac{1}{n} \sum_{i=1}^n \left(y_i - (w_0 + w_1 x_i)\right)^2$$

- $R_{sq}$ is a function of two variables: $w_0$ and $w_1$.
- To minimize a function of multiple variables:
- Take partial derivatives with respect to each variable.
- Set all partial derivatives to 0.
- Solve the resulting system of equations.
- Ensure that you’ve found a minimum, rather than a maximum or saddle point (using the second derivative test for multivariate functions).
Example. Find the point at which a function of two variables is minimized, following the steps above; one worked illustration follows.
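Take $f(x, y) = x^2 + y^2 - 2x - 4y + 5$ (a function chosen here purely for illustration). Taking partial derivatives and setting them to 0:

$$\frac{\partial f}{\partial x} = 2x - 2 = 0 \implies x = 1 \qquad \frac{\partial f}{\partial y} = 2y - 4 = 0 \implies y = 2$$

The second derivative test confirms this is a minimum: $f_{xx} = f_{yy} = 2 > 0$, $f_{xy} = 0$, so $f_{xx} f_{yy} - f_{xy}^2 = 4 > 0$. The function is minimized at $(x, y) = (1, 2)$, where $f(1, 2) = 0$.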
Minimizing mean squared error
To find the optimal parameters $w_0^*$ and $w_1^*$:

- Find $\frac{\partial R_{sq}}{\partial w_0}(w_0, w_1)$ and set it equal to 0.
- Find $\frac{\partial R_{sq}}{\partial w_1}(w_0, w_1)$ and set it equal to 0.
- Solve the resulting system of equations.
Strategy
We have a system of two equations and two unknowns ($w_0$ and $w_1$):

$$\frac{\partial R_{sq}}{\partial w_0}(w_0, w_1) = 0 \qquad \frac{\partial R_{sq}}{\partial w_1}(w_0, w_1) = 0$$

To proceed, we'll:

- Solve for $w_0$ in the first equation. The result becomes $w_0^*$, because it's the "best intercept."
- Plug $w_0^*$ into the second equation and solve for $w_1$. The result becomes $w_1^*$, because it's the "best slope."
Solving for $w_0$

Set the first partial derivative equal to 0:

$$\frac{\partial R_{sq}}{\partial w_0}(w_0, w_1) = \frac{1}{n} \sum_{i=1}^n 2\left(y_i - (w_0 + w_1 x_i)\right)(-1) = 0$$

Dividing by $-2$ and splitting the sum gives $\frac{1}{n}\sum_{i=1}^n y_i - w_0 - w_1 \cdot \frac{1}{n}\sum_{i=1}^n x_i = 0$, so:

$$w_0^* = \bar{y} - w_1 \bar{x}$$

Solving for $w_1$

Goal: isolate $w_1$ in the second equation:

$$\frac{\partial R_{sq}}{\partial w_1}(w_0, w_1) = \frac{1}{n} \sum_{i=1}^n 2\left(y_i - (w_0 + w_1 x_i)\right)(-x_i) = 0$$

Use $w_0^* = \bar{y} - w_1 \bar{x}$: substituting and dividing by $-\frac{2}{n}$ gives

$$\sum_{i=1}^n x_i \left(y_i - \bar{y} - w_1 (x_i - \bar{x})\right) = 0 \implies w_1^* = \frac{\sum_{i=1}^n x_i (y_i - \bar{y})}{\sum_{i=1}^n x_i (x_i - \bar{x})}$$
An equivalent formula for $w_1^*$

Claim:

$$\sum_{i=1}^n x_i (y_i - \bar{y}) = \sum_{i=1}^n (x_i - \bar{x})(y_i - \bar{y})$$

Key idea: $\sum_{i=1}^n (y_i - \bar{y}) = 0$.

Proof:

$$\sum_{i=1}^n (x_i - \bar{x})(y_i - \bar{y}) = \sum_{i=1}^n x_i (y_i - \bar{y}) - \bar{x} \sum_{i=1}^n (y_i - \bar{y}) = \sum_{i=1}^n x_i (y_i - \bar{y})$$

The denominator follows a similar argument: $\sum_{i=1}^n x_i (x_i - \bar{x}) = \sum_{i=1}^n (x_i - \bar{x})^2$, using the key idea that $\sum_{i=1}^n (x_i - \bar{x}) = 0$.
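A quick numerical check of both identities on made-up data:

```python
import numpy as np

# Illustrative data.
x = np.array([1.0, 3.0, 4.0, 7.0])
y = np.array([2.0, 1.0, 5.0, 8.0])
x_bar, y_bar = x.mean(), y.mean()

# Numerator identity: sum x_i (y_i - y_bar) == sum (x_i - x_bar)(y_i - y_bar)
print(np.isclose(np.sum(x * (y - y_bar)),
                 np.sum((x - x_bar) * (y - y_bar))))  # True

# Denominator identity: sum x_i (x_i - x_bar) == sum (x_i - x_bar)^2
print(np.isclose(np.sum(x * (x - x_bar)),
                 np.sum((x - x_bar) ** 2)))           # True
```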
Least squares solutions
- The least squares solutions for the intercept $w_0^*$ and slope $w_1^*$ are:

$$w_1^* = \frac{\sum_{i=1}^n (x_i - \bar{x})(y_i - \bar{y})}{\sum_{i=1}^n (x_i - \bar{x})^2} \qquad w_0^* = \bar{y} - w_1^* \bar{x}$$

- We say $w_0^*$ and $w_1^*$ are optimal parameters, and the resulting line is called the regression line.
- The process of minimizing empirical risk to find optimal parameters is also called "fitting to the data."
- To make predictions about the future, we use $H^*(x) = w_0^* + w_1^* x$.
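These formulas translate directly into code. A minimal numpy sketch (the function name and the data are illustrative choices):

```python
import numpy as np

def fit_simple_linear(x, y):
    """Least squares intercept w0* and slope w1* for H(x) = w0 + w1 * x."""
    x_bar, y_bar = x.mean(), y.mean()
    w1 = np.sum((x - x_bar) * (y - y_bar)) / np.sum((x - x_bar) ** 2)
    w0 = y_bar - w1 * x_bar
    return w0, w1

# Illustrative data: y is roughly linear in x, plus noise.
rng = np.random.default_rng(42)
x = np.linspace(0, 10, 50)
y = 3.0 + 1.5 * x + rng.normal(0, 1, size=x.shape)

w0_star, w1_star = fit_simple_linear(x, y)
print(w0_star, w1_star)  # close to 3.0 and 1.5

# Predictions come from H*(x) = w0* + w1* x.
y_pred = w0_star + w1_star * x

# Cross-check against numpy's degree-1 polynomial fit.
print(np.polyfit(x, y, deg=1))  # [slope, intercept], should match
```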
Causality
Can we conclude that leaving later causes you to get to school earlier?

No! This is just an observed pattern; on its own, a regression line describes an association, not a causal relationship.