We started by introducing the idea of a hypothesis function, $H(x)$.
We looked at two possible models:
The constant model, $H(x) = h$.
The simple linear regression model, $H(x) = w_0 + w_1 x$.
We decided to find the best constant prediction for commute times, in minutes.
Mean squared error
Let’s suppose we have a smaller dataset of just five historical commute times, $y_1, y_2, y_3, y_4, y_5$, in minutes.
The mean squared error of the constant prediction $h$ is:
$$\frac{1}{5} \sum_{i=1}^{5} (y_i - h)^2$$
For example, if we predict a particular value of $h$, we can plug it in and compute the resulting mean squared error.
We can pick any $h$ as a prediction, but the smaller the mean squared error is, the better $h$ is!
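As a quick illustration, here is a minimal sketch of computing mean squared error in Python. The five commute times below are hypothetical stand-ins for the lecture’s dataset:

```python
# Hypothetical commute times, in minutes (stand-ins for the lecture's data).
ys = [70, 85, 60, 75, 90]

def mean_squared_error(h, ys):
    """Average of (y_i - h)^2 over the dataset."""
    return sum((y - h) ** 2 for y in ys) / len(ys)

# A prediction near the bulk of the data gives a smaller error than one far away.
print(mean_squared_error(76, ys))   # 114.0
print(mean_squared_error(100, ys))  # 690.0
```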
The best prediction
Suppose we collect $n$ commute times, $y_1, y_2, \dots, y_n$.
The mean squared error of the prediction $h$ is:
$$R_{sq}(h) = \frac{1}{n} \sum_{i=1}^n (y_i - h)^2$$
We want the best prediction, $h^*$.
The smaller $R_{sq}(h)$ is, the better $h$ is.
Goal: Find the $h$ that minimizes $R_{sq}(h)$.
The resulting $h$ will be called $h^*$.
How do we find $h^*$? Using calculus!
Minimizing mean squared error
Minimizing $R_{sq}(h)$ using calculus
We’d like to minimize:
$$R_{sq}(h) = \frac{1}{n} \sum_{i=1}^n (y_i - h)^2$$
In order to minimize $R_{sq}(h)$, we:
take its derivative with respect to $h$,
set it equal to 0,
solve for the resulting $h$, and
perform a second derivative test to ensure that we found a minimum.
Step 0: The derivative of $(y_i - h)^2$
Remember from calculus that:
if $f(h) = (y_i - h)^2$, then
$$\frac{df}{dh} = 2(y_i - h) \cdot (-1) = -2(y_i - h).$$
This is relevant because $R_{sq}(h)$ involves the sum of $n$ individual terms, each of which involves $h$.
So, to take the derivative of $R_{sq}(h)$, we’ll first need to find the derivative of $(y_i - h)^2$.
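As a sanity check of this calculus fact, a central finite difference should closely match $-2(y_i - h)$. The values of $y_i$ and $h$ below are arbitrary:

```python
# Finite-difference check: d/dh (y_i - h)^2 should equal -2(y_i - h).
y_i, h, eps = 80.0, 73.0, 1e-6

f = lambda h: (y_i - h) ** 2
numeric = (f(h + eps) - f(h - eps)) / (2 * eps)  # central difference
exact = -2 * (y_i - h)

print(numeric, exact)  # the two should agree closely: both about -14.0
```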
Step 1: The derivative of $R_{sq}(h)$
$$\frac{d}{dh} R_{sq}(h) = \frac{1}{n} \sum_{i=1}^n \frac{d}{dh} (y_i - h)^2 = \frac{1}{n} \sum_{i=1}^n -2(y_i - h) = -\frac{2}{n} \sum_{i=1}^n (y_i - h)$$
Steps 2 and 3: Set to 0 and solve for the minimizer, $h^*$
$$-\frac{2}{n} \sum_{i=1}^n (y_i - h) = 0 \implies \sum_{i=1}^n (y_i - h) = 0 \implies \sum_{i=1}^n y_i = \sum_{i=1}^n h = nh$$
(The sum $\sum_{i=1}^n h$ is just $h$ added $n$ times, which is $nh$.)
$$h^* = \frac{1}{n} \sum_{i=1}^n y_i$$
Step 4: Second derivative test
$$\frac{d^2}{dh^2} R_{sq}(h) = \frac{d}{dh} \left( -\frac{2}{n} \sum_{i=1}^n (y_i - h) \right) = \frac{1}{n} \sum_{i=1}^n 2 = 2 > 0$$
We already saw that $R_{sq}(h)$ is convex, i.e. that it opens upwards, so the $h^*$ we found must be a minimum, not a maximum.
So, $R_{sq}(h)$ opens upwards (convex), and we found a minimizer!
The mean minimizes mean squared error!
The problem we set out to solve was: find the $h$ that minimizes:
$$R_{sq}(h) = \frac{1}{n} \sum_{i=1}^n (y_i - h)^2$$
The answer is:
$$h^* = \text{Mean}(y_1, y_2, \dots, y_n) = \frac{1}{n} \sum_{i=1}^n y_i$$
The best constant prediction, in terms of mean squared error, is always the mean.
We call $h^*$ our optimal model parameter, for when we use:
the constant model, $H(x) = h$, and
the squared loss function, $L_{sq}(y_i, h) = (y_i - h)^2$.
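As a numerical sanity check (not a proof), the mean should achieve a strictly smaller MSE than nearby candidate predictions. The dataset values below are hypothetical:

```python
# Hypothetical commute times, in minutes.
ys = [70, 85, 60, 75, 90]

def mse(h):
    return sum((y - h) ** 2 for y in ys) / len(ys)

h_star = sum(ys) / len(ys)  # the mean: 76.0

# The mean beats nearby alternatives on MSE.
for h in [h_star - 5, h_star - 1, h_star + 1, h_star + 5]:
    assert mse(h_star) < mse(h)
print(h_star, mse(h_star))  # 76.0 114.0
```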
Aside: Notation
Another way of writing
$$h^* = \text{Mean}(y_1, y_2, \dots, y_n)$$
is
$$h^* = \underset{h}{\text{argmin}} \; R_{sq}(h),$$
which says that $h^*$ is the solution to an optimization problem.
The modeling recipe
We’ve implicitly introduced a three-step process for finding optimal model parameters (like $h^*$) that we can use for making predictions.
Choose a model.
Choose a loss function.
Minimize average loss to find optimal model parameters.
Another loss function
Last lecture, we started by computing the error for each of our predictions, but ran into the issue that some errors were positive and some were negative.
The solution was to square the errors, so that all are non-negative. The resulting loss function is called squared loss: $L_{sq}(y_i, H(x_i)) = (y_i - H(x_i))^2$.
Another loss function, which also measures how far $H(x_i)$ is from $y_i$, is absolute loss: $L_{abs}(y_i, H(x_i)) = |y_i - H(x_i)|$.
Squared loss vs. absolute loss
For the constant model, $H(x) = h$, so we can simplify our loss functions as follows:
Squared loss: $L_{sq}(y_i, h) = (y_i - h)^2$.
Absolute loss: $L_{abs}(y_i, h) = |y_i - h|$.
Consider, again, our example dataset of five commute times and a prediction $h$.
When we use squared loss, $h^*$ is the point at which the average squared loss is minimized.
When we use absolute loss, $h^*$ is the point at which the average absolute loss is minimized.
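A small grid search makes the contrast concrete: minimizing average squared loss lands at the mean, while minimizing average absolute loss lands at the median. The dataset below is hypothetical (its mean is 76 and its median is 75):

```python
# Hypothetical commute times, in minutes.
ys = [70, 85, 60, 75, 90]

def avg_loss(h, loss):
    return sum(loss(y, h) for y in ys) / len(ys)

squared = lambda y, h: (y - h) ** 2
absolute = lambda y, h: abs(y - h)

grid = [i / 10 for i in range(500, 1001)]  # candidate h values from 50.0 to 100.0
best_sq = min(grid, key=lambda h: avg_loss(h, squared))
best_abs = min(grid, key=lambda h: avg_loss(h, absolute))
print(best_sq, best_abs)  # 76.0 75.0 -- the mean and the median
```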
Mean absolute error
Suppose we collect $n$ commute times, $y_1, y_2, \dots, y_n$.
The average absolute loss, or mean absolute error (MAE), of the prediction $h$ is:
$$R_{abs}(h) = \frac{1}{n} \sum_{i=1}^n |y_i - h|$$
We’d like to find the best prediction, $h^*$.
Previously, we used calculus to find the optimal model parameter that minimized $R_{sq}(h)$, that is, when using squared loss.
Can we use calculus to minimize $R_{abs}(h)$, too?
Minimizing mean absolute error
Minimizing $R_{abs}(h)$ using calculus, again
We’d like to minimize:
$$R_{abs}(h) = \frac{1}{n} \sum_{i=1}^n |y_i - h|$$
In order to minimize $R_{abs}(h)$, we:
take its derivative with respect to $h$,
set it equal to 0,
solve for the resulting $h$, and
perform a second derivative test to ensure we found a minimum.
Step 0: The “derivative” of $|y_i - h|$
Remember that $|y_i - h|$ is a piecewise linear function of $h$:
$$|y_i - h| = \begin{cases} y_i - h & \text{if } h \le y_i \\ h - y_i & \text{if } h > y_i \end{cases}$$
So, $R_{abs}(h)$ is also a piecewise linear function of $h$.
What is $\frac{d}{dh} |y_i - h|$?
$$\frac{d}{dh} |y_i - h| = \begin{cases} -1 & \text{if } h < y_i \\ +1 & \text{if } h > y_i \end{cases} \qquad \text{(undefined at } h = y_i\text{)}$$
Step 1: The “derivative” of $R_{abs}(h)$
$$\frac{d}{dh} R_{abs}(h) = \frac{1}{n} \sum_{i=1}^n \frac{d}{dh} |y_i - h|$$
This is a sum of a bunch of $+1$s and $-1$s:
we add $+1$ whenever $h > y_i$ (points to the left of $h$), and
we add $-1$ whenever $h < y_i$ (points to the right of $h$).
So the slope of mean absolute error (MAE) at $h$ is:
$$\frac{d}{dh} R_{abs}(h) = \frac{1}{n} \left( \#(y_i < h) - \#(y_i > h) \right)$$
Steps 2 and 3: Set to 0 and solve for the minimizer, $h^*$
$$\frac{1}{n} \left( \#(y_i < h) - \#(y_i > h) \right) = 0$$
Multiply both sides by $n$:
$$\#(y_i < h) = \#(y_i > h)$$
The $h^*$ that minimizes mean absolute error is the one where the number of points to the left of $h$ equals the number of points to the right of $h$: the median!
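This slope rule can be checked with a finite difference on a hypothetical dataset:

```python
# Hypothetical commute times, in minutes.
ys = [70, 85, 60, 75, 90]
n = len(ys)

def mae(h):
    return sum(abs(y - h) for y in ys) / n

# Pick an h that is not equal to any data point, so the slope is defined.
h, eps = 72.0, 1e-6
numeric_slope = (mae(h + eps) - mae(h - eps)) / (2 * eps)  # central difference

# Slope predicted by the counting argument: (1/n) * (#left - #right).
count_slope = (sum(1 for y in ys if y < h) - sum(1 for y in ys if y > h)) / n
print(numeric_slope, count_slope)  # both about -0.2
```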
The median minimizes mean absolute error!
The new problem we set out to solve was: find the $h$ that minimizes:
$$R_{abs}(h) = \frac{1}{n} \sum_{i=1}^n |y_i - h|$$
The answer is:
$$h^* = \text{Median}(y_1, y_2, \dots, y_n)$$
This is because the median has an equal number of data points to the left of it and to the right of it.
To make a bit more sense of this result, let’s graph $R_{abs}(h)$.
Visualizing mean absolute error
Consider, again, our example dataset of five commute times.
Where are the “bends” in the graph of $R_{abs}(h)$, that is, where does its slope change?
The bends are at the data points!
Visualizing mean absolute error, with an even number of points
What if we add a sixth data point?
Is there a unique $h^*$?
No, $h^*$ is not unique!
Any value between the two middle data points (when sorted) minimizes mean absolute error!
$R_{abs}(h)$ is still convex, so any critical point is guaranteed to be a minimum.
The median minimizes mean absolute error!
The new problem we set out to solve was: find the $h$ that minimizes:
$$R_{abs}(h) = \frac{1}{n} \sum_{i=1}^n |y_i - h|$$
The answer is:
$$h^* = \text{Median}(y_1, y_2, \dots, y_n)$$
The best constant prediction, in terms of mean absolute error, is always the median.
When $n$ is odd, this answer is unique.
When $n$ is even, any number between the middle two data points (when sorted) also minimizes mean absolute error.
When $n$ is even, define the median to be the mean of the middle two data points.
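The even-$n$ case is easy to see numerically; with six hypothetical commute times, every $h$ between the two middle values attains the same minimal MAE:

```python
# Six hypothetical commute times; sorted, the middle two values are 75 and 80.
ys = [70, 85, 60, 75, 90, 80]

def mae(h):
    return sum(abs(y - h) for y in ys) / len(ys)

inside = [mae(h) for h in [75, 76.5, 78, 80]]  # all h in [75, 80]
print(inside)  # all four values are equal

# Moving outside the interval strictly increases the error.
assert mae(74) > inside[0] and mae(81) > inside[0]
```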
The modeling recipe, again
We’ve now made two full passes through our “modeling recipe.”
Choose a model. $H(x) = h$
Choose a loss function. $L_{sq}(y_i, h) = (y_i - h)^2$, $L_{abs}(y_i, h) = |y_i - h|$
Minimize average loss to find optimal model parameters. $h^* = \text{Mean}(y_1, \dots, y_n)$, $h^* = \text{Median}(y_1, \dots, y_n)$
Empirical risk minimization
The formal name for the process of minimizing average loss is empirical risk minimization.
Another name for “average loss” is empirical risk.
When we use the squared loss function, $L_{sq}(y_i, h) = (y_i - h)^2$, the corresponding empirical risk is mean squared error:
$$R_{sq}(h) = \frac{1}{n} \sum_{i=1}^n (y_i - h)^2$$
When we use the absolute loss function, $L_{abs}(y_i, h) = |y_i - h|$, the corresponding empirical risk is mean absolute error:
$$R_{abs}(h) = \frac{1}{n} \sum_{i=1}^n |y_i - h|$$
Empirical risk minimization, in general
Key idea: If $L(y_i, h)$ is any loss function, the corresponding empirical risk is:
$$R(h) = \frac{1}{n} \sum_{i=1}^n L(y_i, h)$$
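This key idea can be sketched as a small helper that averages any loss over the data; the dataset and the choice $h = 76$ below are hypothetical:

```python
# A general empirical-risk helper: average any loss L(y_i, h) over the data.
def empirical_risk(loss, ys, h):
    return sum(loss(y, h) for y in ys) / len(ys)

ys = [70, 85, 60, 75, 90]  # hypothetical commute times, in minutes

squared = lambda y, h: (y - h) ** 2
absolute = lambda y, h: abs(y - h)

print(empirical_risk(squared, ys, 76))   # mean squared error at h = 76: 114.0
print(empirical_risk(absolute, ys, 76))  # mean absolute error at h = 76: 9.2
```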
Summary, next time
$h^* = \text{Mean}(y_1, \dots, y_n)$ minimizes mean squared error, $R_{sq}(h) = \frac{1}{n} \sum_{i=1}^n (y_i - h)^2$.
$h^* = \text{Median}(y_1, \dots, y_n)$ minimizes mean absolute error, $R_{abs}(h) = \frac{1}{n} \sum_{i=1}^n |y_i - h|$.
$R_{sq}(h)$ and $R_{abs}(h)$ are examples of empirical risk, that is, average loss.
Next time: What’s the relationship between the mean and median? What is the significance of $R_{sq}(h^*)$ and $R_{abs}(h^*)$?
Practice exam problem
Define the extreme mean (EM) of a dataset to be the average of its largest and smallest values. Let $h^{EM}$ denote the extreme mean of $y_1, \dots, y_n$.
Show that for any dataset ,