We started by introducing the idea of a hypothesis function, $H(x)$.
We looked at two possible models:
The constant model, $H(x) = h$.
The simple linear regression model, $H(x) = w_0 + w_1 x$.
We decided to find the best constant prediction for commute times, in minutes.
Mean squared error
Let’s suppose we have a smaller dataset of just five historical commute times, $y_1, y_2, y_3, y_4, y_5$, in minutes.
The mean squared error of the constant prediction $h$ is:
$$\frac{1}{5} \sum_{i=1}^{5} (y_i - h)^2$$
For example, if we predict a particular value of $h$, we can plug it in and compute the resulting mean squared error.
We can pick any $h$ as a prediction, but the smaller the mean squared error is, the better $h$ is!
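As a quick illustration, here is a minimal sketch of computing mean squared error in Python. The five commute times below are hypothetical stand-ins for the lecture’s dataset:

```python
# Hypothetical commute times, in minutes (stand-ins for the lecture's data).
ys = [70, 85, 60, 75, 90]

def mean_squared_error(h, ys):
    """Average of (y_i - h)^2 over the dataset."""
    return sum((y - h) ** 2 for y in ys) / len(ys)

# A prediction near the bulk of the data gives a smaller error than one far away.
print(mean_squared_error(76, ys))   # 114.0
print(mean_squared_error(100, ys))  # 690.0
```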
The best prediction
Suppose we collect $n$ commute times, $y_1, y_2, \dots, y_n$.
The mean squared error of the prediction $h$ is:
$$R_{sq}(h) = \frac{1}{n} \sum_{i=1}^n (y_i - h)^2$$
We want the best prediction, $h^*$.
The smaller $R_{sq}(h)$ is, the better $h$ is.
Goal: Find the $h$ that minimizes $R_{sq}(h)$.
The resulting $h$ will be called $h^*$.
How do we find $h^*$? Using calculus!
Minimizing mean squared error
Minimizing $R_{sq}(h)$ using calculus
We’d like to minimize:
$$R_{sq}(h) = \frac{1}{n} \sum_{i=1}^n (y_i - h)^2$$
In order to minimize $R_{sq}(h)$, we:
take its derivative with respect to $h$,
set it equal to 0,
solve for the resulting $h$, and
perform a second derivative test to ensure that we found a minimum.
Step 0: The derivative of $(y_i - h)^2$
Remember from calculus that:
if $f(h) = (y_i - h)^2$, then
$$\frac{df}{dh} = 2(y_i - h) \cdot (-1) = -2(y_i - h).$$
This is relevant because $R_{sq}(h)$ involves the sum of $n$ individual terms, each of which involves $h$.
So, to take the derivative of $R_{sq}(h)$, we’ll first need to find the derivative of $(y_i - h)^2$.
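As a sanity check of this calculus fact, a central finite difference should closely match $-2(y_i - h)$. The values of $y_i$ and $h$ below are arbitrary:

```python
# Finite-difference check: d/dh (y_i - h)^2 should equal -2(y_i - h).
y_i, h, eps = 80.0, 73.0, 1e-6

f = lambda h: (y_i - h) ** 2
numeric = (f(h + eps) - f(h - eps)) / (2 * eps)  # central difference
exact = -2 * (y_i - h)

print(numeric, exact)  # the two should agree closely: both about -14.0
```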
Step 1: The derivative of $R_{sq}(h)$
$$\frac{d}{dh} R_{sq}(h) = \frac{1}{n} \sum_{i=1}^n \frac{d}{dh} (y_i - h)^2 = \frac{1}{n} \sum_{i=1}^n -2(y_i - h) = -\frac{2}{n} \sum_{i=1}^n (y_i - h)$$
Steps 2 and 3: Set to 0 and solve for the minimizer, $h^*$
$$-\frac{2}{n} \sum_{i=1}^n (y_i - h) = 0 \implies \sum_{i=1}^n (y_i - h) = 0 \implies \sum_{i=1}^n y_i = \sum_{i=1}^n h = nh$$
(The sum $\sum_{i=1}^n h$ is just $h$ added $n$ times, which is $nh$.)
$$h^* = \frac{1}{n} \sum_{i=1}^n y_i$$
Step 4: Second derivative test
$$\frac{d^2}{dh^2} R_{sq}(h) = \frac{d}{dh} \left( -\frac{2}{n} \sum_{i=1}^n (y_i - h) \right) = \frac{1}{n} \sum_{i=1}^n 2 = 2 > 0$$
We already saw that $R_{sq}(h)$ is convex, i.e. that it opens upwards, so the $h^*$ we found must be a minimum, not a maximum.
So, $R_{sq}(h)$ opens upwards (convex), and we found a minimizer!
The mean minimizes mean squared error!
The problem we set out to solve was: find the $h$ that minimizes:
$$R_{sq}(h) = \frac{1}{n} \sum_{i=1}^n (y_i - h)^2$$
The answer is:
$$h^* = \text{Mean}(y_1, y_2, \dots, y_n) = \frac{1}{n} \sum_{i=1}^n y_i$$
The best constant prediction, in terms of mean squared error, is always the mean.
We call $h^*$ our optimal model parameter, for when we use:
the constant model, $H(x) = h$, and
the squared loss function, $L_{sq}(y_i, h) = (y_i - h)^2$.
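As a numerical sanity check (not a proof), the mean should achieve a strictly smaller MSE than nearby candidate predictions. The dataset values below are hypothetical:

```python
# Hypothetical commute times, in minutes.
ys = [70, 85, 60, 75, 90]

def mse(h):
    return sum((y - h) ** 2 for y in ys) / len(ys)

h_star = sum(ys) / len(ys)  # the mean: 76.0

# The mean beats nearby alternatives on MSE.
for h in [h_star - 5, h_star - 1, h_star + 1, h_star + 5]:
    assert mse(h_star) < mse(h)
print(h_star, mse(h_star))  # 76.0 114.0
```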
Aside: Notation
Another way of writing
$$h^* = \text{Mean}(y_1, y_2, \dots, y_n)$$
is
$$h^* = \underset{h}{\text{argmin}} \; R_{sq}(h),$$
which says that $h^*$ is the solution to an optimization problem.
The modeling recipe
We’ve implicitly introduced a three-step process for finding optimal model parameters (like $h^*$) that we can use for making predictions.
Choose a model.
Choose a loss function.
Minimize average loss to find optimal model parameters.
Another loss function
Last lecture, we started by computing the error for each of our predictions, but ran into the issue that some errors were positive and some were negative.
The solution was to square the errors, so that all are non-negative. The resulting loss function is called squared loss: $L_{sq}(y_i, H(x_i)) = (y_i - H(x_i))^2$.
Another loss function, which also measures how far $H(x_i)$ is from $y_i$, is absolute loss: $L_{abs}(y_i, H(x_i)) = |y_i - H(x_i)|$.
Squared loss vs. absolute loss
For the constant model, $H(x) = h$, so we can simplify our loss functions as follows:
Squared loss: $L_{sq}(y_i, h) = (y_i - h)^2$.
Absolute loss: $L_{abs}(y_i, h) = |y_i - h|$.
Consider, again, our example dataset of five commute times and a prediction $h$.
When we use squared loss, $h^*$ is the point at which the average squared loss is minimized.
When we use absolute loss, $h^*$ is the point at which the average absolute loss is minimized.
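A small grid search makes the contrast concrete: minimizing average squared loss lands at the mean, while minimizing average absolute loss lands at the median. The dataset below is hypothetical (its mean is 76 and its median is 75):

```python
# Hypothetical commute times, in minutes.
ys = [70, 85, 60, 75, 90]

def avg_loss(h, loss):
    return sum(loss(y, h) for y in ys) / len(ys)

squared = lambda y, h: (y - h) ** 2
absolute = lambda y, h: abs(y - h)

grid = [i / 10 for i in range(500, 1001)]  # candidate h values from 50.0 to 100.0
best_sq = min(grid, key=lambda h: avg_loss(h, squared))
best_abs = min(grid, key=lambda h: avg_loss(h, absolute))
print(best_sq, best_abs)  # 76.0 75.0 -- the mean and the median
```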
Mean absolute error
Suppose we collect $n$ commute times, $y_1, y_2, \dots, y_n$.
The average absolute loss, or mean absolute error (MAE), of the prediction $h$ is:
$$R_{abs}(h) = \frac{1}{n} \sum_{i=1}^n |y_i - h|$$
We’d like to find the best prediction, $h^*$.
Previously, we used calculus to find the optimal model parameter that minimized $R_{sq}(h)$, that is, when using squared loss.
Can we use calculus to minimize $R_{abs}(h)$, too?
Minimizing mean absolute error
Minimizing $R_{abs}(h)$ using calculus, again
We’d like to minimize:
$$R_{abs}(h) = \frac{1}{n} \sum_{i=1}^n |y_i - h|$$
In order to minimize $R_{abs}(h)$, we:
take its derivative with respect to $h$,
set it equal to 0,
solve for the resulting $h$, and
perform a second derivative test to ensure we found a minimum.
Step 0: The “derivative” of $|y_i - h|$
Remember that $|y_i - h|$ is a piecewise linear function of $h$:
$$|y_i - h| = \begin{cases} y_i - h & \text{if } h \le y_i \\ h - y_i & \text{if } h > y_i \end{cases}$$
So, $R_{abs}(h)$ is also a piecewise linear function of $h$.
What is $\frac{d}{dh} |y_i - h|$?
$$\frac{d}{dh} |y_i - h| = \begin{cases} -1 & \text{if } h < y_i \\ +1 & \text{if } h > y_i \end{cases} \qquad \text{(undefined at } h = y_i\text{)}$$
Step 1: The “derivative” of $R_{abs}(h)$
$$\frac{d}{dh} R_{abs}(h) = \frac{1}{n} \sum_{i=1}^n \frac{d}{dh} |y_i - h|$$
This is a sum of a bunch of $+1$s and $-1$s:
we add $+1$ whenever $h > y_i$ (points to the left of $h$), and
we add $-1$ whenever $h < y_i$ (points to the right of $h$).
So the slope of mean absolute error (MAE) at $h$ is:
$$\frac{d}{dh} R_{abs}(h) = \frac{1}{n} \left( \#(y_i < h) - \#(y_i > h) \right)$$
Steps 2 and 3: Set to 0 and solve for the minimizer, $h^*$
$$\frac{1}{n} \left( \#(y_i < h) - \#(y_i > h) \right) = 0$$
Multiply both sides by $n$:
$$\#(y_i < h) = \#(y_i > h)$$
The $h^*$ that minimizes mean absolute error is the one where the number of points to the left of $h$ equals the number of points to the right of $h$: the median!
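This slope rule can be checked with a finite difference on a hypothetical dataset:

```python
# Hypothetical commute times, in minutes.
ys = [70, 85, 60, 75, 90]
n = len(ys)

def mae(h):
    return sum(abs(y - h) for y in ys) / n

# Pick an h that is not equal to any data point, so the slope is defined.
h, eps = 72.0, 1e-6
numeric_slope = (mae(h + eps) - mae(h - eps)) / (2 * eps)  # central difference

# Slope predicted by the counting argument: (1/n) * (#left - #right).
count_slope = (sum(1 for y in ys if y < h) - sum(1 for y in ys if y > h)) / n
print(numeric_slope, count_slope)  # both about -0.2
```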
The median minimizes mean absolute error!
The new problem we set out to solve was: find the $h$ that minimizes:
$$R_{abs}(h) = \frac{1}{n} \sum_{i=1}^n |y_i - h|$$
The answer is:
$$h^* = \text{Median}(y_1, y_2, \dots, y_n)$$
This is because the median has an equal number of data points to the left of it and to the right of it.
To make a bit more sense of this result, let’s graph $R_{abs}(h)$.
Visualizing mean absolute error
Consider, again, our example dataset of five commute times.
Where are the “bends” in the graph of $R_{abs}(h)$, that is, where does its slope change?
The bends are at the data points!
Visualizing mean absolute error, with an even number of points
What if we add a sixth data point?
Is there a unique $h^*$?
No, $h^*$ is not unique!
Any value between the two middle data points (when sorted) minimizes mean absolute error!
$R_{abs}(h)$ is still convex, so any critical point is guaranteed to be a minimum.
The median minimizes mean absolute error!
The new problem we set out to solve was: find the $h$ that minimizes:
$$R_{abs}(h) = \frac{1}{n} \sum_{i=1}^n |y_i - h|$$
The answer is:
$$h^* = \text{Median}(y_1, y_2, \dots, y_n)$$
The best constant prediction, in terms of mean absolute error, is always the median.
When $n$ is odd, this answer is unique.
When $n$ is even, any number between the middle two data points (when sorted) also minimizes mean absolute error.
When $n$ is even, define the median to be the mean of the middle two data points.
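The even-$n$ case is easy to see numerically; with six hypothetical commute times, every $h$ between the two middle values attains the same minimal MAE:

```python
# Six hypothetical commute times; sorted, the middle two values are 75 and 80.
ys = [70, 85, 60, 75, 90, 80]

def mae(h):
    return sum(abs(y - h) for y in ys) / len(ys)

inside = [mae(h) for h in [75, 76.5, 78, 80]]  # all h in [75, 80]
print(inside)  # all four values are equal

# Moving outside the interval strictly increases the error.
assert mae(74) > inside[0] and mae(81) > inside[0]
```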
The modeling recipe, again
We’ve now made two full passes through our “modeling recipe.”
Choose a model. $H(x) = h$
Choose a loss function. $L_{sq}(y_i, h) = (y_i - h)^2$, $L_{abs}(y_i, h) = |y_i - h|$
Minimize average loss to find optimal model parameters. $h^* = \text{Mean}(y_1, \dots, y_n)$, $h^* = \text{Median}(y_1, \dots, y_n)$
Empirical risk minimization
The formal name for the process of minimizing average loss is empirical risk minimization.
Another name for “average loss” is empirical risk.
When we use the squared loss function, $L_{sq}(y_i, h) = (y_i - h)^2$, the corresponding empirical risk is mean squared error:
$$R_{sq}(h) = \frac{1}{n} \sum_{i=1}^n (y_i - h)^2$$
When we use the absolute loss function, $L_{abs}(y_i, h) = |y_i - h|$, the corresponding empirical risk is mean absolute error:
$$R_{abs}(h) = \frac{1}{n} \sum_{i=1}^n |y_i - h|$$
Empirical risk minimization, in general
Key idea: If $L(y_i, h)$ is any loss function, the corresponding empirical risk is:
$$R(h) = \frac{1}{n} \sum_{i=1}^n L(y_i, h)$$
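This key idea can be sketched as a small helper that averages any loss over the data; the dataset and the choice $h = 76$ below are hypothetical:

```python
# A general empirical-risk helper: average any loss L(y_i, h) over the data.
def empirical_risk(loss, ys, h):
    return sum(loss(y, h) for y in ys) / len(ys)

ys = [70, 85, 60, 75, 90]  # hypothetical commute times, in minutes

squared = lambda y, h: (y - h) ** 2
absolute = lambda y, h: abs(y - h)

print(empirical_risk(squared, ys, 76))   # mean squared error at h = 76: 114.0
print(empirical_risk(absolute, ys, 76))  # mean absolute error at h = 76: 9.2
```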
Summary, next time
$h^* = \text{Mean}(y_1, \dots, y_n)$ minimizes mean squared error, $R_{sq}(h) = \frac{1}{n} \sum_{i=1}^n (y_i - h)^2$.
$h^* = \text{Median}(y_1, \dots, y_n)$ minimizes mean absolute error, $R_{abs}(h) = \frac{1}{n} \sum_{i=1}^n |y_i - h|$.
$R_{sq}(h)$ and $R_{abs}(h)$ are examples of empirical risk, that is, average loss.
Next time: What’s the relationship between the mean and median? What is the significance of $R_{sq}(h^*)$ and $R_{abs}(h^*)$?
Practice exam problem
Define the extreme mean (EM) of a dataset to be the average of its largest and smallest values. Let $h^{EM}$ denote the extreme mean of $y_1, \dots, y_n$.
Show that for any dataset ,