Source: https://dsc40a.com/resources/lectures/lec03/lec03-blank.pdf

Comparing Loss Functions

Recap: Empirical risk minimization

Goal

We had one goal in Lecture 2: given a dataset of values from the past, find the best constant prediction to make.

Key idea: Different definitions of “best” give us different “best predictions.”
The median and the mean are each the best prediction, under different definitions of “best.”

The modeling recipe

In Lecture 2, we made two full passes through our “modeling recipe.”

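1. Choose a model: the constant model, $H(x) = h$.
2. Choose a loss function: squared loss, $L_{\text{sq}}(y_i, h) = (y_i - h)^2$, or absolute loss, $L_{\text{abs}}(y_i, h) = |y_i - h|$.
3. Minimize average loss to find optimal model parameters: minimizing mean squared error gives the mean, and minimizing mean absolute error gives the median.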

Empirical risk minimization

The formal name for the process of minimizing average loss is empirical risk minimization.
Another name for “average loss” is empirical risk.
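When we use the squared loss function $L_{\text{sq}}(y_i, h) = (y_i - h)^2$, the corresponding empirical risk is mean squared error, $R_{\text{sq}}(h) = \frac{1}{n} \sum_{i=1}^n (y_i - h)^2$.
When we use the absolute loss function $L_{\text{abs}}(y_i, h) = |y_i - h|$, the corresponding empirical risk is mean absolute error, $R_{\text{abs}}(h) = \frac{1}{n} \sum_{i=1}^n |y_i - h|$.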

Empirical risk minimization, in general

Key idea: If $L(y_i, h)$ is any loss function, the corresponding empirical risk is
$$R(h) = \frac{1}{n} \sum_{i=1}^n L(y_i, h)$$
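The definition above translates directly into code. The following is a minimal sketch, not the course's implementation: the function names, the dataset, and the brute-force grid search are all assumptions made for illustration. It computes the empirical risk for an arbitrary loss and checks that a grid search lands on the mean for squared loss and on the median for absolute loss.

```python
from statistics import mean, median

def empirical_risk(loss, data, h):
    """Average loss of the constant prediction h over the dataset."""
    return sum(loss(y, h) for y in data) / len(data)

def squared_loss(y, h):
    return (y - h) ** 2

def absolute_loss(y, h):
    return abs(y - h)

# Hypothetical data, chosen only for illustration.
data = [1, 2, 3, 4, 10]

# Brute-force search over candidate predictions 0.00, 0.01, ..., 10.00.
candidates = [i / 100 for i in range(0, 1001)]
best_sq = min(candidates, key=lambda h: empirical_risk(squared_loss, data, h))
best_abs = min(candidates, key=lambda h: empirical_risk(absolute_loss, data, h))

print(best_sq, mean(data))     # grid minimizer of mean squared error vs. the mean
print(best_abs, median(data))  # grid minimizer of mean absolute error vs. the median
```

A grid search is used here only because it makes no assumptions about the loss; for squared and absolute loss the minimizers are known in closed form.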

Choosing a Loss Function

Now what?

We know that, for the constant model $H(x) = h$, the mean minimizes mean squared error.
We also know that, for the constant model $H(x) = h$, the median minimizes mean absolute error.
How does our choice of loss function impact the resulting optimal prediction?

Comparing the mean and median

Consider our example dataset of 5 commute times.
As of now, the median is 85 and the mean is 80.
What if we add 200 to the largest commute time, 92?
Now, the median is still 85, but the mean is 120.
Key idea: The mean is quite sensitive to outliers.
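The comparison above can be reproduced in a few lines. The five commute times are not listed in these notes, so the values below are an assumption: they were picked only because they match the stated mean (80), median (85), and largest value (92).

```python
from statistics import mean, median

# Assumed values: chosen to match the stated mean, median, and maximum.
commute_times = [61, 72, 85, 90, 92]
print(mean(commute_times), median(commute_times))

# Add 200 to the largest commute time (92 becomes 292).
with_outlier = sorted(commute_times)
with_outlier[-1] += 200
print(mean(with_outlier), median(with_outlier))
```

Only the mean moves, because every value contributes to the sum, while the median depends only on the ordering of the values.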

Outliers

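Below, one point's absolute deviation $|y_i - h|$ is 10 times as big as another's, but its squared deviation $(y_i - h)^2$ is 100 times as big.
Because squaring magnifies large deviations, the mean is “pulled” in the direction of outliers, relative to the median.
As a result, we say the median is robust to outliers. The mean, however, was easier to solve for.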

Balance points

Both the mean and median are “balance points” in the distribution.

The mean is the point where $\sum_{i=1}^n (y_i - h) = 0$.
The median is the point where $\#(y_i < h) = \#(y_i > h)$.
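Both balance conditions are easy to check numerically. This is a small sketch on an assumed dataset (the values are hypothetical, not taken from the slides): the deviations from the mean should sum to zero, and the median should have as many points below it as above it.

```python
from statistics import mean, median

# Hypothetical dataset used only for illustration.
data = [61, 72, 85, 90, 92]

# Mean: the signed deviations cancel out.
m = mean(data)
deviations_sum = sum(y - m for y in data)

# Median: equal numbers of points on each side.
med = median(data)
below = sum(1 for y in data if y < med)
above = sum(1 for y in data if y > med)

print(deviations_sum, below, above)
```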

Why stop at squared loss?

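Empirical Risk	Derivative of Empirical Risk	Minimizer
$\frac{1}{n} \sum_{i=1}^n |y_i - h|$	$\frac{1}{n} \left( \#(y_i < h) - \#(y_i > h) \right)$	median
$\frac{1}{n} \sum_{i=1}^n (y_i - h)^2$	$-\frac{2}{n} \sum_{i=1}^n (y_i - h)$	mean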
		???
		???
		???
…	…	…

Generalized loss

"p-norm of a vector"

For any $p \geq 1$, define the $L_p$ loss as follows:
$$L_p(y_i, h) = |y_i - h|^p$$
The corresponding empirical risk is:
$$R_p(h) = \frac{1}{n} \sum_{i=1}^n |y_i - h|^p$$

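What value does $h^*$ approach as $p \to \infty$?

The $x$-axis is $p$.
The $y$-axis is $h^*$, the optimal constant prediction for $L_p$ loss:
$$h^* = \underset{h}{\text{argmin}} \; \frac{1}{n} \sum_{i=1}^n |y_i - h|^p$$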
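The plot described above can be approximated numerically. The sketch below is an illustration under stated assumptions: the dataset is hypothetical, and ternary search is a technique swapped in here because the empirical risk for $|y_i - h|^p$ is convex in $h$ when $p \geq 1$; it is not necessarily how the original plot was produced.

```python
def lp_risk(data, h, p):
    """Empirical risk for the loss |y - h|^p."""
    return sum(abs(y - h) ** p for y in data) / len(data)

def minimize_lp_risk(data, p, iterations=100):
    """Ternary search for the minimizer of a convex risk on [min, max]."""
    lo, hi = min(data), max(data)
    for _ in range(iterations):
        m1 = lo + (hi - lo) / 3
        m2 = hi - (hi - lo) / 3
        if lp_risk(data, m1, p) < lp_risk(data, m2, p):
            hi = m2  # the minimizer cannot lie to the right of m2
        else:
            lo = m1  # the minimizer cannot lie to the left of m1
    return (lo + hi) / 2

# Hypothetical dataset: median 85, mean 80, midrange (61 + 92) / 2 = 76.5.
data = [61, 72, 85, 90, 92]

for p in [1, 2, 4, 10, 100]:
    print(p, round(minimize_lp_risk(data, p), 3))
```

As $p$ grows, the printed minimizers move from the median, through the mean, toward the midpoint of the smallest and largest values.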

The midrange minimizes average $L_\infty$ loss!

On the previous slide, we saw that as $p \to \infty$, the minimizer of mean $L_p$ loss approached the midpoint of the minimum and maximum values in the dataset, or the midrange.

As $p \to \infty$, the minimizer $h^*$ of $R_p(h)$ minimizes the “worst case” distance from any data point, $\max_i |y_i - h|$.
If your measure of “good” is “not far from any one data point”, then the midrange is the best prediction.

mean = 5, worst case distance =
median = 2.5, worst case distance =
midrange = 7.5, worst case distance =
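To see the “worst case” comparison concretely, here is a short sketch. The dataset is hypothetical and is not the one behind the numbers above, and the helper names are assumptions made for this example. It computes the largest distance from each candidate prediction to any data point.

```python
from statistics import mean, median

def worst_case_distance(data, h):
    """Largest absolute distance from h to any data point."""
    return max(abs(y - h) for y in data)

def midrange(data):
    """Midpoint of the smallest and largest values."""
    return (min(data) + max(data)) / 2

# Hypothetical dataset used only for illustration.
data = [61, 72, 85, 90, 92]

for name, h in [("mean", mean(data)),
                ("median", median(data)),
                ("midrange", midrange(data))]:
    print(name, h, worst_case_distance(data, h))
```

On this dataset the midrange has the smallest worst case distance of the three, which is what minimizing $\max_i |y_i - h|$ predicts.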

Another example loss

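Consider, for example, the 0-1 loss:
$$L_{0,1}(y_i, h) = \begin{cases} 0 & y_i = h \\ 1 & y_i \neq h \end{cases}$$
The corresponding empirical risk is:
$$R_{0,1}(h) = \frac{1}{n} \sum_{i=1}^n L_{0,1}(y_i, h) = \text{proportion of points NOT equal to } h$$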

Minimizing empirical risk for 0-1 loss

$R_{0,1}(h)$ = proportion of points NOT equal to $h$.
It is minimized when $h = y_i$ as often as possible, so set $h^* = \text{Mode}(y_1, \dots, y_n)$, the most common value.
The mode is not usually unique!
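A minimal sketch of this argument, with hypothetical helper names and datasets: the empirical risk under 0-1 loss is the fraction of points that differ from the prediction, so the most common values minimize it, and ties produce more than one minimizer.

```python
from collections import Counter

def zero_one_risk(data, h):
    """Proportion of data points not equal to the prediction h."""
    return sum(1 for y in data if y != h) / len(data)

def modes(data):
    """All values that occur most often (there may be several)."""
    counts = Counter(data)
    top = max(counts.values())
    return [value for value, count in counts.items() if count == top]

data = [2, 3, 3, 4, 5]
print(modes(data), zero_one_risk(data, 3))

# With ties, every mode achieves the same minimum risk.
tied = [1, 1, 2, 2, 3]
print(modes(tied), [zero_one_risk(tied, h) for h in modes(tied)])
```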

Summary: Choosing a loss function

Key idea: Different loss functions lead to different best predictions, $h^*$!

Minimizer	Always Unique?	Robust to Outliers?	Differentiable?
mean	yes	no	yes
median	no	yes	no
midrange	yes	no	no
mode	no	yes	no

The optimal predictions, $h^*$, are all summary statistics that measure the center of the dataset in different ways.

Center and Spread

What does it mean?

The general form of empirical risk, for any loss function $L(y_i, h)$, is:
$$R(h) = \frac{1}{n} \sum_{i=1}^n L(y_i, h)$$
As we just saw, the input $h^*$ that minimizes $R(h)$ is some measure of the center of the dataset.
The minimum output, $R(h^*)$, represents some measure of the spread, or variation, in the dataset.

Squared loss

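The empirical risk for squared loss, i.e. mean squared error, is:
$$R_{\text{sq}}(h) = \frac{1}{n} \sum_{i=1}^n (y_i - h)^2$$
$R_{\text{sq}}(h)$ is minimized when $h^* = \text{Mean}(y_1, \dots, y_n)$.
Therefore, the minimum value of $R_{\text{sq}}(h)$ is:
$$R_{\text{sq}}(h^*) = \frac{1}{n} \sum_{i=1}^n \left( y_i - \text{Mean}(y_1, \dots, y_n) \right)^2$$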

Variance

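The minimum value of $R_{\text{sq}}(h)$ is the mean squared deviation from the mean, more commonly known as the variance.
$$\text{Variance}(y_1, \dots, y_n) = \frac{1}{n} \sum_{i=1}^n \left( y_i - \text{Mean}(y_1, \dots, y_n) \right)^2$$
It measures the squared distance of each data point from the mean, on average.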
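The link between minimum mean squared error and variance can be checked directly. This is a sketch on a hypothetical dataset. It compares the mean squared error at the mean against `statistics.pvariance`, which computes the population variance (dividing by $n$, matching the definition used here).

```python
from statistics import mean, pvariance

def mean_squared_error(data, h):
    """Empirical risk for squared loss at the constant prediction h."""
    return sum((y - h) ** 2 for y in data) / len(data)

# Hypothetical dataset used only for illustration.
data = [61, 72, 85, 90, 92]

h_star = mean(data)
min_mse = mean_squared_error(data, h_star)

print(min_mse, pvariance(data))  # the two values should agree
```

Any other prediction gives a larger mean squared error, for example `mean_squared_error(data, h_star + 1)`.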

Absolute loss

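The empirical risk for absolute loss, i.e. mean absolute error, is:
$$R_{\text{abs}}(h) = \frac{1}{n} \sum_{i=1}^n |y_i - h|$$
$R_{\text{abs}}(h)$ is minimized when $h^* = \text{Median}(y_1, \dots, y_n)$.
Therefore, the minimum value of $R_{\text{abs}}(h)$ is:
$$R_{\text{abs}}(h^*) = \frac{1}{n} \sum_{i=1}^n \left| y_i - \text{Median}(y_1, \dots, y_n) \right|$$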

Mean absolute deviation from the median

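The minimum value of $R_{\text{abs}}(h)$ is the mean absolute deviation from the median.
$$\text{MAD from the median} = \frac{1}{n} \sum_{i=1}^n \left| y_i - \text{Median}(y_1, \dots, y_n) \right|$$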
It measures how far each data point is from the median, on average.
Example: What is the MAD from the median in the dataset 2,3,3,4,5?
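One way to work the example is to compute it directly. The function name below is an assumption made for this sketch; the dataset is the one from the example.

```python
from statistics import median

def mad_from_median(data):
    """Mean absolute deviation of the data from its median."""
    med = median(data)
    return sum(abs(y - med) for y in data) / len(data)

data = [2, 3, 3, 4, 5]

# The median is 3, so the absolute deviations are 1, 0, 0, 1, 2.
print(mad_from_median(data))  # 4 / 5 = 0.8
```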

Summary of center and spread

Different loss functions lead to different empirical risk functions $R(h)$, which are minimized at various measures of center.
The minimum values of empirical risk, $R(h^*)$, are various measures of spread.
There are many different ways to measure both center and spread; these are sometimes called descriptive statistics.

DSC 40A Lecture 3

Comparing Loss Functions

Recap: Empirical risk minimization

Goal

The modeling recipe

Empirical risk minimization

Empirical risk minimization, in general

Choosing a Loss Function

Now what?

Comparing the mean and median

Outliers

Balance points

Why stop at squared loss?

Generalized loss

What value does approach as ?

The midrange minimizes average loss!

Another example loss

Minimizing empirical risk for 0-1 loss

Summary: Choosing a loss function

Center and Spread

What does it mean?

Squared loss

Variance

Absolute loss

Mean absolute deviation from the median

Summary of center and spread

Graph View

Table of Contents

Backlinks