Source: https://dsc40a.com/resources/lectures/lec03/lec03-blank.pdf
Comparing Loss Functions
Recap: Empirical risk minimization
Goal
We had one goal in Lecture 2: given a dataset of values from the past, find the best constant prediction to make.
Key idea: Different definitions of “best” give us different “best predictions.”
The median and the mean are each the best prediction, but under different loss functions.
The modeling recipe
In Lecture 2, we made two full passes through our “modeling recipe.”
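- Choose a model: the constant model, $H(x) = h$.
- Choose a loss function: squared loss, $L_\text{sq}(y_i, h) = (y_i - h)^2$, or absolute loss, $L_\text{abs}(y_i, h) = |y_i - h|$.
- Minimize average loss to find optimal model parameters.
Mean Squared Error: $R_\text{sq}(h) = \frac{1}{n} \sum_{i=1}^n (y_i - h)^2$
Mean Absolute Error: $R_\text{abs}(h) = \frac{1}{n} \sum_{i=1}^n |y_i - h|$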
Empirical risk minimization
- The formal name for the process of minimizing average loss is empirical risk minimization.
- Another name for “average loss” is empirical risk.
- When we use the squared loss function, $L_\text{sq}(y_i, h) = (y_i - h)^2$, the corresponding empirical risk is mean squared error.
- When we use the absolute loss function, $L_\text{abs}(y_i, h) = |y_i - h|$, the corresponding empirical risk is mean absolute error.
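As a rough numerical illustration of empirical risk minimization, the sketch below evaluates both empirical risks for a constant prediction `h` and searches a grid of candidates for the one with the smallest average loss. The function names, the small dataset, and the grid are illustrative assumptions, not part of the lecture; a brute-force search stands in for the calculus-based arguments.

```python
import numpy as np

def mean_squared_error(y, h):
    """Average squared loss of the constant prediction h on the data y."""
    return np.mean((np.asarray(y, dtype=float) - h) ** 2)

def mean_absolute_error(y, h):
    """Average absolute loss of the constant prediction h on the data y."""
    return np.mean(np.abs(np.asarray(y, dtype=float) - h))

# Small illustrative dataset (an assumption for this sketch).
y = [2, 3, 3, 4, 5]

# Brute-force search over candidate predictions on a grid with step 0.01.
candidates = np.linspace(0, 6, 601)
best_h_sq = candidates[np.argmin([mean_squared_error(y, h) for h in candidates])]
best_h_abs = candidates[np.argmin([mean_absolute_error(y, h) for h in candidates])]

print(best_h_sq, np.mean(y))     # grid minimizer of MSE next to the mean
print(best_h_abs, np.median(y))  # grid minimizer of MAE next to the median
```

On this dataset the grid minimizers should land on the mean and the median, respectively, up to the grid resolution.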
Empirical risk minimization, in general
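Key idea: If $L(y_i, h)$ is any loss function, the corresponding empirical risk is $R(h) = \frac{1}{n} \sum_{i=1}^n L(y_i, h)$.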
Choosing a Loss Function
Now what?
- We know that, for the constant model $H(x) = h$, the mean minimizes mean squared error.
- We also know that, for the constant model $H(x) = h$, the median minimizes mean absolute error.
- How does our choice of loss function impact the resulting optimal prediction?
Comparing the mean and median
- Consider our example dataset of 5 commute times.
- As of now, the median is 85 and the mean is 80.
- What if we add 200 to the largest commute time, 92?
- Now, the median is still 85 but the mean is 120.
- Key idea: The mean is quite sensitive to outliers.
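The comparison above can be reproduced in a few lines. The five commute times below are an assumption: they are illustrative values chosen only to match the quoted statistics (median 85, mean 80, largest value 92), since the dataset itself is not listed here.

```python
import numpy as np

# ASSUMPTION: illustrative values chosen to match the statistics quoted above
# (median 85, mean 80, largest value 92); not necessarily the lecture's data.
commute_times = np.array([72, 90, 61, 85, 92])
mean_before, median_before = np.mean(commute_times), np.median(commute_times)

# Add 200 to the largest commute time to create an outlier.
with_outlier = commute_times.copy()
with_outlier[np.argmax(with_outlier)] += 200
mean_after, median_after = np.mean(with_outlier), np.median(with_outlier)

print(f"before: mean={mean_before}, median={median_before}")
print(f"after:  mean={mean_after}, median={median_after}")
```

Only one value changed, yet the mean moves by 40 while the median does not move at all.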
Outliers
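Squared loss grows quadratically with a point's distance from the prediction, while absolute loss grows only linearly: a point 10 times as far away contributes 10 times the absolute loss but 100 times the squared loss.
The result is that the mean is “pulled” in the direction of outliers, relative to the median.
As a result, we say the median is robust to outliers. The mean, however, was easier to solve for.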
Balance points
Both the mean and median are “balance points” in the distribution.
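- The mean is the point $h$ where $\sum_{i=1}^n (y_i - h) = 0$, i.e. the deviations above and below it cancel out.
- The median is the point $h$ where $\#(y_i < h) = \#(y_i > h)$, i.e. equally many data points lie below and above it.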
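One way to see where these balance conditions come from is to set the derivative of each empirical risk to zero. The derivation below is a sketch under assumed notation ($h$ for the constant prediction, $y_1, \ldots, y_n$ for the data); the derivative of mean absolute error is only defined where $h$ is not equal to any data point.

```latex
% Assumed notation: h is the constant prediction, y_1, ..., y_n are the data.
\begin{align*}
\frac{d}{dh} \left[ \frac{1}{n} \sum_{i=1}^n (y_i - h)^2 \right]
  &= -\frac{2}{n} \sum_{i=1}^n (y_i - h) = 0
  &&\iff \sum_{i=1}^n (y_i - h) = 0 \\
\frac{d}{dh} \left[ \frac{1}{n} \sum_{i=1}^n |y_i - h| \right]
  &= \frac{1}{n} \Big( \#(y_i < h) - \#(y_i > h) \Big) = 0
  &&\iff \#(y_i < h) = \#(y_i > h)
\end{align*}
```

The first condition is solved by the mean, and the second by the median.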
Why stop at squared loss?
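Empirical Risk, $R(h)$ | Derivative of Empirical Risk, $\frac{d}{dh} R(h)$ | Minimizer |
---|---|---|
$\frac{1}{n} \sum_{i=1}^n |y_i - h|$ | $\frac{1}{n} \left( \#(y_i < h) - \#(y_i > h) \right)$ | median |
$\frac{1}{n} \sum_{i=1}^n (y_i - h)^2$ | $-\frac{2}{n} \sum_{i=1}^n (y_i - h)$ | mean |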
??? | ||
??? | ||
??? | ||
… | … | … |
Generalized loss
"p-norm of a vector"
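For any $p \geq 1$, define the $L_p$ loss as $L_p(y_i, h) = |y_i - h|^p$.
The corresponding empirical risk is: $R_p(h) = \frac{1}{n} \sum_{i=1}^n |y_i - h|^p$.
What value does the minimizer $h^*$ approach as $p \to \infty$?
The minimizer approaches the midrange of the dataset, $\frac{\min_i y_i + \max_i y_i}{2}$, the point halfway between the smallest and largest values.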
The midrange minimizes average loss!
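On the previous slide, we saw that as $p \to \infty$, the minimizer of $R_p(h) = \frac{1}{n} \sum_{i=1}^n |y_i - h|^p$ approaches the midrange, $\frac{\min_i y_i + \max_i y_i}{2}$.
- As $p \to \infty$, the largest term $|y_i - h|^p$ dominates the sum, so the minimizer is the $h$ that minimizes $\max_i |y_i - h|$: the midrange minimizes the “worst case” distance from any data point.
- If your measure of “good” is “not far from any one data point”, then the midrange is the best prediction.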
mean = 5, worst case distance =
median = 2.5, worst case distance =
midrange = 7.5, worst case distance =
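The sketch below fills in worst-case distances numerically and checks that the minimizer of the average $|y_i - h|^p$ loss drifts toward the midrange as $p$ grows. The dataset is hypothetical: it was chosen only so that its mean, median, and midrange match the values quoted above (5, 2.5, and 7.5). The helper names and the grid search are likewise assumptions of this sketch.

```python
import numpy as np

# HYPOTHETICAL dataset: chosen only so that mean = 5, median = 2.5, and
# midrange = 7.5, matching the values quoted above. The original data is not shown.
y = np.array([0.0, 1.0, 4.0, 15.0])

def worst_case_distance(y, h):
    """Largest distance from the prediction h to any data point."""
    return np.max(np.abs(y - h))

def lp_minimizer(y, p, grid):
    """Grid point with the smallest average |y_i - h|^p loss."""
    # risks[j] is the empirical risk of the candidate prediction grid[j].
    risks = np.mean(np.abs(y[:, None] - grid[None, :]) ** p, axis=0)
    return grid[np.argmin(risks)]

midrange = (y.min() + y.max()) / 2
for name, h in [("mean", np.mean(y)), ("median", np.median(y)), ("midrange", midrange)]:
    print(f"{name} = {h}, worst case distance = {worst_case_distance(y, h)}")

grid = np.linspace(0, 15, 1501)  # candidate predictions, step 0.01
for p in [2, 4, 10, 100]:
    print(f"p = {p}: minimizer is approximately {lp_minimizer(y, p, grid)}")
```

For this hypothetical data, the midrange has the smallest worst-case distance of the three candidates, and the grid minimizer for large $p$ sits essentially at the midrange.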
Another example loss
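Consider, for example, the 0-1 loss:
$$L_{0,1}(y_i, h) = \begin{cases} 0 & y_i = h \\ 1 & y_i \neq h \end{cases}$$
The corresponding empirical risk is:
$$R_{0,1}(h) = \frac{1}{n} \sum_{i=1}^n L_{0,1}(y_i, h)$$
This is the proportion of points NOT equal to $h$.
Minimizing empirical risk for 0-1 loss
$R_{0,1}(h)$ = proportion of points NOT equal to $h$
Minimized when $h^* = \text{Mode}(y_1, \ldots, y_n)$, the most common value.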
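A minimal sketch of this calculation follows, assuming a small illustrative dataset and a hypothetical helper `zero_one_risk`. It computes the proportion of points not equal to a candidate prediction and compares the mode against other candidates.

```python
from collections import Counter

def zero_one_risk(y, h):
    """Proportion of data points in y that are NOT equal to the prediction h."""
    return sum(yi != h for yi in y) / len(y)

# Small illustrative dataset (an assumption for this sketch).
y = [2, 3, 3, 4, 5]

# The most common value in the data.
mode = Counter(y).most_common(1)[0][0]

for h in sorted(set(y)):
    print(f"h = {h}: empirical risk = {zero_one_risk(y, h)}")
print(f"mode = {mode}")
```

Any prediction that is not one of the data values, such as the mean of this dataset, has empirical risk 1 under 0-1 loss, which is why this loss only makes sense when exact matches matter.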
Summary: Choosing a loss function
Key idea: Different loss functions lead to different best predictions, $h^*$.
Loss | Minimizer | Always Unique? | Robust to Outliers? | Differentiable? |
---|---|---|---|---|
$L_\text{sq}$ | mean | yes | no | yes |
$L_\text{abs}$ | median | no | yes | no |
$L_\infty$ | midrange | yes | no | no |
$L_{0,1}$ | mode | no | yes | no |
The optimal predictions, $h^*$, are all measures of the center of the dataset.
Center and Spread
What does it mean?
- The general form of empirical risk, for any loss function $L(y_i, h)$, is: $R(h) = \frac{1}{n} \sum_{i=1}^n L(y_i, h)$.
- As we just saw, the input $h^*$ that minimizes $R(h)$ is some measure of the center of the dataset.
- The minimum output, $R(h^*)$, represents some measure of the spread, or variation, in the dataset.
Squared loss
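- The empirical risk for squared loss, i.e. mean squared error, is: $R_\text{sq}(h) = \frac{1}{n} \sum_{i=1}^n (y_i - h)^2$.
- $R_\text{sq}(h)$ is minimized when $h^* = \text{Mean}(y_1, \ldots, y_n)$.
- Therefore, the minimum value of $R_\text{sq}(h)$ is: $R_\text{sq}(h^*) = \frac{1}{n} \sum_{i=1}^n \left( y_i - \text{Mean}(y_1, \ldots, y_n) \right)^2$.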
Variance
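- The minimum value of $R_\text{sq}(h)$ is the mean squared deviation from the mean, more commonly known as the variance: $\text{Variance}(y_1, \ldots, y_n) = \frac{1}{n} \sum_{i=1}^n \left( y_i - \text{Mean}(y_1, \ldots, y_n) \right)^2$.
- It measures the squared distance of each data point from the mean, on average.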
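As a quick check that the minimum mean squared error really is the variance, the sketch below evaluates mean squared error at the mean and compares it with NumPy's `np.var`, which by default divides by $n$ (`ddof=0`). The dataset is a small illustrative assumption.

```python
import numpy as np

# Small illustrative dataset (an assumption for this sketch).
y = np.array([2, 3, 3, 4, 5], dtype=float)

# The mean minimizes mean squared error for the constant model.
h_star = np.mean(y)

# Mean squared error evaluated at its minimizer.
min_mse = np.mean((y - h_star) ** 2)

# np.var uses ddof=0 by default, i.e. it divides by n, matching the formula above.
print(min_mse, np.var(y))
```

Moving the prediction away from the mean in either direction can only increase the mean squared error, so no other constant achieves a smaller value.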
Absolute loss
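- The empirical risk for absolute loss, i.e. mean absolute error, is: $R_\text{abs}(h) = \frac{1}{n} \sum_{i=1}^n |y_i - h|$.
- $R_\text{abs}(h)$ is minimized when $h^* = \text{Median}(y_1, \ldots, y_n)$.
- Therefore, the minimum value of $R_\text{abs}(h)$ is: $R_\text{abs}(h^*) = \frac{1}{n} \sum_{i=1}^n \left| y_i - \text{Median}(y_1, \ldots, y_n) \right|$.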
Mean absolute deviation from the median
- The minimum value of $R_\text{abs}(h)$ is the mean absolute deviation from the median: $\text{MAD from the median} = \frac{1}{n} \sum_{i=1}^n \left| y_i - \text{Median}(y_1, \ldots, y_n) \right|$.
- It measures how far each data point is from the median, on average.
- Example: What is the MAD from the median in the dataset 2,3,3,4,5?
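The example can be worked directly from the definition: find the median, take each point's absolute deviation from it, and average. The sketch below does this with NumPy; the variable names are illustrative.

```python
import numpy as np

y = np.array([2, 3, 3, 4, 5], dtype=float)

# The median minimizes mean absolute error for the constant model.
median = np.median(y)

# Absolute deviation of each data point from the median.
abs_deviations = np.abs(y - median)

# Mean absolute deviation from the median.
mad_from_median = np.mean(abs_deviations)

print(median, abs_deviations, mad_from_median)
```

Here the median is 3 and the absolute deviations are 1, 0, 0, 1, 2, so the MAD from the median is 4/5 = 0.8.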
Summary of center and spread
- Different loss functions $L(y_i, h)$ lead to different empirical risk functions $R(h)$, which are minimized at various measures of center.
- The minimum values of empirical risk, $R(h^*)$, are various measures of spread.
- There are many different ways to measure both center and spread; these are sometimes called descriptive statistics.