Source: https://dsc40a.com/resources/lectures/lec03/lec03-blank.pdf

Comparing Loss Functions


Recap: Empirical risk minimization

Goal

We had one goal in Lecture 2: given a dataset of values from the past, find the best constant prediction to make.

Key idea: Different definitions of “best” give us different “best predictions.”
The median and the mean are each the best prediction, under different definitions of “best.”

The modeling recipe

In Lecture 2, we made two full passes through our “modeling recipe.”

  1. Choose a model.
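    The constant model, $H(x) = h$.
  2. Choose a loss function.
    $L_{\text{sq}}(y_i, h) = (y_i - h)^2$, $L_{\text{abs}}(y_i, h) = |y_i - h|$
  3. Minimize average loss to find optimal model parameters.
    Mean Squared Error: $R_{\text{sq}}(h) = \frac{1}{n} \sum_{i=1}^n (y_i - h)^2$
    Mean Absolute Error: $R_{\text{abs}}(h) = \frac{1}{n} \sum_{i=1}^n |y_i - h|$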

Empirical risk minimization

  • The formal name for the process of minimizing average loss is empirical risk minimization.
  • Another name for “average loss” is empirical risk.
  • When we use the squared loss function $L_{\text{sq}}(y_i, h) = (y_i - h)^2$, the corresponding empirical risk is mean squared error.
  • When we use the absolute loss function $L_{\text{abs}}(y_i, h) = |y_i - h|$, the corresponding empirical risk is mean absolute error.

Empirical risk minimization, in general

Key idea: If $L(y_i, h)$ is any loss function, the corresponding empirical risk is

$$R(h) = \frac{1}{n} \sum_{i=1}^n L(y_i, h)$$
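To make the definition concrete, here is a minimal Python sketch of empirical risk as "average loss over the dataset." The function names (`squared_loss`, `absolute_loss`, `empirical_risk`) and the five commute times are assumptions for illustration; the data values are chosen to be consistent with the median of 85 and mean of 80 quoted later, not taken from the slides.

```python
def squared_loss(y_i, h):
    """Squared loss for a single data point."""
    return (y_i - h) ** 2


def absolute_loss(y_i, h):
    """Absolute loss for a single data point."""
    return abs(y_i - h)


def empirical_risk(loss, ys, h):
    """Average loss of the constant prediction h over the dataset ys."""
    return sum(loss(y_i, h) for y_i in ys) / len(ys)


# Hypothetical commute times (assumed values, not from the slides).
commute_times = [72, 90, 61, 85, 92]

# Mean squared error of the prediction h = 80.
mse_at_80 = empirical_risk(squared_loss, commute_times, 80)

# Mean absolute error of the prediction h = 85.
mae_at_85 = empirical_risk(absolute_loss, commute_times, 85)

print(mse_at_80, mae_at_85)
```

Passing the loss in as a function means the same `empirical_risk` works for any loss we define later.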


Choosing a Loss Function

Now what?

  • We know that, for the constant model $H(x) = h$, the mean minimizes mean squared error.
  • We also know that, for the constant model $H(x) = h$, the median minimizes mean absolute error.
  • How does our choice of loss function impact the resulting optimal prediction?

Comparing the mean and median

  • Consider our example dataset of 5 commute times.
  • As of now, the median is 85 and the mean is 80.
  • What if we add 200 to the largest commute time, 92?
  • Now, the median is still 85, but the mean is 120.
  • Key idea: The mean is quite sensitive to outliers.
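The comparison above can be checked with a short sketch. The five commute times below are an assumption: they are picked so that the median is 85, the mean is 80, and the largest value is 92, matching the numbers quoted above.

```python
from statistics import mean, median

# Hypothetical commute times (assumed values consistent with the text).
commute_times = [72, 90, 61, 85, 92]

original_mean = mean(commute_times)
original_median = median(commute_times)

# Add 200 to the largest commute time to create an outlier.
largest = max(commute_times)
with_outlier = [y + 200 if y == largest else y for y in commute_times]

outlier_mean = mean(with_outlier)
outlier_median = median(with_outlier)

print("before:", original_mean, original_median)
print("after: ", outlier_mean, outlier_median)
```

Only the mean moves, because the median depends on the ordering of the values rather than on how far away the largest one is.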

Outliers

When one data point's absolute deviation $|y_i - h|$ is 10 times as big as another's, its squared deviation $(y_i - h)^2$ is 100 times as big.
The result is that the mean is “pulled” in the direction of outliers, relative to the median.
As a result, we say the median is robust to outliers. But the mean was easier to solve for.

Balance points

Both the mean and median are “balance points” in the distribution.

  • The mean is the point where $\sum_{i=1}^n (y_i - h) = 0$.
  • The median is the point where $\#(y_i < h) = \#(y_i > h)$.
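Both balance conditions are easy to verify numerically. The sketch below uses a hypothetical dataset of five commute times (assumed values) and checks that deviations from the mean sum to zero, while the median has equally many points on each side.

```python
from statistics import mean, median

# Hypothetical commute times (assumed values, not from the slides).
commute_times = [72, 90, 61, 85, 92]

# Mean: the signed deviations cancel out.
h_mean = mean(commute_times)
sum_of_deviations = sum(y - h_mean for y in commute_times)

# Median: the number of points below equals the number of points above.
h_median = median(commute_times)
num_below = sum(1 for y in commute_times if y < h_median)
num_above = sum(1 for y in commute_times if y > h_median)

print(sum_of_deviations, num_below, num_above)
```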

Why stop at squared loss?

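| Empirical Risk | Derivative of Empirical Risk | Minimizer |
| --- | --- | --- |
| $\frac{1}{n} \sum_{i=1}^n \lvert y_i - h \rvert$ | $\frac{1}{n} \left( \#(y_i < h) - \#(y_i > h) \right)$ | median |
| $\frac{1}{n} \sum_{i=1}^n (y_i - h)^2$ | $-\frac{2}{n} \sum_{i=1}^n (y_i - h)$ | mean |
| higher powers of $\lvert y_i - h \rvert$ | ??? | ??? |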

Generalized loss

"p-norm of a vector"





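For any $p \geq 1$, define the $L_p$ loss as follows:

$$L_p(y_i, h) = |y_i - h|^p$$

The corresponding empirical risk is:

$$R_p(h) = \frac{1}{n} \sum_{i=1}^n |y_i - h|^p$$

What value does $h^*$ approach as $p \to \infty$?

The $x$-axis is $p$.
The $y$-axis is $h^*$, the optimal constant prediction for $L_p$ loss:

$$h^* = \underset{h}{\text{argmin}} \; \frac{1}{n} \sum_{i=1}^n |y_i - h|^p$$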
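One way to explore this question without calculus is to minimize the average $|y_i - h|^p$ numerically for several values of $p$. The sketch below uses a plain grid search over candidate values of $h$; the helper names (`lp_risk`, `lp_minimizer`), the grid resolution, and the dataset are all assumptions for illustration, and a grid search is only an approximation of the true minimizer.

```python
def lp_risk(ys, h, p):
    """Average of |y_i - h|**p over the dataset."""
    return sum(abs(y - h) ** p for y in ys) / len(ys)


def lp_minimizer(ys, p, steps=3100):
    """Approximate the minimizer of lp_risk by searching an evenly spaced grid.

    The minimizer always lies between min(ys) and max(ys), so the grid
    only needs to cover that interval.
    """
    lo, hi = min(ys), max(ys)
    grid = [lo + (hi - lo) * k / steps for k in range(steps + 1)]
    return min(grid, key=lambda h: lp_risk(ys, h, p))


# Hypothetical commute times (assumed values, not from the slides).
commute_times = [72, 90, 61, 85, 92]
midrange = (min(commute_times) + max(commute_times)) / 2

minimizers = {p: lp_minimizer(commute_times, p) for p in (1, 2, 4, 10, 50)}
for p, h_star in minimizers.items():
    print(f"p = {p:>2}: h* is approximately {h_star:.2f}")
print("midrange:", midrange)
```

For this dataset, the printed minimizers start near the median at $p = 1$, pass through the mean at $p = 2$, and drift toward the midrange as $p$ grows.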

The midrange minimizes average $L_\infty$ loss!

On the previous slide, we saw that as $p \to \infty$, the minimizer of mean $L_p$ loss approached the midpoint of the minimum and maximum values in the dataset, or the midrange.

  • As $p \to \infty$, the minimizer of $R_p(h)$ minimizes the “worst case” distance from any data point.
  • If your measure of “good” is “not far from any one data point”, then the midrange is the best prediction.

mean = 5, worst case distance =
median = 2.5, worst case distance =
midrange = 7.5, worst case distance =
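The "worst case" distance of a prediction $h$ is the largest $|y_i - h|$ in the dataset. The sketch below computes it for the mean, median, and midrange. Since the dataset behind the numbers above is not shown here, it uses a hypothetical dataset of five commute times (assumed values), and `worst_case_distance` is an illustrative helper name.

```python
from statistics import mean, median


def worst_case_distance(ys, h):
    """Largest absolute distance between the prediction h and any data point."""
    return max(abs(y - h) for y in ys)


# Hypothetical commute times (assumed values, not the slide's dataset).
commute_times = [72, 90, 61, 85, 92]

candidates = {
    "mean": mean(commute_times),
    "median": median(commute_times),
    "midrange": (min(commute_times) + max(commute_times)) / 2,
}

worst = {name: worst_case_distance(commute_times, h) for name, h in candidates.items()}
for name, distance in worst.items():
    print(f"{name}: h = {candidates[name]}, worst case distance = {distance}")
```

The midrange sits exactly halfway between the smallest and largest values, so no other prediction can have a smaller worst case distance.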

Another example loss

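Consider, for example, the 0-1 loss:

$$L_{0,1}(y_i, h) = \begin{cases} 0 & y_i = h \\ 1 & y_i \neq h \end{cases}$$

The corresponding empirical risk is:

$$R_{0,1}(h) = \frac{1}{n} \sum_{i=1}^n L_{0,1}(y_i, h)$$

This is the proportion of points NOT equal to $h$.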

Minimizing empirical risk for 0-1 loss


$R_{0,1}(h)$ = proportion of points NOT equal to $h$.
It is minimized when $h = y_i$ as often as possible, so we set $h^*$ to the mode, the most common value.
The mode is not usually unique!
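A small sketch shows both points: the mode minimizes the 0-1 empirical risk, and there can be more than one mode. The function names are assumptions for illustration. The first dataset is the 2, 3, 3, 4, 5 example used later in these slides; the second, with a tie, is hypothetical.

```python
from statistics import multimode


def zero_one_loss(y_i, h):
    """0 if the prediction matches the data point exactly, 1 otherwise."""
    return 0 if y_i == h else 1


def zero_one_risk(ys, h):
    """Proportion of data points NOT equal to h."""
    return sum(zero_one_loss(y_i, h) for y_i in ys) / len(ys)


ys = [2, 3, 3, 4, 5]
risk_at_mode = zero_one_risk(ys, 3)   # 3 of the 5 points differ from 3
risk_elsewhere = zero_one_risk(ys, 2)  # 4 of the 5 points differ from 2

# Hypothetical dataset with a tie: both 1 and 2 appear twice.
tied = [1, 1, 2, 2, 3]
modes = multimode(tied)

print(risk_at_mode, risk_elsewhere, modes)
```

`statistics.multimode` returns every most-common value, which makes the non-uniqueness visible.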

Summary: Choosing a loss function

Key idea: Different loss functions lead to different best predictions, $h^*$!

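| Loss | Minimizer | Always Unique? | Robust to Outliers? | Differentiable? |
| --- | --- | --- | --- | --- |
| $L_{\text{sq}}$ | mean | yes | no | yes |
| $L_{\text{abs}}$ | median | no | yes | no |
| $L_\infty$ | midrange | yes | no | no |
| $L_{0,1}$ | mode | no | yes | no |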

The optimal predictions, $h^*$, are all summary statistics that measure the center of the dataset in different ways.


Center and Spread

What does it mean?

  • The general form of empirical risk, for any loss function $L(y_i, h)$, is: $R(h) = \frac{1}{n} \sum_{i=1}^n L(y_i, h)$
  • As we just saw, the input $h^*$ that minimizes $R(h)$ is some measure of the center of the dataset.
  • The minimum output, $R(h^*)$, represents some measure of the spread, or variation, in the dataset.

Squared loss

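  • The empirical risk for squared loss, i.e. mean squared error, is: $R_{\text{sq}}(h) = \frac{1}{n} \sum_{i=1}^n (y_i - h)^2$
  • $R_{\text{sq}}(h)$ is minimized when $h = h^* = \text{Mean}(y_1, y_2, \ldots, y_n)$.
  • Therefore, the minimum value of $R_{\text{sq}}(h)$ is: $R_{\text{sq}}(h^*) = \frac{1}{n} \sum_{i=1}^n \left( y_i - \text{Mean}(y_1, y_2, \ldots, y_n) \right)^2$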

Variance

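  • The minimum value of $R_{\text{sq}}(h)$ is the mean squared deviation from the mean, more commonly known as the variance.

    $\text{Variance}(y_1, y_2, \ldots, y_n) = \frac{1}{n} \sum_{i=1}^n \left( y_i - \text{Mean}(y_1, y_2, \ldots, y_n) \right)^2$
  • It measures the squared distance of each data point from the mean, on average.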
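As a quick check, the sketch below evaluates mean squared error at the mean and compares it to the population variance from Python's `statistics` module, which divides by $n$. The dataset (five commute times) and the helper name `mean_squared_error` are assumptions for illustration.

```python
from statistics import mean, pvariance


def mean_squared_error(ys, h):
    """Empirical risk for squared loss at the constant prediction h."""
    return sum((y - h) ** 2 for y in ys) / len(ys)


# Hypothetical commute times (assumed values, not from the slides).
commute_times = [72, 90, 61, 85, 92]

h_star = mean(commute_times)
min_mse = mean_squared_error(commute_times, h_star)

# pvariance divides by n, matching the definition of variance used here.
variance = pvariance(commute_times)

print(min_mse, variance)
```

Note that `statistics.variance` (without the `p`) divides by $n - 1$ instead, so it would not match the minimum value of mean squared error.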

Absolute loss

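  • The empirical risk for absolute loss, i.e. mean absolute error, is:
    $R_{\text{abs}}(h) = \frac{1}{n} \sum_{i=1}^n |y_i - h|$
  • $R_{\text{abs}}(h)$ is minimized when $h = h^* = \text{Median}(y_1, y_2, \ldots, y_n)$.
  • Therefore, the minimum value of $R_{\text{abs}}(h)$ is: $R_{\text{abs}}(h^*) = \frac{1}{n} \sum_{i=1}^n \left| y_i - \text{Median}(y_1, y_2, \ldots, y_n) \right|$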

Mean absolute deviation from the median

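  • The minimum value of $R_{\text{abs}}(h)$ is the mean absolute deviation from the median.
    $\text{MAD from the median} = \frac{1}{n} \sum_{i=1}^n \left| y_i - \text{Median}(y_1, y_2, \ldots, y_n) \right|$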
  • It measures how far each data point is from the median, on average.
  • Example: What is the MAD from the median in the dataset 2,3,3,4,5?
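The example can be worked by hand: the median of 2, 3, 3, 4, 5 is 3, the absolute deviations are 1, 0, 0, 1, 2, and their average is 4/5. A minimal sketch that does the same computation follows; `mad_from_median` is an illustrative helper name, not a standard library function.

```python
from statistics import median


def mad_from_median(ys):
    """Mean absolute deviation of the data from its median."""
    m = median(ys)
    return sum(abs(y - m) for y in ys) / len(ys)


data = [2, 3, 3, 4, 5]
mad = mad_from_median(data)
print(mad)
```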

Summary of center and spread

  • Different loss functions lead to different empirical risk functions $R(h)$, which are minimized at various measures of center.
  • The minimum values of empirical risk, $R(h^*)$, are various measures of spread.
  • There are many different ways to measure both center and spread; these are sometimes called descriptive statistics.