Source: https://dsc40a.com/resources/lectures/lec01/lec01-blank.pdf

Introduction to Modeling

What is this class about?

Theoretical Foundations of Data Science I
Mathematical Foundations of Machine Learning

Why do we need to study theoretical foundations?
To understand and improve the tools we use.

Machine learning is about automatically learning patterns from data.
Humans are good at understanding handwriting - but how do we get computers to understand handwriting?

https://dsc40a.com
https://practice.dsc40a.com

Commute time example

Goal: Predict your commute time.
That is, predict how long it’ll take to get to school.
How can we do this? Learn from data
What will we need to assume? Data in the future looks like data from the past

A model is a set of assumptions about how data were generated.
(“Data” is plural; the singular is “datum.”)
“All models are wrong, but some are useful.” — George Box

Possible Models Of Commuting Time

  • Simple linear regression model - Assumes number of minutes it takes for you to get to school is a linear function of the time that you leave home.
  • Constant model - Assumes all commute times are constant no matter what time you leave during the day.

Notation

x: “input”, “independent variable”, or “feature”
y: “response”, “dependent variable”, or “target”
We use x to predict y.
The ith observation is denoted (x_i, y_i).

Hypothesis functions

A hypothesis function, H, takes in an x as input and returns a predicted y, H(x).

Parameters

Parameters define the relationship between the input and output of a hypothesis function.

The constant model, H(x) = h, has one parameter: h.

  • Use H(x) = h to make predictions (this is the hypothesis function for the constant model).

The simple linear regression model, H(x) = w0 + w1 * x, has two parameters: w0 and w1.

  • w0: intercept
  • w1: slope
  • Changing the parameters changes how predictions are made.
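As a sketch, the two models can be written as plain Python functions (the parameter values below are made up just to show how predictions change):

```python
def constant_model(x, h):
    """Constant model: predicts h no matter what the input x is."""
    return h

def linear_model(x, w0, w1):
    """Simple linear regression model: prediction is a linear function of x."""
    return w0 + w1 * x

# Hypothetical parameters: leaving at 8:00 am (x = 8.0) vs 9:00 am (x = 9.0).
print(constant_model(8.0, 80))     # 80 -- same prediction for every x
print(constant_model(9.0, 80))     # 80
print(linear_model(8.0, 50, 5))    # 50 + 5 * 8.0 = 90.0
print(linear_model(9.0, 50, 5))    # 50 + 5 * 9.0 = 95.0
```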

The constant model

Where should we draw the horizontal line?

Transitioning from scatter plot to histogram

  • Horizontal line to vertical line

A concrete example

Smaller dataset of just five historical commute times in minutes. How can you come up with a prediction for your future commute time?
Possible Strategies: Average, Median, Midrange

Summary Statistics

Summarize a collection of numbers with a single number

The cost of making predictions

Loss function

A loss function quantifies how bad a prediction is for a single data point.

  • If our prediction is close to the actual value, we should have a low loss.
  • If our prediction is far from the actual value, we should have high loss.

A good starting point is error, which is the difference between actual and predicted values.

Suppose my commute actually takes 80 minutes.

  • If I predict 75 minutes: error = 80 − 75 = 5 minutes.
  • If I predict 72 minutes: error = 80 − 72 = 8 minutes.
  • If I predict 100 minutes: error = 80 − 100 = −20 minutes.

Issue: Some errors are negative and others are positive, so they can cancel out when averaged.
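A quick numeric check of this sign problem, using the 80-minute commute above:

```python
actual = 80  # my commute actually took 80 minutes

errors = [actual - predicted for predicted in (75, 72, 100)]
print(errors)  # [5, 8, -20]

# Errors of opposite sign partially cancel: the average error (-7/3, about
# -2.33 minutes) looks far better than any individual prediction was.
print(sum(errors) / len(errors))
```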

Squared loss

One loss function is squared loss, L_sq, which computes (actual − predicted)²:

    L_sq(y_i, H(x_i)) = (y_i − H(x_i))²

Note that for the constant model, H(x) = h, so we can simplify this to:

    L_sq(y_i, h) = (y_i − h)²

Note:
Squared loss is not the only loss function that exists. It is popular because it is differentiable everywhere. Absolute loss, |y_i − h|, another loss function, is not differentiable where the error is 0.
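The two loss functions side by side, as a sketch in Python:

```python
def squared_loss(actual, predicted):
    """L_sq: smooth and differentiable everywhere."""
    return (actual - predicted) ** 2

def absolute_loss(actual, predicted):
    """L_abs: not differentiable where actual == predicted."""
    return abs(actual - predicted)

print(squared_loss(80, 75))    # 25
print(absolute_loss(80, 75))   # 5
print(squared_loss(80, 100))   # 400 -- squared loss punishes big misses heavily
print(absolute_loss(80, 100))  # 20
```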

A concrete example, revisited

Suppose we predict the median commute time. What is the squared loss of that prediction for each data point?

Goal: Find a single number that describes the loss of the prediction on my entire dataset.

Averaging squared losses

We’d like a single number that describes the quality of our predictions across our entire dataset. One way to compute this is as the average of the squared losses.

  • For the median: compute the average of the five squared losses.
  • For the mean, 80: compute the average of the five squared losses.

80 (the mean) is the better prediction, since its average squared loss is lower.
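As a sketch with a hypothetical dataset of five commute times (the specific values here are invented for illustration, not the lecture's actual data), we can compare the two candidate predictions:

```python
commutes = [61, 72, 85, 90, 92]  # hypothetical commute times, in minutes

def avg_squared_loss(h, data):
    """Average of the squared losses of prediction h over the dataset."""
    return sum((y - h) ** 2 for y in data) / len(data)

median_pred = 85                           # middle value of the sorted data
mean_pred = sum(commutes) / len(commutes)  # 80.0 for this dataset

print(avg_squared_loss(median_pred, commutes))  # 163.8
print(avg_squared_loss(mean_pred, commutes))    # 138.8 -- the mean wins
```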

L: loss for one data point
R: average loss over all data points

Mean squared error


  • Another term for average squared loss is mean squared error (MSE).
  • The mean squared error on our smaller dataset of five commute times, for any prediction h, is of the form:

    R_sq(h) = (1/5) Σ_{i=1}^{5} (y_i − h)²

    R stands for “risk,” as in “empirical risk.”
  • For example, if we predict h = 80, then R_sq(80) is the average of the five squared losses for that prediction.
  • We can pick any h as a prediction, but the smaller R_sq(h) is, the better h is.

Visualizing mean squared error


R_sq(h), viewed as a function of h, is an upward-opening parabola; its vertex is at the minimizing prediction.
We want to find the h that minimizes the MSE.

Mean squared error, in general

  • Suppose we collect n commute times, y_1, y_2, …, y_n.
  • The mean squared error of the prediction h is the average of the squared losses.
  • Or, using summation notation:

    R_sq(h) = (1/n) Σ_{i=1}^{n} (y_i − h)²

    In code (note that Python lists are 0-indexed):

total = 0
for i in range(n):
    total += (y[i] - h) ** 2
total = total / n
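The same computation can be done in vectorized form with NumPy (assuming NumPy is available; the data here are hypothetical):

```python
import numpy as np

y = np.array([61, 72, 85, 90, 92])  # hypothetical commute times, in minutes
h = 80.0                            # a candidate prediction

mse = np.mean((y - h) ** 2)
print(mse)  # 138.8
```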

The best prediction

  • We want the best prediction, h*.
  • The smaller R_sq(h) is, the better h is.
  • Goal: Find the h that minimizes R_sq(h).
    The resulting h will be called h*.
  • How do we find h*?
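Next lecture finds h* analytically; as a preview, we can approximate it by brute force, trying many candidate values of h on a hypothetical dataset and keeping the one with the smallest MSE:

```python
commutes = [61, 72, 85, 90, 92]  # hypothetical commute times, in minutes

def mse(h):
    """R_sq(h): mean squared error of the prediction h."""
    return sum((y - h) ** 2 for y in commutes) / len(commutes)

# Try h = 60.0, 60.1, ..., 100.0 and keep the minimizer.
candidates = [k / 10 for k in range(600, 1001)]
h_star = min(candidates, key=mse)
print(h_star)  # 80.0 -- the minimizer lands at this dataset's mean
```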

Summary

  • We started with the abstract problem:

Given historical commute times, predict your future commute time.

  • We’ve turned it into a formal optimization problem:

Find the prediction that has the smallest mean squared error on the data.

  • Implicitly, we introduced a three-step modeling process that we’ll keep revisiting:
    1. Choose a model.
    2. Choose a loss function.
    3. Minimize average loss, R(h).
  • Next time: We’ll solve this optimization problem by hand.