Source: https://dsc40a.com/resources/lectures/lec01/lec01-blank.pdf
Introduction to Modeling
What is this class about?
Theoretical Foundations of Data Science I
Mathematical Foundations of Machine Learning
Why do we need to study theoretical foundations?
To understand and improve the tools we use.
Machine learning is about automatically learning patterns from data.
Humans are good at understanding handwriting - but how do we get computers to understand handwriting?
https://dsc40a.com
https://practice.dsc40a.com
Commute time example
Goal: Predict your commute time.
That is, predict how long it’ll take to get to school.
How can we do this? Learn from data.
What will we need to assume? That data in the future looks like data from the past.
A model is a set of assumptions about how data were generated. ("Data" is plural; "datum" is singular.)
"All models are wrong, but some are useful."
Possible Models Of Commuting Time
- Simple linear regression model - Assumes the number of minutes it takes you to get to school is a linear function of the time you leave home.
- Constant model - Assumes all commute times are constant no matter what time you leave during the day.
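The two models above can be sketched as Python functions. This is just an illustration; the parameter names (`h`, `w0`, `w1`) and the example values are assumptions, not fixed by the slides.

```python
def constant_model(h):
    """Constant model: predicts the same commute time h, no matter when you leave."""
    def H(x):
        return h  # the input x is ignored entirely
    return H

def linear_model(w0, w1):
    """Simple linear regression model: prediction is a linear function of departure time x."""
    def H(x):
        return w0 + w1 * x
    return H

# A constant model that always predicts 80 minutes, and a (made-up)
# linear model with intercept 60 and slope 2.
H_const = constant_model(80)
H_line = linear_model(60, 2)
print(H_const(7), H_const(9))  # same prediction regardless of departure time
print(H_line(7), H_line(9))    # prediction changes with departure time
```

Note that each model is really a *family* of hypothesis functions: picking concrete parameter values picks one member of the family.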
Notation
$x$: "input", "independent variable", or "feature"
$y$: "response", "dependent variable", or "target"
We use $x$ to predict $y$.
The $i$th observation is denoted $(x_i, y_i)$.
Hypothesis functions
A hypothesis function, $H$, takes in an $x$ as input and returns a predicted $y$.
Parameters
Parameters define the relationship between the input and output of a hypothesis function.
The constant model, $H(x) = h$
- Uses a single parameter, $h$, to make every prediction (this is our hypothesis function).
The simple linear regression model, $H(x) = w_0 + w_1 x$
- $w_0$ is the intercept and $w_1$ is the slope.
- Changing the parameters changes how predictions are made.
The constant model
Where should we draw the horizontal line?
Transitioning from scatter plot to histogram
- Horizontal line to vertical line
A concrete example
Consider a smaller dataset of just five historical commute times, in minutes. How can you come up with a prediction for your future commute time?
Possible Strategies: Average, Median, Midrange
Summary Statistics
Summarize a collection of numbers with a single number
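The three candidate summary statistics can be computed directly. The five-element dataset below is made up for illustration; the slides' actual values are not reproduced here.

```python
import statistics

# Hypothetical commute times in minutes (illustrative only).
times = [55, 70, 80, 85, 95]

average = statistics.mean(times)          # sum of values / number of values
median = statistics.median(times)         # middle value when sorted
midrange = (min(times) + max(times)) / 2  # halfway between the extremes

print(average, median, midrange)
```

Each strategy summarizes the same five numbers with a single number, but they generally disagree; the rest of the lecture builds a principled way to choose among such predictions.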
The cost of making predictions
Loss function
A loss function quantifies how bad a prediction is for a single data point.
- If our prediction is close to the actual value, we should have a low loss.
- If our prediction is far from the actual value, we should have high loss.
A good starting point is error, which is the difference between actual and predicted values.
Suppose my commute actually takes 80 minutes.
- If I predict 75 minutes: error = 80 − 75 = 5 minutes.
- If I predict 72 minutes: error = 80 − 72 = 8 minutes.
- If I predict 100 minutes: error = 80 − 100 = −20 minutes.
Issue: Some errors are negative, others are positive
Squared loss
One loss function is squared loss, $L_{sq}(y, H(x)) = (y - H(x))^2$, which computes $(\text{actual} - \text{predicted})^2$.
Note that the constant model always predicts $h$, so we can simplify this to $L_{sq}(y, h) = (y - h)^2$.
Note:
Squared loss is not the only loss function that exists. It is popular because it is differentiable everywhere. Absolute loss, $|y - h|$, another loss function, is not differentiable at 0.
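Both loss functions are one-liners in code. The sketch below evaluates them on the 80-minute commute example from the previous slide; the function names are illustrative.

```python
def squared_loss(actual, predicted):
    """Squared loss: (actual - predicted)^2. Non-negative, and smooth everywhere."""
    return (actual - predicted) ** 2

def absolute_loss(actual, predicted):
    """Absolute loss: |actual - predicted|. Non-negative, but has a corner at 0 error."""
    return abs(actual - predicted)

# An 80-minute commute, with the three predictions from the error example:
for pred in [75, 72, 100]:
    print(pred, squared_loss(80, pred), absolute_loss(80, pred))
```

Either loss turns negative and positive errors into non-negative numbers, fixing the sign issue; squared loss additionally penalizes large errors much more heavily.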
A concrete example, revisited
Suppose we predict the median. What is the squared loss of this prediction for each data point?
Goal: Find a single number that describes the loss for the prediction on my entire dataset.
Averaging squared losses
We’d like a single number that describes the quality of our predictions across our entire dataset. One way to compute this is as the average of the squared losses.
- For the median: average its squared losses across all five data points.
- For the mean, $h = 80$: compute the same average; it comes out lower.
80 is the better prediction, since its average squared loss is lower.
Mean squared error
L: loss for one data point
R: average loss over all data points
- Another term for average squared loss is mean squared error (MSE).
- The mean squared error on our smaller dataset for any prediction $h$ is of the form:
  $R_{sq}(h) = \frac{1}{5} \sum_{i=1}^{5} (y_i - h)^2$
- $R$ stands for "risk", as in "empirical risk."
- For example, if we predict $h = 80$, then:
  $R_{sq}(80) = \frac{1}{5} \sum_{i=1}^{5} (y_i - 80)^2$
- We can pick any $h$ as a prediction, but the smaller $R_{sq}(h)$ is, the better $h$ is.
Visualizing mean squared error
$R_{sq}(h)$ is an upward-opening parabola in $h$.
We want to minimize MSE: find the $h$ at the parabola's vertex.
Mean squared error, in general
- Suppose we collect $n$ commute times, $y_1, y_2, \dots, y_n$.
- The mean squared error of the prediction $h$ is:
  $R_{sq}(h) = \frac{1}{n} \left( (y_1 - h)^2 + (y_2 - h)^2 + \dots + (y_n - h)^2 \right)$
- Or, using summation notation:
  $R_{sq}(h) = \frac{1}{n} \sum_{i=1}^{n} (y_i - h)^2$
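The general formula translates directly into a short function. The `mse` function below is a sketch of $R_{sq}(h)$; the dataset is again made up for illustration.

```python
def mse(h, ys):
    """Mean squared error of the constant prediction h on commute times ys:
    R_sq(h) = (1/n) * sum over i of (y_i - h)^2."""
    return sum((y - h) ** 2 for y in ys) / len(ys)

# Illustrative dataset (not the slides' actual values); its mean is 77.
times = [55, 70, 80, 85, 95]
print(mse(80, times), mse(77, times))  # the mean of the data gives a lower MSE than 80
```

Evaluating `mse` at several candidate predictions is exactly the comparison we did by hand for the median and the mean.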
The best prediction
- We want the best prediction, $h^*$.
- The smaller $R_{sq}(h)$ is, the better $h$ is.
- Goal: Find the $h$ that minimizes $R_{sq}(h)$. The resulting $h$ will be called $h^*$.
- How do we find $h^*$?
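Before minimizing by hand, we can approximate $h^*$ numerically with a brute-force grid search. This is only a sketch: the dataset and the grid of candidate predictions are made up. Notice that the minimizing $h$ lands at the mean of the data.

```python
def mse(h, ys):
    """R_sq(h) = (1/n) * sum over i of (y_i - h)^2."""
    return sum((y - h) ** 2 for y in ys) / len(ys)

# Illustrative dataset (not the slides' actual values); its mean is 77.
times = [55, 70, 80, 85, 95]

# Brute-force search over candidate predictions on a fine grid: 50.00 .. 100.00.
candidates = [h / 100 for h in range(5000, 10001)]
h_star = min(candidates, key=lambda h: mse(h, times))
print(h_star)  # the grid point with the smallest MSE
```

Grid search only checks finitely many candidates; the by-hand minimization we'll do next finds the exact minimizer directly.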
Summary
- We started with the abstract problem:
Given historical commute times, predict your future commute time.
- We’ve turned it into a formal optimization problem:
Find the prediction $h^*$ that has the smallest mean squared error on the data.
- Implicitly, we introduced a three-step modeling process that we’ll keep revisiting:
- Choose a model.
- Choose a loss function.
- Minimize average loss, $R_{sq}(h)$.
- Next time: We'll solve this optimization problem by hand.