Question: Would a linear hypothesis function work well on this dataset?
Linear in the parameters
We can fit rules like:
$H(x) = w_0 + w_1 x + w_2 x^2$, or $H(x) = w_1 e^x + w_2 \cos(x)$, or $H(x) = w_0 + w_1 \log x$.
This includes arbitrary polynomials.
These are all linear combinations of (just) features.
We can't fit rules like:
$H(x) = w_0 + e^{w_1 x}$, or $H(x) = \cos(w_0 x + w_1)$.
These are not linear combinations of just features!
We can have any number of parameters, as long as our hypothesis function is linear in the parameters, i.e., linear when we think of it as a function of the parameters.
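To make this concrete, here is a minimal sketch (assuming NumPy and made-up data, neither of which comes from the lecture) of fitting a hypothesis that is quadratic in $x$ but still linear in the parameters, using ordinary least squares:

```python
import numpy as np

# Made-up data, just for illustration.
x = np.array([0.0, 1.0, 2.0, 3.0, 4.0])
y = np.array([1.2, 2.1, 4.8, 9.9, 17.2])

# H(x) = w0 + w1*x + w2*x^2 is quadratic in x, but it is a linear
# combination of the features 1, x, and x^2, so we can build a design
# matrix with those features as columns...
X = np.column_stack([np.ones_like(x), x, x ** 2])

# ...and find the optimal parameters with ordinary least squares.
w_star, *_ = np.linalg.lstsq(X, y, rcond=None)
print(w_star)  # [w0*, w1*, w2*]
```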
Example: Amdahl’s Law
Amdahl’s Law relates the runtime of a program on $p$ processors to the time needed to do the sequential and non-sequential (parallelizable) parts of the program on one processor: $H(p) = w_0 + w_1 \cdot \frac{1}{p}$, where $w_0$ is the time for the sequential part and $w_1$ is the time for the parallelizable part.
Collect data by timing a program with varying numbers of processors:
Processors | Time (Hours)
1 | 8
2 | 4
4 | 3
Example: Fitting
What are $w_0^*$ and $w_1^*$? Find $\vec{w}^* = \begin{bmatrix} w_0^* \\ w_1^* \end{bmatrix}$ by solving the normal equations, $X^T X \vec{w} = X^T \vec{y}$, where $X = \begin{bmatrix} 1 & 1 \\ 1 & \frac{1}{2} \\ 1 & \frac{1}{4} \end{bmatrix}$ and $\vec{y} = \begin{bmatrix} 8 \\ 4 \\ 3 \end{bmatrix}$.
This is a system of 2 equations in 2 unknowns. $X^T X$ is invertible because $X$ is full rank, i.e., all of $X$'s columns are linearly independent.
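Here is a minimal sketch, assuming NumPy, that builds this design matrix from the data in the table above and solves the normal equations:

```python
import numpy as np

# Data from the table above: (processors, hours).
p = np.array([1.0, 2.0, 4.0])
y = np.array([8.0, 4.0, 3.0])

# H(p) = w0 + w1 * (1/p) is linear in (w0, w1), so the design matrix
# has a column of ones and a column of 1/p values.
X = np.column_stack([np.ones_like(p), 1 / p])

# Solve the normal equations X^T X w = X^T y.
w_star = np.linalg.solve(X.T @ X, X.T @ y)
print(w_star)  # approximately [1.0, 6.86]
```

So the fitted rule is roughly $H(p) = 1 + \frac{6.86}{p}$: about 1 hour of sequential work and about 6.86 hours of parallelizable work.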
How do we fit hypothesis functions that aren’t linear in the parameters?
Suppose we want to fit the hypothesis function:
$H(x) = w_0 e^{w_1 x}$
This is not linear in terms of $w_0$ and $w_1$, so our results for linear regression don't apply.
Possible solution: Try to apply a transformation.
Transformations
Question: Can we re-write $H(x)$ as a hypothesis function that is linear in the parameters?
Try taking the log of both sides (recall that $\log(ab) = \log a + \log b$):
$\log H(x) = \log\left(w_0 e^{w_1 x}\right) = \log w_0 + w_1 x$
The result is linear in the parameters $\log w_0$ and $w_1$.
Solution: Create a new hypothesis function, $T(x)$, with parameters $b_1$ and $b_2$, where $T(x) = b_1 + b_2 x$.
This hypothesis function is related to $H(x)$ by the relationship $T(x) = \log H(x)$.
$\vec{b}$ is related to $\vec{w}$ by $b_1 = \log w_0$ and $b_2 = w_1$.
Our new observation vector, $\vec{z}$, is $\begin{bmatrix} \log y_1 & \log y_2 & \cdots & \log y_n \end{bmatrix}^T$.
$T(x)$ is linear in its parameters, $b_1$ and $b_2$.
Use the solution to the normal equations to find $\vec{b}^*$, and the relationship between $\vec{b}$ and $\vec{w}$ to find $\vec{w}^*$.
Solve: $X^T X \vec{b} = X^T \vec{z}$, then recover $w_0^* = e^{b_1^*}$ and $w_1^* = b_2^*$.
Once again, let’s try it out! Follow along in this notebook.
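Here is a minimal sketch of this transformation approach, assuming NumPy and made-up data (the linked notebook may do this differently):

```python
import numpy as np

# Made-up data that roughly follows y = w0 * e^(w1 * x).
x = np.array([0.0, 1.0, 2.0, 3.0, 4.0])
y = np.array([2.1, 5.6, 14.9, 40.5, 110.0])

# Transformed observations: z_i = log(y_i).
z = np.log(y)

# T(x) = b1 + b2*x is linear in (b1, b2), so solve the normal equations
# using the transformed observation vector z.
X = np.column_stack([np.ones_like(x), x])
b_star = np.linalg.solve(X.T @ X, X.T @ z)

# Map back to the original parameters: w0* = e^(b1*), w1* = b2*.
w0_star, w1_star = np.exp(b_star[0]), b_star[1]
print(w0_star, w1_star)
```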
Non-linear hypothesis functions in general
Sometimes, it’s just not possible to transform a hypothesis function to be linear in terms of some parameters.
In those cases, you’d have to resort to other methods of finding the optimal parameters.
For example, a hypothesis function like $H(x) = w_0 \sin(w_1 x)$ can't be transformed to be linear in the parameters.
But, there are other methods of minimizing mean squared error:
One method: gradient descent, the topic we’re going to look at next!
Hypothesis functions that are linear in the parameters are much easier to work with.
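To illustrate the "other methods" mentioned above, here is a minimal sketch (assuming SciPy and made-up data) that uses a general-purpose numerical minimizer to fit a hypothesis that can't be transformed to be linear; gradient descent, introduced next, is another such method:

```python
import numpy as np
from scipy.optimize import minimize

# Made-up data for a hypothesis that isn't linear in its parameters.
x = np.array([0.0, 0.5, 1.0, 1.5, 2.0])
y = np.array([0.1, 1.9, 2.8, 1.7, 0.2])

# Mean squared error of H(x) = w0 * sin(w1 * x), as a function of w = (w0, w1).
def mse(w):
    predictions = w[0] * np.sin(w[1] * x)
    return np.mean((y - predictions) ** 2)

# A numerical optimizer can search for good parameters even though
# the normal equations don't apply here.
result = minimize(mse, x0=[1.0, 1.0])
print(result.x)  # estimated (w0*, w1*)
```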
Roadmap
This is the end of the content that’s in scope for the Midterm Exam.
Now, we’ll introduce gradient descent, a technique for minimizing functions that can’t be minimized directly using calculus or linear algebra.
After the Midterm Exam, we’ll:
Finish gradient descent.
Look at a technique for identifying patterns in data when there is no “right answer”, called clustering.
Switch gears to probability.
The modeling recipe: figuring out the best way to make predictions.
Choose a model.
Model | Prediction for a single data point | Predictions for all $n$ data points
constant | $h$ | $h \vec{1}$
simple linear regression | $w_0 + w_1 x_i$ | $X \vec{w}$
Choose a loss function.
Squared loss: $L_{sq}(y_i, H(x_i)) = (y_i - H(x_i))^2$
Absolute loss: $L_{abs}(y_i, H(x_i)) = |y_i - H(x_i)|$
0-1 loss: $L_{0,1}(y_i, H(x_i)) = 0$ if $H(x_i) = y_i$, and $1$ otherwise
Relative squared loss
Minimize average loss to find optimal model parameters; see the sketch in code below.
Programming Q in HW 3
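Here is a minimal sketch of the recipe for the constant model with squared loss, assuming NumPy and made-up data (the HW 3 programming question may use different models and losses):

```python
import numpy as np

# Made-up data for illustration.
y = np.array([3.0, 5.0, 7.0, 9.0])

# Step 1: choose a model, e.g. the constant model H(x) = h.
# Step 2: choose a loss function, e.g. squared loss.
def squared_loss(y_i, prediction):
    return (y_i - prediction) ** 2

# Step 3: minimize average loss (empirical risk) over the parameter h.
def mean_squared_error(h):
    return np.mean(squared_loss(y, h))

# For squared loss and the constant model, the mean of y minimizes average loss.
print(mean_squared_error(np.mean(y)))  # 5.0, the smallest achievable value
print(mean_squared_error(5.0))         # 6.0, larger, since 5.0 is not the mean
```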
Minimizing functions using gradient descent
Minimizing empirical risk
Repeatedly, we’ve been tasked with minimizing the value of empirical risk functions.
Why? To help us find the best model parameters, like $h^*$ or $\vec{w}^*$, which help us make the best predictions!
We’ve minimized empirical risk functions in various ways.
calculus
a for-loop that brute-forced over many candidate lines
linear algebra: spans, projections
Minimizing arbitrary functions
Assume $f(t)$ is some differentiable single-variable function.
When tasked with minimizing $f(t)$, our general strategy has been to:
Find $\frac{df}{dt}(t)$, the derivative of $f$.
Find the input $t^*$ such that $\frac{df}{dt}(t^*) = 0$.
However, there are cases where we can find $\frac{df}{dt}(t)$, but it is either difficult or impossible to solve $\frac{df}{dt}(t) = 0$.
Then what?
What does the derivative of a function tell us?
Goal: Given a differentiable function $f(t)$, find the input $t^*$ that minimizes $f(t)$.
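As a preview, here is a minimal sketch, using a made-up function, of repeatedly using the derivative to step toward a minimizer; this is the idea behind gradient descent:

```python
# Made-up example: f(t) = (t - 3)^2 + 1, whose derivative is df/dt = 2 * (t - 3).
def df(t):
    return 2 * (t - 3)

t = 0.0        # initial guess
alpha = 0.1    # step size
for _ in range(100):
    t = t - alpha * df(t)   # move opposite the sign of the derivative
print(t)  # very close to 3, the minimizer of f
```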