Source: https://dsc40a.com/resources/lectures/lec10/lec10-filled.pdf

Feature Engineering, Gradient Descent

Feature engineering and transformations

Question: Would a linear hypothesis function work well on this dataset?

Linear in the parameters

  • We can fit rules like:

    $H(x) = w_0 + w_1 x + w_2 x^2 \qquad H(x) = w_0 + w_1 \log x \qquad H(x) = w_1 e^x$

  • This includes arbitrary polynomials.
  • These are all linear combinations of (just) features.
  • We can’t fit rules like:

    $H(x) = w_0 + e^{w_1 x} \qquad H(x) = w_0 \cos(w_1 x)$

  • These are not linear combinations of just features!
  • We can have any number of parameters, as long as our hypothesis function is linear in the parameters – that is, linear when we think of it as a function of the parameters.
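
To make this concrete, here is a small sketch (with made-up data, purely for illustration) showing that fitting a polynomial is still an ordinary least squares problem, because the parameters only ever multiply (possibly transformed) features:

```python
import numpy as np

# Hypothetical data, just for illustration.
x = np.array([0.0, 1.0, 2.0, 3.0, 4.0])
y = np.array([1.2, 0.9, 2.1, 5.0, 9.8])

# Design matrix for H(x) = w0 + w1*x + w2*x^2.
# Each column is a (possibly transformed) feature; the parameters
# w0, w1, w2 appear only as linear coefficients.
X = np.column_stack([np.ones_like(x), x, x ** 2])

# Least squares solution to X w ≈ y (equivalent to solving the normal equations).
w, *_ = np.linalg.lstsq(X, y, rcond=None)
print(w)  # [w0*, w1*, w2*]
```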

Example: Amdahl’s Law

  • Amdahl’s Law relates the runtime of a program on $p$ processors to the time needed to do the sequential and non-sequential parts of the program on one processor:

    $\text{runtime on } p \text{ processors} = t_{\text{sequential}} + \frac{t_{\text{non-sequential}}}{p}$

  • Collect data by timing a program with varying numbers of processors:
    Processors | Time (Hours)
    -----------|-------------
    1          | 8
    2          | 4
    4          | 3

Example: Fitting

    Processors | Time (Hours)
    -----------|-------------
    1          | 8
    2          | 4
    4          | 3

Following Amdahl’s Law, fit a hypothesis function of the form $H(x) = w_0 + w_1 \cdot \frac{1}{x}$, where $x$ is the number of processors.

What are $X$ and $\vec{y}$? Find $\vec{w}^*$ by solving the normal equations, $X^T X \vec{w}^* = X^T \vec{y}$, with

$X = \begin{bmatrix} 1 & 1/1 \\ 1 & 1/2 \\ 1 & 1/4 \end{bmatrix} \qquad \vec{y} = \begin{bmatrix} 8 \\ 4 \\ 3 \end{bmatrix}$

This is a system of 2 equations in 2 unknowns, $w_0^*$ and $w_1^*$.

$X^T X$ is invertible because $X$ is full rank: all of $X$’s columns are linearly independent.
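
As a quick numeric check of this example (a sketch, not part of the original slides; it assumes the $H(x) = w_0 + w_1 \cdot \frac{1}{x}$ form above), we can build the design matrix with a $\frac{1}{x}$ feature and solve the normal equations with numpy:

```python
import numpy as np

# Observed data from the slide: (processors, hours).
procs = np.array([1.0, 2.0, 4.0])
hours = np.array([8.0, 4.0, 3.0])

# Design matrix for H(x) = w0 + w1 * (1/x).
X = np.column_stack([np.ones_like(procs), 1 / procs])
y = hours

# Normal equations: (X^T X) w* = X^T y. Here X^T X is 2x2, so this is a
# system of 2 equations in 2 unknowns; it is invertible because the
# columns of X are linearly independent.
w_star = np.linalg.solve(X.T @ X, X.T @ y)
print(w_star)  # roughly [1.0, 6.857], i.e. H(x) ≈ 1 + 6.86 / x
```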

How do we fit hypothesis functions that aren’t linear in the parameters?

  • Suppose we want to fit the hypothesis function:

    $H(x) = w_0 e^{w_1 x}$

  • This is not linear in terms of $w_0$ and $w_1$, so our results for linear regression don’t apply.
  • Possible solution: Try to apply a transformation.

Transformations

  • Question: Can we re-write $H(x) = w_0 e^{w_1 x}$ as a hypothesis function that is linear in the parameters?

    Idea: Try to take the log of both sides.

    $\log H(x) = \log\left(w_0 e^{w_1 x}\right) = \log w_0 + w_1 x$

    This is linear in the parameters $\log w_0$ and $w_1$.


Transformations

  • Solution: Create a new hypothesis function, $T(x)$, with parameters $b_0$ and $b_1$, where $T(x) = b_0 + b_1 x$.
  • This hypothesis function is related to $H(x)$ by the relationship $T(x) = \log H(x)$.
  • $\vec{b}$ is related to $\vec{w}$ by $b_0 = \log w_0$ and $b_1 = w_1$.
  • Our new observation vector, $\vec{z}$, is the log of the original observations: $z_i = \log y_i$.
  • $T(x)$ is linear in its parameters, $b_0$ and $b_1$.
  • Use the solution to the normal equations to find $\vec{b}^*$, and the relationship between $\vec{b}$ and $\vec{w}$ to find $\vec{w}^*$.
    Solve: $X^T X \vec{b}^* = X^T \vec{z}$, then recover $w_0^* = e^{b_0^*}$ and $w_1^* = b_1^*$.

Once again, let’s try it out! Follow along in this notebook.
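
As a rough sketch of the procedure above (the notebook’s actual data isn’t reproduced here, so the numbers below are made up):

```python
import numpy as np

# Hypothetical data that roughly follows y = w0 * e^(w1 * x).
x = np.array([0.0, 1.0, 2.0, 3.0, 4.0])
y = np.array([2.1, 5.3, 15.0, 40.5, 110.0])

# Transformed observations: z_i = log(y_i).
z = np.log(y)

# T(x) = b0 + b1*x is linear in its parameters, so solve the normal equations.
X = np.column_stack([np.ones_like(x), x])
b_star = np.linalg.solve(X.T @ X, X.T @ z)

# Undo the transformation: b0 = log(w0), b1 = w1.
w0_star = np.exp(b_star[0])
w1_star = b_star[1]
print(w0_star, w1_star)
```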

Non-linear hypothesis functions in general

  • Sometimes, it’s just not possible to transform a hypothesis function to be linear in terms of some parameters.
  • In those cases, you’d have to resort to other methods of finding the optimal parameters.
  • For example, a hypothesis function such as $H(x) = w_0 + \sin(w_1 x)$ can’t be transformed to be linear.
  • But, there are other methods of minimizing mean squared error:
  • One method: gradient descent, the topic we’re going to look at next!
  • Hypothesis functions that are linear in the parameters are much easier to work with.

Roadmap

  • This is the end of the content that’s in scope for the Midterm Exam.
  • Now, we’ll introduce gradient descent, a technique for minimizing functions that can’t be minimized directly using calculus or linear algebra.
  • After the Midterm Exam, we’ll:
  • Finish gradient descent.
  • Look at a technique for identifying patterns in data when there is no “right answer,” called clustering.
  • Switch gears to probability.

The modeling recipe: figuring out the best way to make predictions

  1. Choose a model.
     • Constant model: $H(x) = h$
     • Simple linear regression: $H(x) = w_0 + w_1 x$
     • Prediction for a single data point
     • Predictions for all $n$ data points: $\vec{h} = X \vec{w}$
  2. Choose a loss function.
     • Squared loss: $L_{\text{sq}}(y_i, H(x_i)) = (y_i - H(x_i))^2$
     • Absolute loss: $L_{\text{abs}}(y_i, H(x_i)) = |y_i - H(x_i)|$
     • 0-1 loss
     • Relative squared loss
  3. Minimize average loss to find optimal model parameters.
     • Programming Q in HW 3
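
For instance, instantiating this recipe with simple linear regression and squared loss gives the empirical risk we’ve minimized before:

    $H(x) = w_0 + w_1 x \qquad L_{\text{sq}}(y_i, H(x_i)) = (y_i - H(x_i))^2$

    $R_{\text{sq}}(w_0, w_1) = \frac{1}{n} \sum_{i=1}^n \left( y_i - (w_0 + w_1 x_i) \right)^2$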

Minimizing functions using gradient descent

Minimizing empirical risk

  • Repeatedly, we’ve been tasked with minimizing the value of empirical risk functions.
  • Why? To help us find the best model parameters, like $h^*$ or $\vec{w}^*$, which help us make the best predictions!
  • We’ve minimized empirical risk functions in various ways.
  • calculus
  • for-loop, brute-forced all possible lines
  • linear algebra: spans, projections

Minimizing arbitrary functions

  • Assume $f(t)$ is some differentiable single-variable function.
  • When tasked with minimizing $f(t)$, our general strategy has been to:
  1. Find $\frac{df}{dt}(t)$, the derivative of $f$.
  2. Find the input $t^*$ such that $\frac{df}{dt}(t^*) = 0$.
  • However, there are cases where we can find $\frac{df}{dt}(t)$, but it is either difficult or impossible to solve $\frac{df}{dt}(t) = 0$.

  • Then what?
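
As a concrete illustration (an example of our own choosing, not taken from the slides), consider:

    $f(t) = t^2 + e^t \qquad \frac{df}{dt}(t) = 2t + e^t$

The derivative is easy to write down, but $\frac{df}{dt}(t) = 0$ means $2t + e^t = 0$, which has no solution in terms of elementary functions, so we can’t isolate $t^*$ by hand.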

What does the derivative of a function tell us?

  • Goal: Given a differentiable function $f(t)$, find the input $t^*$ that minimizes $f(t)$.
  • What does $\frac{df}{dt}(t_0)$ mean? It’s the slope of the tangent line to the graph of $f$ at the point $(t_0, f(t_0))$.

See dsc40a.com/resources/lectures/lec10 for an animated version of the previous slide!

Let’s go hiking!

  • Suppose you’re at the top of a mountain 🏔️ and need to get to the bottom.
  • Further, suppose it’s really cloudy ☁️, meaning you can only see a few feet around you.
  • How would you get to the bottom?

Searching for the minimum

Suppose we’re given an initial guess $t_0$ for a value of $t$ that minimizes $f(t)$.
If the slope of the tangent line at $t_0$ is positive 📈:

  • Increasing $t$ increases $f(t)$.
  • This means the minimum must be to the left of the point $(t_0, f(t_0))$.
  • Solution: Decrease $t_0$ ⬇️.

Searching for the minimum

Suppose we’re given an initial guess $t_0$ for a value of $t$ that minimizes $f(t)$.
If the slope of the tangent line at $t_0$ is negative 📉:

  • Increasing $t$ decreases $f(t)$.
  • This means the minimum must be to the right of the point $(t_0, f(t_0))$.
  • Solution: Increase $t_0$ ⬆️.

Intuition

  • To minimize $f(t)$, start with an initial guess $t_0$.
  • Where do we go next?
  • If $\frac{df}{dt}(t_0) > 0$, decrease $t_0$.
  • If $\frac{df}{dt}(t_0) < 0$, increase $t_0$.
  • One way to accomplish this:

    $t_1 = t_0 - \frac{df}{dt}(t_0)$

Gradient descent

To minimize a differentiable function $f$:

  • Pick a positive number, $\alpha$. This number is called the learning rate, or step size.
  • Pick an initial guess, $t_0$.
  • Then, repeatedly update your guess using the update rule:

    $t_{i+1} = t_i - \alpha \frac{df}{dt}(t_i)$

  • Repeat this process until convergence – that is, until $t_i$ doesn’t change much from one iteration to the next.
  • This procedure is called gradient descent.
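
Here is a minimal sketch of this procedure in Python; the example function and the values of $\alpha$ and $t_0$ are illustrative choices, not from the slides:

```python
def gradient_descent(df, t0, alpha, tol=1e-8, max_iter=10_000):
    """Minimize a differentiable single-variable function, given its
    derivative df, an initial guess t0, and a learning rate alpha."""
    t = t0
    for _ in range(max_iter):
        t_next = t - alpha * df(t)   # update rule: t_{i+1} = t_i - alpha * f'(t_i)
        if abs(t_next - t) < tol:    # convergence: the guess barely changes
            return t_next
        t = t_next
    return t

# Example: minimize f(t) = (t - 3)^2, whose derivative is 2(t - 3).
t_star = gradient_descent(df=lambda t: 2 * (t - 3), t0=0.0, alpha=0.1)
print(t_star)  # approximately 3
```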

What is gradient descent?

  • Gradient descent is a numerical method for finding the input to a function that minimizes the function.
  • Why is it called gradient descent?
  • The gradient is the extension of the derivative to functions of multiple variables.
  • We will see how to use gradient descent with multivariate functions next class.
  • What is a numerical method?
  • A numerical method is a technique for approximating the solution to a mathematical problem, often by using the computer.
  • Gradient descent is widely used in machine learning, to train models from linear regression to neural networks and transformers (including ChatGPT)!

See dsc40a.com/resources/lectures/lec10 for animated examples of gradient descent, and see this notebook for the associated code!

Lingering questions

Next class, we’ll explore the following ideas:

  • When is gradient descent guaranteed to converge to a global minimum?
  • What kinds of functions work well with gradient descent?
  • How do I choose a step size?
  • How do I use gradient descent to minimize functions of multiple variables, e.g. $R_{\text{sq}}(w_0, w_1) = \frac{1}{n} \sum_{i=1}^n \left( y_i - (w_0 + w_1 x_i) \right)^2$?

Gradient descent and empirical risk minimization

  • While gradient descent can minimize other kinds of differentiable functions, its most common use case is in minimizing empirical risk.
  • For example, consider:
  • The constant model, $H(x) = h$.
  • The dataset $y_1, y_2, \ldots, y_n$.
  • The initial guess $h_0$ and the learning rate $\alpha$.
  • Exercise: Find $h_1$ and $h_2$, the next two guesses produced by gradient descent.
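
Here is a sketch of that exercise in code, using a placeholder dataset, initial guess, and learning rate (the actual values from the slide aren’t shown above):

```python
import numpy as np

# Placeholder dataset and hyperparameters, just for illustration.
y = np.array([2.0, 4.0, 9.0])
alpha = 0.1
h = 0.0  # initial guess h_0

# Empirical risk for the constant model H(x) = h with squared loss:
#   R(h) = (1/n) * sum_i (y_i - h)^2,  so  dR/dh = -(2/n) * sum_i (y_i - h).
def dR_dh(h, y):
    return -2 * np.mean(y - h)

# Two steps of gradient descent: h_{i+1} = h_i - alpha * dR/dh(h_i).
for i in range(2):
    h = h - alpha * dR_dh(h, y)
    print(f"h_{i + 1} = {h}")

# For reference, the exact minimizer of R(h) is the mean of y.
print("mean(y) =", np.mean(y))
```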