Source: https://dsc40a.com/resources/lectures/lec10/lec10-filled.pdf

Feature Engineering, Gradient Descent

Feature engineering and transformations

Question: Would a linear hypothesis function work well on this dataset?

Linear in the parameters

  • We can fit rules like:

    $H(x) = w_0 + w_1 x + w_2 x^2 \qquad H(x) = w_0 + w_1 \log x \qquad H(x) = w_1 e^x$

  • This includes arbitrary polynomials.
  • These are all linear combinations of (just) features.
  • We can’t fit rules like:

    $H(x) = w_0 + e^{w_1 x} \qquad H(x) = w_0 \cos(w_1 x)$

  • These are not linear combinations of just features!
  • We can have any number of parameters, as long as our hypothesis function is linear in the parameters – that is, linear when we think of it as a function of the parameters.
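
To make this concrete, here is a small sketch (with made-up data, purely for illustration) showing that fitting a polynomial is still an ordinary least squares problem, because the parameters only ever multiply (possibly transformed) features:

```python
import numpy as np

# Hypothetical data, just for illustration.
x = np.array([0.0, 1.0, 2.0, 3.0, 4.0])
y = np.array([1.2, 0.9, 2.1, 5.0, 9.8])

# Design matrix for H(x) = w0 + w1*x + w2*x^2.
# Each column is a (possibly transformed) feature; the parameters
# w0, w1, w2 appear only as linear coefficients.
X = np.column_stack([np.ones_like(x), x, x ** 2])

# Least squares solution to X w ≈ y (equivalent to solving the normal equations).
w, *_ = np.linalg.lstsq(X, y, rcond=None)
print(w)  # [w0*, w1*, w2*]
```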

Example: Amdahl’s Law

  • Amdahl’s Law relates the runtime of a program on $p$ processors to the time needed to do the sequential and non-sequential parts of the program on one processor:

    $\text{runtime on } p \text{ processors} = t_{\text{sequential}} + \frac{t_{\text{non-sequential}}}{p}$

  • Collect data by timing a program with varying numbers of processors:
    Processors | Time (Hours)
    -----------|-------------
    1          | 8
    2          | 4
    4          | 3

Example: Fitting

    Processors | Time (Hours)
    -----------|-------------
    1          | 8
    2          | 4
    4          | 3

Following Amdahl’s Law, fit a hypothesis function of the form $H(x) = w_0 + w_1 \cdot \frac{1}{x}$, where $x$ is the number of processors.

What are $X$ and $\vec{y}$? Find $\vec{w}^*$ by solving the normal equations, $X^T X \vec{w}^* = X^T \vec{y}$, with

$X = \begin{bmatrix} 1 & 1/1 \\ 1 & 1/2 \\ 1 & 1/4 \end{bmatrix} \qquad \vec{y} = \begin{bmatrix} 8 \\ 4 \\ 3 \end{bmatrix}$

This is a system of 2 equations in 2 unknowns, $w_0^*$ and $w_1^*$.

$X^T X$ is invertible because $X$ is full rank: all of $X$’s columns are linearly independent.
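
As a quick numeric check of this example (a sketch, not part of the original slides; it assumes the $H(x) = w_0 + w_1 \cdot \frac{1}{x}$ form above), we can build the design matrix with a $\frac{1}{x}$ feature and solve the normal equations with numpy:

```python
import numpy as np

# Observed data from the slide: (processors, hours).
procs = np.array([1.0, 2.0, 4.0])
hours = np.array([8.0, 4.0, 3.0])

# Design matrix for H(x) = w0 + w1 * (1/x).
X = np.column_stack([np.ones_like(procs), 1 / procs])
y = hours

# Normal equations: (X^T X) w* = X^T y. Here X^T X is 2x2, so this is a
# system of 2 equations in 2 unknowns; it is invertible because the
# columns of X are linearly independent.
w_star = np.linalg.solve(X.T @ X, X.T @ y)
print(w_star)  # roughly [1.0, 6.857], i.e. H(x) ≈ 1 + 6.86 / x
```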

How do we fit hypothesis functions that aren’t linear in the parameters?

  • Suppose we want to fit the hypothesis function:

    $H(x) = w_0 e^{w_1 x}$

  • This is not linear in terms of $w_0$ and $w_1$, so our results for linear regression don’t apply.
  • Possible solution: Try to apply a transformation.

Transformations

  • Question: Can we re-write $H(x) = w_0 e^{w_1 x}$ as a hypothesis function that is linear in the parameters?

    Idea: Try to take the log of both sides.

    $\log H(x) = \log\left(w_0 e^{w_1 x}\right) = \log w_0 + w_1 x$

    This is linear in the parameters $\log w_0$ and $w_1$.


Transformations

  • Solution: Create a new hypothesis function, $T(x)$, with parameters $b_0$ and $b_1$, where $T(x) = b_0 + b_1 x$.
  • This hypothesis function is related to $H(x)$ by the relationship $T(x) = \log H(x)$.
  • $\vec{b}$ is related to $\vec{w}$ by $b_0 = \log w_0$ and $b_1 = w_1$.
  • Our new observation vector, $\vec{z}$, is the log of the original observations: $z_i = \log y_i$.
  • $T(x)$ is linear in its parameters, $b_0$ and $b_1$.
  • Use the solution to the normal equations to find $\vec{b}^*$, and the relationship between $\vec{b}$ and $\vec{w}$ to find $\vec{w}^*$.
    Solve: $X^T X \vec{b}^* = X^T \vec{z}$, then recover $w_0^* = e^{b_0^*}$ and $w_1^* = b_1^*$.

Once again, let’s try it out! Follow along in this notebook.
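
As a rough sketch of the procedure above (the notebook’s actual data isn’t reproduced here, so the numbers below are made up):

```python
import numpy as np

# Hypothetical data that roughly follows y = w0 * e^(w1 * x).
x = np.array([0.0, 1.0, 2.0, 3.0, 4.0])
y = np.array([2.1, 5.3, 15.0, 40.5, 110.0])

# Transformed observations: z_i = log(y_i).
z = np.log(y)

# T(x) = b0 + b1*x is linear in its parameters, so solve the normal equations.
X = np.column_stack([np.ones_like(x), x])
b_star = np.linalg.solve(X.T @ X, X.T @ z)

# Undo the transformation: b0 = log(w0), b1 = w1.
w0_star = np.exp(b_star[0])
w1_star = b_star[1]
print(w0_star, w1_star)
```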

Non-linear hypothesis functions in general

  • Sometimes, it’s just not possible to transform a hypothesis function to be linear in terms of some parameters.
  • In those cases, you’d have to resort to other methods of finding the optimal parameters.
  • For example, a hypothesis function such as $H(x) = w_0 + \sin(w_1 x)$ can’t be transformed to be linear.
  • But, there are other methods of minimizing mean squared error:
  • One method: gradient descent, the topic we’re going to look at next!
  • Hypothesis functions that are linear in the parameters are much easier to work with.

Roadmap

  • This is the end of the content that’s in scope for the Midterm Exam.
  • Now, we’ll introduce gradient descent, a technique for minimizing functions that can’t be minimized directly using calculus or linear algebra.
  • After the Midterm Exam, we’ll:
  • Finish gradient descent.
  • Look at a technique for identifying patterns in data when there is no “right answer,” called clustering.
  • Switch gears to probability.

The modeling recipe: figuring out the best way to make predictions

  1. Choose a model.
     • Constant model: $H(x) = h$
     • Simple linear regression: $H(x) = w_0 + w_1 x$
     • Prediction for a single data point
     • Predictions for all $n$ data points: $\vec{h} = X \vec{w}$
  2. Choose a loss function.
     • Squared loss: $L_{\text{sq}}(y_i, H(x_i)) = (y_i - H(x_i))^2$
     • Absolute loss: $L_{\text{abs}}(y_i, H(x_i)) = |y_i - H(x_i)|$
     • 0-1 loss
     • Relative squared loss
  3. Minimize average loss to find optimal model parameters.
     • Programming Q in HW 3
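
For instance, instantiating this recipe with simple linear regression and squared loss gives the empirical risk we’ve minimized before:

    $H(x) = w_0 + w_1 x \qquad L_{\text{sq}}(y_i, H(x_i)) = (y_i - H(x_i))^2$

    $R_{\text{sq}}(w_0, w_1) = \frac{1}{n} \sum_{i=1}^n \left( y_i - (w_0 + w_1 x_i) \right)^2$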

Minimizing functions using gradient descent

Minimizing empirical risk

  • Repeatedly, we’ve been tasked with minimizing the value of empirical risk functions.
  • Why? To help us find the best model parameters, like $h^*$ or $\vec{w}^*$, which help us make the best predictions!
  • We’ve minimized empirical risk functions in various ways.
  • calculus
  • for-loop, brute-forced all possible lines
  • linear algebra: spans, projections

Minimizing arbitrary functions

  • Assume $f(t)$ is some differentiable single-variable function.
  • When tasked with minimizing $f(t)$, our general strategy has been to:
  1. Find $\frac{df}{dt}(t)$, the derivative of $f$.
  2. Find the input $t^*$ such that $\frac{df}{dt}(t^*) = 0$.
  • However, there are cases where we can find $\frac{df}{dt}(t)$, but it is either difficult or impossible to solve $\frac{df}{dt}(t) = 0$.

  • Then what?
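
As a concrete illustration (an example of our own choosing, not taken from the slides), consider:

    $f(t) = t^2 + e^t \qquad \frac{df}{dt}(t) = 2t + e^t$

The derivative is easy to write down, but $\frac{df}{dt}(t) = 0$ means $2t + e^t = 0$, which has no solution in terms of elementary functions, so we can’t isolate $t^*$ by hand.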

What does the derivative of a function tell us?

  • Goal: Given a differentiable function $f(t)$, find the input $t^*$ that minimizes $f(t)$.
  • What does $\frac{df}{dt}(t_0)$ mean? It’s the slope of the tangent line to the graph of $f$ at the point $(t_0, f(t_0))$.

See dsc40a.com/resources/lectures/lec10 for an animated version of the previous slide!

Let’s go hiking!

  • Suppose you’re at the top of a mountain 🏔️ and need to get to the bottom.
  • Further, suppose it’s really cloudy ☁️, meaning you can only see a few feet around you.
  • How would you get to the bottom?

Searching for the minimum

Suppose we’re given an initial guess $t_0$ for a value of $t$ that minimizes $f(t)$.
If the slope of the tangent line at $t_0$ is positive 📈:

  • Increasing $t$ increases $f(t)$.
  • This means the minimum must be to the left of the point $(t_0, f(t_0))$.
  • Solution: Decrease $t_0$ ⬇️.

Searching for the minimum

Suppose we’re given an initial guess $t_0$ for a value of $t$ that minimizes $f(t)$.
If the slope of the tangent line at $t_0$ is negative 📉:

  • Increasing $t$ decreases $f(t)$.
  • This means the minimum must be to the right of the point $(t_0, f(t_0))$.
  • Solution: Increase $t_0$ ⬆️.

Intuition

  • To minimize $f(t)$, start with an initial guess $t_0$.
  • Where do we go next?
  • If $\frac{df}{dt}(t_0) > 0$, decrease $t_0$.
  • If $\frac{df}{dt}(t_0) < 0$, increase $t_0$.
  • One way to accomplish this:

    $t_1 = t_0 - \frac{df}{dt}(t_0)$

Gradient descent

To minimize a differentiable function $f$:

  • Pick a positive number, $\alpha$. This number is called the learning rate, or step size.
  • Pick an initial guess, $t_0$.
  • Then, repeatedly update your guess using the update rule:

    $t_{i+1} = t_i - \alpha \frac{df}{dt}(t_i)$

  • Repeat this process until convergence – that is, until $t_i$ doesn’t change much from one iteration to the next.
  • This procedure is called gradient descent.
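
Here is a minimal sketch of this procedure in Python; the example function and the values of $\alpha$ and $t_0$ are illustrative choices, not from the slides:

```python
def gradient_descent(df, t0, alpha, tol=1e-8, max_iter=10_000):
    """Minimize a differentiable single-variable function, given its
    derivative df, an initial guess t0, and a learning rate alpha."""
    t = t0
    for _ in range(max_iter):
        t_next = t - alpha * df(t)   # update rule: t_{i+1} = t_i - alpha * f'(t_i)
        if abs(t_next - t) < tol:    # convergence: the guess barely changes
            return t_next
        t = t_next
    return t

# Example: minimize f(t) = (t - 3)^2, whose derivative is 2(t - 3).
t_star = gradient_descent(df=lambda t: 2 * (t - 3), t0=0.0, alpha=0.1)
print(t_star)  # approximately 3
```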

What is gradient descent?

  • Gradient descent is a numerical method for finding the input to a function that minimizes the function.
  • Why is it called gradient descent?
  • The gradient is the extension of the derivative to functions of multiple variables.
  • We will see how to use gradient descent with multivariate functions next class.
  • What is a numerical method?
  • A numerical method is a technique for approximating the solution to a mathematical problem, often by using the computer.
  • Gradient descent is widely used in machine learning, to train models from linear regression to neural networks and transformers (including ChatGPT)!

See dsc40a.com/resources/lectures/lec10 for animated examples of gradient descent, and see this notebook for the associated code!

Lingering questions

Next class, we’ll explore the following ideas:

  • When is gradient descent guaranteed to converge to a global minimum?
  • What kinds of functions work well with gradient descent?
  • How do I choose a step size?
  • How do I use gradient descent to minimize functions of multiple variables, e.g. $R_{\text{sq}}(w_0, w_1) = \frac{1}{n} \sum_{i=1}^n \left( y_i - (w_0 + w_1 x_i) \right)^2$?

Gradient descent and empirical risk minimization

  • While gradient descent can minimize other kinds of differentiable functions, its most common use case is in minimizing empirical risk.
  • For example, consider:
  • The constant model, $H(x) = h$.
  • The dataset $y_1, y_2, \ldots, y_n$.
  • The initial guess $h_0$ and the learning rate $\alpha$.
  • Exercise: Find $h_1$ and $h_2$, the next two guesses produced by gradient descent.
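
Here is a sketch of that exercise in code, using a placeholder dataset, initial guess, and learning rate (the actual values from the slide aren’t shown above):

```python
import numpy as np

# Placeholder dataset and hyperparameters, just for illustration.
y = np.array([2.0, 4.0, 9.0])
alpha = 0.1
h = 0.0  # initial guess h_0

# Empirical risk for the constant model H(x) = h with squared loss:
#   R(h) = (1/n) * sum_i (y_i - h)^2,  so  dR/dh = -(2/n) * sum_i (y_i - h).
def dR_dh(h, y):
    return -2 * np.mean(y - h)

# Two steps of gradient descent: h_{i+1} = h_i - alpha * dR/dh(h_i).
for i in range(2):
    h = h - alpha * dR_dh(h, y)
    print(f"h_{i + 1} = {h}")

# For reference, the exact minimizer of R(h) is the mean of y.
print("mean(y) =", np.mean(y))
```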