Calculus

Calculus for AI & ML


This sheet contains all the topics that will be covered in Calculus for AI & ML.

1 - Calculus Fundamentals

Calculus Fundamentals

Now, let’s dive deeper and understand the concepts required for differentiation, such as limits, continuity, and differentiability.




2 - Optimization

Loss Function, Convexity & Optimization


💡 Whenever we build a Machine Learning model, we try to make sure that the model makes as few mistakes as possible in its predictions.
How do we measure and minimize these mistakes in predictions made by the model?

To measure how wrong the predictions made by a Machine Learning model are, every model is formulated as minimizing a loss function.
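For instance, a common way to measure prediction mistakes in regression is the mean squared error. Below is a minimal Python sketch; the function name and the sample numbers are illustrative choices, not taken from this text.

    # Mean squared error: the average of squared prediction errors.
    # Smaller values mean the model makes smaller mistakes.
    import numpy as np

    def mse_loss(y_true, y_pred):
        y_true = np.asarray(y_true, dtype=float)
        y_pred = np.asarray(y_pred, dtype=float)
        return np.mean((y_true - y_pred) ** 2)

    # Illustrative values only: three targets and the model's predictions for them.
    print(mse_loss([3.0, 5.0, 7.0], [2.5, 5.0, 8.0]))  # ~0.4167

Training the model then amounts to finding the parameter values that make this quantity as small as possible.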
💡 Find the point on the line 2x + 3y = 13 that is closest to the origin.

Objective: To minimize the distance between a point (x, y) on the line 2x + 3y = 13 and the origin (0, 0).
Distance: \(d = \sqrt{(x-0)^2 + (y-0)^2}\)
Since minimizing \(d\) is equivalent to minimizing \(d^2\), we can take \(f(x,y) = x^2 + y^2\).
=> Objective function = minimize distance = \( \underset{x^*, y^*}{\mathrm{argmin}}\ f(x,y) = \underset{x^*, y^*}{\mathrm{argmin}}\ x^2+y^2\)
Constraint: Point (x,y) must be on the line 2x + 3y = 13.
=> Constraint (equality) function = \(g(x,y) = 2x + 3y - 13 = 0\)
Lagrangian function:

\[ \mathcal{L}(\lambda, x, y) = f(x,y) - \lambda(g(x,y)) \\[10pt] => \mathcal{L}(\lambda, x, y) = x^2+y^2 - \lambda(2x + 3y - 13) \]

To find the optimum solution, we solve the following unconstrained optimization problem.

\[ \underset{x^*, y^*, \lambda}{\mathrm{argmin}}\ \mathcal{L}(\lambda, x, y) = \underset{x^*, y^*, \lambda}{\mathrm{argmin}}\ x^2+y^2 - \lambda(2x + 3y - 13) \]

Take the derivative and equate it to zero.
Since it is a multi-variable function, we take the partial derivatives with respect to x, y, and \(\lambda\).

\[ \tag{1} \frac{\partial}{\partial x} \mathcal{L}(\lambda, x, y) = \frac{\partial}{\partial x} (x^2+y^2 - \lambda(2x + 3y - 13)) = 0 \\[10pt] => 2x - 2\lambda = 0 \\[10pt] => x = \lambda \]


\[ \frac{\partial}{\partial y} \mathcal{L}(\lambda, x, y) = \frac{\partial}{\partial y} (x^2+y^2 - \lambda(2x + 3y - 13)) = 0 \\[10pt] \tag{2} => 2y - 3\lambda = 0 \\[10pt] => y = \frac{3}{2} \lambda \]


\[ \frac{\partial}{\partial \lambda} \mathcal{L}(\lambda, x, y) = \frac{\partial}{\partial \lambda} (x^2+y^2 - \lambda(2x + 3y - 13)) = 0 \\[10pt] \tag{3} => -2x -3y + 13 = 0 \]


Now we have 3 variables and 3 equations, (1), (2), and (3); let’s solve them.

\[ -2x - 3y + 13 = 0 \\[10pt] => 2x + 3y = 13 \\[10pt] => 2\lambda + 3 \cdot \frac{3}{2} \lambda = 13 \\[10pt] => \lambda \left(2 + \frac{9}{2}\right) = 13 \\[10pt] => \lambda = 13 \cdot \frac{2}{13} = 2 \\[10pt] => x = \lambda = 2 \\[10pt] => y = \frac{3}{2} \lambda = \frac{3}{2} \cdot 2 = 3 \\[10pt] => x = 2, ~ y = 3 \]

Hence, (x = 2, y = 3) is the point on the line 2x + 3y = 13 that is closest to the origin.
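As a quick sanity check, the same stationarity conditions can be solved symbolically. Below is a minimal sketch assuming SymPy is available; the symbol names are illustrative.

    # Solve the three stationarity conditions of the Lagrangian symbolically.
    import sympy as sp

    x, y, lam = sp.symbols('x y lam', real=True)
    f = x**2 + y**2              # objective: squared distance to the origin
    g = 2*x + 3*y - 13           # equality constraint g(x, y) = 0
    L = f - lam * g              # Lagrangian, as defined above

    # Equations (1), (2) and (3): all partial derivatives of L set to zero.
    solution = sp.solve([sp.diff(L, v) for v in (x, y, lam)], (x, y, lam), dict=True)
    print(solution)              # [{x: 2, y: 3, lam: 2}]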




3 - Gradient Descent

Gradient Descent for Optimization
📘 Gradient Descent:
It is a first-order iterative optimization algorithm used to find a local minimum of a differentiable function.
It iteratively adjusts the parameters of the model in the direction opposite to the gradient of the cost function, since moving against the gradient leads towards a minimum.

Algorithm:

  1. Initialize the weights/parameters with random values.

  2. Calculate the gradient of the cost function at the current parameter values.

  3. Update the parameters using the gradient.

    \[ w_{new} = w_{old} - \eta \cdot \frac{\partial f}{\partial w_{old}} \\[10pt] \eta: \text{ learning rate or step size to take for each parameter update} \]
  4. Repeat steps 2 and 3 iteratively until convergence (to a minimum); a minimal code sketch follows below.
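A minimal sketch of these steps in Python. The quadratic \(f(w) = w^2 - 4w + 5\), the learning rate, and the stopping tolerance are illustrative choices, not part of the algorithm above.

    # Gradient descent on f(w) = w^2 - 4w + 5, whose gradient is f'(w) = 2w - 4.
    # The minimum is at w = 2, so the iterates should approach 2.
    def gradient(w):
        return 2 * w - 4

    w = 0.0                      # step 1: initialize with an (arbitrary) value
    eta = 0.1                    # learning rate / step size, chosen for illustration
    for _ in range(1000):        # step 4: repeat until convergence
        grad = gradient(w)       # step 2: gradient at the current value
        w = w - eta * grad       # step 3: move against the gradient
        if abs(grad) < 1e-8:     # simple convergence check
            break

    print(w)                     # ~2.0

Because each update moves against the gradient, too large a learning rate can overshoot the minimum, while too small a learning rate makes convergence slow.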








4 - Newton's Method

Newton’s Method for Optimization


📘 Newton’s Method:
It is a second-order, gradient-based iterative optimization technique known for its very fast convergence.
When close to the optimum, it achieves quadratic convergence, compared with gradient descent’s linear convergence.

Algorithm:

  1. Start at a random point \(x_k\).
  2. Compute the slope at \(x_k\), i.e., \(f'(x_k)\).
  3. Compute the curvature at \(x_k\), i.e., \(f''(x_k)\).
  4. Draw a parabola at \(x_k\) that locally approximates the function.
  5. Jump directly to the minimum of that parabola; that’s the next step. Note: instead of walking down the slope step by step (as in gradient descent), we jump straight to the bottom of the local parabola.
\[ x_{k+1} = x_k - \frac{f'(x_k)}{f''(x_k)} \\[10pt] \text{step size} = \frac{1}{f''(x_k)} \\[10pt] f''(x_k): \text{ curvature of the function at } x_k \\[10pt] x_{new} = x_{old} - \left(\nabla^2 f(x_{old})\right)^{-1} \nabla f(x_{old}) \\[10pt] \nabla^2 f(x_{old}): \text{ Hessian of } f \text{ (multivariate form)} \]




For example:

  1. Find the minimum of \(f(x) = x^2 - 4x + 5\). To find the minimum analytically, let’s calculate the first derivative and equate it to zero.
    \(f'(x) = 2x - 4 = 0 \)
    \( => x^* = 2 \)

    \(f''(x) = 2 > 0 \) => the minimum is at \(x^* = 2\)

Now, we will solve this using Newton’s Method.
Let’s start at x = 0.

\[ x_{new} = x_{old} - \frac{f'(x_{old})}{f''(x_{old})} \\[10pt] => x_{new} = 0 - \frac{2 \cdot 0 - 4}{2} = 0 - (-2) \\[10pt] => x_{new} = 2 \]

Hence, we can see that using Newton’s Method we reach the minimum \(x^* = 2\) in just one step. This is expected: for a quadratic function, the local parabola that Newton’s Method builds is the function itself, so the very first jump lands exactly on the minimum.
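A minimal Python sketch of the same update rule; the iteration cap and stopping tolerance are illustrative choices.

    # Newton's Method for f(x) = x^2 - 4x + 5, with f'(x) = 2x - 4 and f''(x) = 2.
    def f_prime(x):
        return 2 * x - 4

    def f_double_prime(x):
        return 2.0

    x = 0.0                                     # start at x = 0, as in the example
    for _ in range(10):                         # a handful of iterations is plenty
        step = f_prime(x) / f_double_prime(x)   # Newton step: f'(x) / f''(x)
        x = x - step
        if abs(step) < 1e-10:                   # stop once the update is negligible
            break

    print(x)                                    # 2.0 after the first iteration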


