Vector & Matrix Calculus

8 minute read

In this section we will understand Derivatives of Vector & Matrix, Jacobian and Hessian.

Linear Algebra for AI & ML | Full Course Videos

Vector Derivative

Let \(\mathbf{x} = \begin{bmatrix} x_1 \\ x_2 \\ \vdots \\ x_n \end{bmatrix}_{\text{n×1}}\) be a vector, i.e, \(\mathbf{x} \in \mathbb{R}^n\).

Let \(f(x)\) be a function that maps a vector to a scalar, i.e, \(f : \mathbb{R}^n \rightarrow \mathbb{R}\).
The derivative of function \(f(x)\) with respect to \(\mathbf{x}\) is defined as:

\[Gradient = \frac{\partial f(x)}{\partial x} = {f'(x)} = \nabla f(x) = \begin{bmatrix} \frac{\partial x_1}{\partial x} \\ \frac{\partial x_2}{\partial x} \\ \vdots \\ \frac{\partial x_n}{\partial x} \end{bmatrix}_{\text{n×1}} \]

Assumption: All the first order partial derivatives exist.

e.g.:

\(f(x) = a^Tx\), where, \(\mathbf{x} = \begin{bmatrix} x_1 \\ x_2 \\ \vdots \\ x_n \end{bmatrix}\), \(\quad a = \begin{bmatrix} a_1 \\ a_2 \\ \vdots \\ a_n \end{bmatrix}\), \(\quad a^T = \begin{bmatrix} a_1 & a_2 & \cdots & a_n \end{bmatrix}\)

\(=> f(x) = a^Tx = a_1x_1 + a_2x_2 + \cdots + a_nx_n\)

\(=> \frac{\partial f(x)}{\partial x} = \begin{bmatrix} \frac{\partial x_1}{\partial x} \\ \frac{\partial x_2}{\partial x} \\ \vdots \\ \frac{\partial x_n}{\partial x} \end{bmatrix} \) \(=\begin{bmatrix} a_1 \\ a_2 \\ \vdots \\ a_n \end{bmatrix} = a \)
\( => f'(x) = a \)

Now, let’s find the derivative for a bit complex, but very widely used function.
\(f(x) = \mathbf{x^TAx} \quad\), where, \(\mathbf{x} = \begin{bmatrix} x_1 \\ x_2 \\ \vdots \\ x_n \end{bmatrix}\), so, \(\quad x^T = \begin{bmatrix} x_1 & x_2 & \cdots & x_n \end{bmatrix}\),

A is a square matrix, \( \mathbf{A} = \begin{bmatrix} a_{11} & a_{12} & \cdots & a_{1k} & \cdots a_{1n} \\ a_{21} & a_{22} & \cdots & a_{2k} & \cdots a_{2n} \\ \vdots & \vdots & \ddots & \vdots & \vdots \\ a_{k1} & a_{k2} & \cdots & a_{kk} & \cdots a_{kn} \\ \vdots & \vdots & \ddots & \vdots & \vdots \\ a_{n1} & a_{n2} & \cdots & a_{nk} & \cdots a_{nn} \end{bmatrix} _{\text{n x n}} \)

So, \( \mathbf{A^T} = \begin{bmatrix} a_{11} & a_{21} & \cdots & a_{k1} & \cdots a_{n1} \\ a_{12} & a_{22} & \cdots & a_{k2} & \cdots a_{n2} \\ \vdots & \vdots & \ddots & \vdots & \vdots \\ a_{1k} & a_{2k} & \cdots & a_{kk} & \cdots a_{nk} \\ \vdots & \vdots & \ddots & \vdots & \vdots \\ a_{1n} & a_{2n} & \cdots & a_{kn} & \cdots a_{nn} \end{bmatrix} _{\text{n x n}} \)

\[f(x) = y = \mathbf{x^TAx} = \sum_{i=1}^n \sum_{j=1}^n x_i a_{ij} x_j\]\[\tag{1}\frac{\partial y}{\partial x} = \begin{bmatrix} \frac{\partial y}{\partial x_1} \\ \frac{\partial y}{\partial x_2} \\ \vdots \\ \frac{\partial y}{\partial x_k} \\ \vdots \\ \frac{\partial y}{\partial x_n} \end{bmatrix} \]

Let’s calculate \(\frac{\partial y}{\partial x_k}\), i.e, for the \(k^{th}\) element and sum it over 1 to n.
The \(k^{th}\) element \(x_k\) will appear 2 times in the below summation, i.e, when \(i=k ~and~ j=k\)

\[ y = \sum_{i=1}^n \sum_{j=1}^n x_i a_{ij} x_j \\ \text { when } i=k, \quad y = \sum_{j=1}^n x_k a_{kj} x_j \\ \text { when } j=k, \quad y = \sum_{i=1}^n x_i a_{ik} x_k \\[10pt] \frac{\partial y}{\partial x_k} = \frac{\partial }{\partial x_k} (\sum_{i=1}^n \sum_{j=1}^n x_i a_{ij} x_j) \\[10pt] => \frac{\partial y}{\partial x_k} = \sum_{j=1}^n \frac{\partial }{\partial x_k} x_k a_{kj} x_j + \sum_{i=1}^n \frac{\partial }{\partial x_k}x_i a_{ik} x_k \\ => \frac{\partial y}{\partial x_k} = \sum_{j=1}^n 1 \cdot a_{kj} x_j + \sum_{i=1}^n x_i a_{ik} \cdot 1 \\ \because \sum_{j=1}^n x_j = \sum_{i=1}^n x_i \text { we can combine both terms } \\ => \frac{\partial y}{\partial x_k} = \sum_{i=1}^n (a_{ki} + a_{ik}) x_i \\ \text{ Note, that: } \sum_{i=1}^n a_{ki} \text{ is k-th row of A, and } \sum_{i=1}^n a_{ik} \text{ is k-th row of } A^T \\[10pt] \text{ from (1) above: }\frac{\partial y}{\partial x} = \sum_{k=1}^n \frac{\partial y}{\partial x_k} \\ => \frac{\partial y}{\partial x} = \sum_{k=1}^n \sum_{i=1}^n (a_{ki} + a_{ik}) x_i \\[10pt] \therefore \frac{\partial y}{\partial x} = \frac{\partial}{\partial x}(\mathbf{x^TAx}) = \mathbf{(A + A^T) x} \\[10pt] \text{ if } \mathbf{A = A^T} \text{ then, } \\[10pt] \frac{\partial y}{\partial x} = \frac{\partial }{\partial x}(\mathbf{x^TAx}) = \mathbf{2Ax} \\ \]

Jacobian Matrix

Above, we saw the gradient of a scalar valued function, i.e, a function that maps a vector to a scalar, i.e, \(f : \mathbb{R}^n \rightarrow \mathbb{R}\).
There is another kind of function called vector valued function, i.e, a function that maps a vector to another vector, i.e, \(f : \mathbb{R}^n \rightarrow \mathbb{R}^m\).

The Jacobian is the matrix of all first-order partial derivatives of a vector-valued function, while the gradient is a vector representing the partial derivatives of a scalar-valued function.

Jacobian matrix provides the best linear approximation of a vector valued function near a given point, similar to how a derivative/gradient is the best linear approximation for a scalar valued function

Note: Gradient is a special case of the Jacobian; it is the transpose of the Jacobian for a scalar valued function.

Let, \(f(x)\) be a function that maps a vector to another vector, i.e, \(f : \mathbb{R}^n \rightarrow \mathbb{R}^m\)

\(f(x)_{m \times 1} = \mathbf{A_{m \times n}x_{n \times 1}}\), where,

\(\quad \mathbf{A} = \begin{bmatrix} a_{11} & a_{12} & \cdots & a_{1n} \\ a_{21} & a_{22} & \cdots & a_{2n} \\ \vdots & \vdots & \ddots & \vdots \\ a_{m1} & a_{m2} & \cdots & a_{mn} \end{bmatrix} _{\text{m x n}} \), \(\mathbf{x} = \begin{bmatrix} x_1 \\ x_2 \\ \vdots \\ x_n \end{bmatrix}_{\text{n x 1}}\), \(f(x) = \begin{bmatrix} f_1(x) \\ f_2(x) \\ \vdots \\ f_m(x) \end{bmatrix}_{\text{m x 1}}\)

\[\frac{\partial f(x)}{\partial x} = f'(x) = \begin{bmatrix} \frac{\partial f_1}{\partial x_1} & \frac{\partial f_1}{\partial x_2} & \cdots & \frac{\partial f_1}{\partial x_n} \\ \frac{\partial f_2}{\partial x_1} & \frac{\partial f_2}{\partial x_2} & \cdots & \frac{\partial f_2}{\partial x_n} \\ \vdots & \vdots & \ddots & \vdots \\ \frac{\partial f_m}{\partial x_1} & \frac{\partial f_m}{\partial x_2} & \cdots & \frac{\partial f_m}{\partial x_n} \end{bmatrix} _{\text{m x n}} \]

The above matrix is called Jacobian matrix of \(f(x)\).

Assumption: All the first order partial derivatives exist.

Hessian Matrix

Hessian Matrix:
It is a square matrix of second-order partial derivatives of a scalar-valued function.
This is used to characterize the curvature of a function at a give point.

Let, \(f(x)\) be a function that maps a vector to a scalar value, i.e, \(f : \mathbb{R}^n \rightarrow \mathbb{R}\)
The Hessian matrix is defined as:

\[Hessian = \frac{\partial f^2(x)}{\partial x \partial x^T } = \nabla^2 f(x) = \begin{bmatrix} \frac{\partial^2 f(x)}{\partial x_1^2} & \frac{\partial^2 f(x)}{\partial x_1 \partial x_2} & \cdots & \frac{\partial^2 f(x)}{\partial x_1 \partial x_n} \\ \frac{\partial^2 f(x)}{\partial x_2 \partial x_1} & \frac{\partial^2 f(x)}{\partial x_2^2} & \cdots & \frac{\partial^2 f(x)}{\partial x_2 \partial x_n} \\ \vdots & \vdots & \ddots & \vdots \\ \frac{\partial^2 f(x)}{\partial x_n \partial x_1} & \frac{\partial^2 f(x)}{\partial x_n \partial x_2} & \cdots & \frac{\partial^2 f(x)}{\partial x_n^2} \end{bmatrix} _{\text{n x n}} \]

Note: Most functions in Machine Learning where second-order partial derivatives are continuous, the Hessian is symmetrix.

\( H_{i,j} = H_{j,i} = \frac{\partial f^2(x)}{\partial x_i \partial x_j } = \frac{\partial f^2(x)}{\partial x_j \partial x_i } \)

e.g:

\(f(x) = \mathbf{x^TAx}\), where, A is a symmetric matrix, and f(x) is a scalar valued function.

\(Gradient = \frac{\partial }{\partial x}(\mathbf{x^TAx}) = 2\mathbf{Ax}\), since A is symmetric.

\(Hessian = \frac{\partial^2 }{\partial x^2}(\mathbf{x^TAx}) = \frac{\partial }{\partial x}2\mathbf{Ax} = 2\mathbf{A}\)
\(f(x,y) = x^2 + y^2\)

\(Gradient = \nabla f = \begin{bmatrix} \frac{\partial f}{\partial x}\\ \\ \frac{\partial f}{\partial x} \end{bmatrix} = \begin{bmatrix} 2x+0 \\ \\ 0+2y \end{bmatrix} = \begin{bmatrix} 2x \\ \\ 2y \end{bmatrix}\)

\(Hessian = \nabla^2 f = \begin{bmatrix} \frac{\partial^2 f}{\partial x^2} & \frac{\partial^2 f}{\partial x \partial y}\\ \\ \frac{\partial^2 f}{\partial y \partial x} & \frac{\partial^2 f}{\partial y^2} \end{bmatrix} = \begin{bmatrix} 2 & 0 \\ \\ 0 & 2 \end{bmatrix}\)

Matrix Derivative

Let, A is a mxn matrix, i.e \(A \in \mathbb{R}^{m \times n}\)

\(\mathbf{A} = \begin{bmatrix} a_{11} & a_{12} & \cdots & a_{1n} \\ a_{21} & a_{22} & \cdots & a_{2n} \\ \vdots & \vdots & \ddots & \vdots \\ a_{m1} & a_{m2} & \cdots & a_{mn} \end{bmatrix} _{\text{m x n}} \)

Let f(A) be a function that maps a matrix to a scalar value, i.e, \(f : \mathbb{R}^{m \times n} \rightarrow \mathbb{R}\)
, then derivative of function f(A) w.r.t A is defined as:

\[\frac{\partial f}{\partial A} = f'(A) = \begin{bmatrix} \frac{\partial f}{\partial a_{11}} & \frac{\partial f}{\partial a_{12}} & \cdots & \frac{\partial f}{\partial a_{1n}} \\ \frac{\partial f}{\partial a_{21}} & \frac{\partial f}{\partial a_{22}} & \cdots & \frac{\partial f}{\partial a_{2n}} \\ \vdots & \vdots & \ddots & \vdots \\ \frac{\partial f}{\partial a_{m1}} & \frac{\partial f}{\partial a_{m2}} & \cdots & \frac{\partial f}{\partial a_{mn}} \end{bmatrix} _{\text{m x n}} \\[20pt] => (\frac{\partial f}{\partial A})_{(i,j)} = \frac{\partial f}{\partial a_{ij}} \]

e.g.:

Let, A is a square matrix, i.e \(A \in \mathbb{R}^{n \times n}\)
and f(A) = trace(A) = \(a_{11} + a_{22} + \cdots + a_{nn}\)

We know that:
\((\frac{\partial f}{\partial A})_{(i,j)} = \frac{\partial f}{\partial a_{ij}}\)

Since, the trace only contains diagonal elements,

=> \((\frac{\partial f}{\partial A})_{(i,j)}\) = 0 for all \(i ⍯ j\)

similarly, \((\frac{\partial f}{\partial A})_{(i,i)}\) = 1 for all \(i=j\)

=> \( \frac{\partial Tr(A)}{\partial A} = \begin{cases} 1, & \text{if } i=j \\ \\ 0, & \text{if } i⍯j \end{cases} \)

\(\frac{\partial f}{\partial A} = \frac{\partial Tr(A)}{\partial A} \begin{bmatrix} 1 & 0 & \cdots & 0 \\ 0 & 1& \cdots & 0 \\ \vdots & \vdots & \ddots & \vdots \\ 0 & 0 & \cdots & 1 \end{bmatrix} = \mathbf{I} \)

Therefore, derivative of trace(A) w.r.t A is an identity matrix.

Watch Video

End of Section