Vector & Matrix Calculus
8 minute read
In this section we will understand Derivatives of Vector & Matrix, Jacobian and Hessian.
Vector Derivative
Let \(\mathbf{x} = \begin{bmatrix} x_1 \\ x_2 \\ \vdots \\ x_n \end{bmatrix}_{\text{n×1}}\) be a vector, i.e, \(\mathbf{x} \in \mathbb{R}^n\).
Let \(f(x)\) be a function that maps a vector to a scalar, i.e, \(f : \mathbb{R}^n \rightarrow \mathbb{R}\).
The derivative of function \(f(x)\) with respect to \(\mathbf{x}\) is defined as:
Assumption: All the first order partial derivatives exist.
e.g.:
- \(f(x) = a^Tx\), where,
\(\mathbf{x} = \begin{bmatrix} x_1 \\ x_2 \\ \vdots \\ x_n \end{bmatrix}\),
\(\quad a = \begin{bmatrix} a_1 \\ a_2 \\ \vdots \\ a_n \end{bmatrix}\),
\(\quad a^T = \begin{bmatrix} a_1 & a_2 & \cdots & a_n \end{bmatrix}\)
\(=> f(x) = a^Tx = a_1x_1 + a_2x_2 + \cdots + a_nx_n\)
\(=> \frac{\partial f(x)}{\partial x} = \begin{bmatrix} \frac{\partial x_1}{\partial x} \\ \frac{\partial x_2}{\partial x} \\ \vdots \\ \frac{\partial x_n}{\partial x} \end{bmatrix} \) \(=\begin{bmatrix} a_1 \\ a_2 \\ \vdots \\ a_n \end{bmatrix} = a \)
\( => f'(x) = a \)
- Now, let’s find the derivative for a bit complex, but very widely used function.
\(f(x) = \mathbf{x^TAx} \quad\), where, \(\mathbf{x} = \begin{bmatrix} x_1 \\ x_2 \\ \vdots \\ x_n \end{bmatrix}\), so, \(\quad x^T = \begin{bmatrix} x_1 & x_2 & \cdots & x_n \end{bmatrix}\),
A is a square matrix, \( \mathbf{A} = \begin{bmatrix} a_{11} & a_{12} & \cdots & a_{1k} & \cdots a_{1n} \\ a_{21} & a_{22} & \cdots & a_{2k} & \cdots a_{2n} \\ \vdots & \vdots & \ddots & \vdots & \vdots \\ a_{k1} & a_{k2} & \cdots & a_{kk} & \cdots a_{kn} \\ \vdots & \vdots & \ddots & \vdots & \vdots \\ a_{n1} & a_{n2} & \cdots & a_{nk} & \cdots a_{nn} \end{bmatrix} _{\text{n x n}} \)
So, \( \mathbf{A^T} = \begin{bmatrix} a_{11} & a_{21} & \cdots & a_{k1} & \cdots a_{n1} \\ a_{12} & a_{22} & \cdots & a_{k2} & \cdots a_{n2} \\ \vdots & \vdots & \ddots & \vdots & \vdots \\ a_{1k} & a_{2k} & \cdots & a_{kk} & \cdots a_{nk} \\ \vdots & \vdots & \ddots & \vdots & \vdots \\ a_{1n} & a_{2n} & \cdots & a_{kn} & \cdots a_{nn} \end{bmatrix} _{\text{n x n}} \)
Let’s calculate \(\frac{\partial y}{\partial x_k}\), i.e, for the \(k^{th}\) element and sum it over 1 to n.
The \(k^{th}\) element \(x_k\) will appear 2 times in the below summation, i.e, when \(i=k ~and~ j=k\)
Jacobian Matrix
Above, we saw the gradient of a scalar valued function, i.e, a function that maps a vector to a scalar, i.e,
\(f : \mathbb{R}^n \rightarrow \mathbb{R}\).
There is another kind of function called vector valued function, i.e, a function that maps a vector to another vector, i.e,
\(f : \mathbb{R}^n \rightarrow \mathbb{R}^m\).
The Jacobian is the matrix of all first-order partial derivatives of a vector-valued function,
while the gradient is a vector representing the partial derivatives of a scalar-valued function.
- Jacobian matrix provides the best linear approximation of a vector valued function near a given point,
similar to how a derivative/gradient is the best linear approximation for a scalar valued function
Note: Gradient is a special case of the Jacobian;
it is the transpose of the Jacobian for a scalar valued function.
Let, \(f(x)\) be a function that maps a vector to another vector, i.e, \(f : \mathbb{R}^n \rightarrow \mathbb{R}^m\)
\(f(x)_{m \times 1} = \mathbf{A_{m \times n}x_{n \times 1}}\), where,
\(\quad
\mathbf{A} =
\begin{bmatrix}
a_{11} & a_{12} & \cdots & a_{1n} \\
a_{21} & a_{22} & \cdots & a_{2n} \\
\vdots & \vdots & \ddots & \vdots \\
a_{m1} & a_{m2} & \cdots & a_{mn}
\end{bmatrix}
_{\text{m x n}}
\),
\(\mathbf{x} = \begin{bmatrix} x_1 \\ x_2 \\ \vdots \\ x_n \end{bmatrix}_{\text{n x 1}}\),
\(f(x) = \begin{bmatrix} f_1(x) \\ f_2(x) \\ \vdots \\ f_m(x) \end{bmatrix}_{\text{m x 1}}\)
The above matrix is called Jacobian matrix of \(f(x)\).
Assumption: All the first order partial derivatives exist.
Hessian Matrix
Hessian Matrix:
It is a square matrix of second-order partial derivatives of a scalar-valued function.
This is used to characterize the curvature of a function at a give point.
Let, \(f(x)\) be a function that maps a vector to a scalar value, i.e, \(f : \mathbb{R}^n \rightarrow \mathbb{R}\)
The Hessian matrix is defined as:
Note: Most functions in Machine Learning where second-order partial derivatives are continuous, the Hessian is symmetrix.
\( H_{i,j} = H_{j,i} = \frac{\partial f^2(x)}{\partial x_i \partial x_j } = \frac{\partial f^2(x)}{\partial x_j \partial x_i } \)
e.g:
\(f(x) = \mathbf{x^TAx}\), where, A is a symmetric matrix, and f(x) is a scalar valued function.
\(Gradient = \frac{\partial }{\partial x}(\mathbf{x^TAx}) = 2\mathbf{Ax}\), since A is symmetric.
\(Hessian = \frac{\partial^2 }{\partial x^2}(\mathbf{x^TAx}) = \frac{\partial }{\partial x}2\mathbf{Ax} = 2\mathbf{A}\)\(f(x,y) = x^2 + y^2\)
\(Gradient = \nabla f = \begin{bmatrix} \frac{\partial f}{\partial x}\\ \\ \frac{\partial f}{\partial x} \end{bmatrix} = \begin{bmatrix} 2x+0 \\ \\ 0+2y \end{bmatrix} = \begin{bmatrix} 2x \\ \\ 2y \end{bmatrix}\)
\(Hessian = \nabla^2 f = \begin{bmatrix} \frac{\partial^2 f}{\partial x^2} & \frac{\partial^2 f}{\partial x \partial y}\\ \\ \frac{\partial^2 f}{\partial y \partial x} & \frac{\partial^2 f}{\partial y^2} \end{bmatrix} = \begin{bmatrix} 2 & 0 \\ \\ 0 & 2 \end{bmatrix}\)
Matrix Derivative
Let, A is a mxn matrix, i.e \(A \in \mathbb{R}^{m \times n}\)
\(\mathbf{A} =
\begin{bmatrix}
a_{11} & a_{12} & \cdots & a_{1n} \\
a_{21} & a_{22} & \cdots & a_{2n} \\
\vdots & \vdots & \ddots & \vdots \\
a_{m1} & a_{m2} & \cdots & a_{mn}
\end{bmatrix}
_{\text{m x n}}
\)
Let f(A) be a function that maps a matrix to a scalar value, i.e, \(f : \mathbb{R}^{m \times n} \rightarrow \mathbb{R}\)
, then derivative of function f(A) w.r.t A is defined as:
e.g.:
- Let, A is a square matrix, i.e \(A \in \mathbb{R}^{n \times n}\)
and f(A) = trace(A) = \(a_{11} + a_{22} + \cdots + a_{nn}\)
We know that:
\((\frac{\partial f}{\partial A})_{(i,j)} = \frac{\partial f}{\partial a_{ij}}\)
Since, the trace only contains diagonal elements,
=> \((\frac{\partial f}{\partial A})_{(i,j)}\) = 0 for all \(i ⍯ j\)
similarly, \((\frac{\partial f}{\partial A})_{(i,i)}\) = 1 for all \(i=j\)
=> \( \frac{\partial Tr(A)}{\partial A} = \begin{cases} 1, & \text{if } i=j \\ \\ 0, & \text{if } i⍯j \end{cases} \)
\(\frac{\partial f}{\partial A} = \frac{\partial Tr(A)}{\partial A} \begin{bmatrix} 1 & 0 & \cdots & 0 \\ 0 & 1& \cdots & 0 \\ \vdots & \vdots & \ddots & \vdots \\ 0 & 0 & \cdots & 1 \end{bmatrix} = \mathbf{I} \)
Therefore, derivative of trace(A) w.r.t A is an identity matrix.
End of Section