Linear Algebra

Linear Algebra for AI & ML

In this section, we will cover the topics related to Linear Algebra for AI & ML.

Linear Algebra for AI & ML

This sheet contains all the topics that will be covered for Linear Algebra for AI & ML.

Below is the list of topics that will be covered in this section.

  • Vectors
  • Vector Space
  • Vector Addition & Multiplication
  • Orthogonal & Orthonormal Vectors
  • Linear Independence
  • Span
  • Basis
  • Matrix & Transpose
  • Matrix Multiplication
  • Square Matrix
  • Trace
  • Determinant
  • Inverse Matrix
  • Orthogonal Matrix
  • Diagonal Matrix
  • Eigenvalue & Eigenvector
  • Eigendecomposition
  • Spectral Decomposition
  • Singular Value Decomposition
  • Principal Component Analysis
  • Norms
  • Matrix Calculus
  • Equation of a Hyperplane

End of Section

1 - Vector Fundamentals

Vector Fundamentals

This section will introduce you to basic terminologies and definitions related to Vectors.


πŸ’‘ Why study Vectors?

A vector is a fundamental concept used to describe the real world: quantities that have both magnitude and direction, e.g., force, velocity, electromagnetic fields, etc.
Vectors are also used to describe the surrounding space, e.g., lines, planes, 3D space, etc.
In machine learning, vectors are used to represent data, both the input features and the output of a model.

πŸ“˜

Vector:
It is a collection of scalars (numbers) that has both magnitude and direction.
Geometrically, it is a line segment in space characterized by its length (magnitude) and direction.
By convention, we represent vectors as column vectors.
e.g.: \(\vec{x} = \mathbf{x} = \begin{bmatrix} x_1 \\ x_2 \\ \vdots \\ x_n \end{bmatrix}_{\text{nΓ—1}}\) i.e., 'n' rows and 1 column.

Important: In machine learning, we will use bold notation \(\mathbf{v}\) to represent vectors, instead of arrow notation \(\vec{v}\).

Transpose:
Swap the rows and columns, i.e., a column vector becomes a row vector after transpose.
e.g.: \(\mathbf{v} = \begin{bmatrix} v_1 \\ v_2 \\ \vdots \\ v_d \end{bmatrix}_{\text{dΓ—1}}\)

\(\mathbf{v}^\mathrm{T} = \begin{bmatrix} v_1 & v_2 & \cdots & v_d \end{bmatrix}_{\text{1Γ—d}}\)


Length of Vector:
The length (or magnitude or norm) of a vector \(\mathbf{v}\) is the distance from the origin to the point represented
by \(\mathbf{v}\) in d-dimensional space.

\(\mathbf{v} = \begin{bmatrix} v_1 \\ v_2 \\ \vdots \\ v_d \end{bmatrix}_{\text{dΓ—1}}\)

Length of vector:

\[\|\mathbf{v}\| = \sqrt{\mathbf{v} \cdot \mathbf{v}} = \sqrt{\mathbf{v}^\mathrm{T}\mathbf{v}} = \sqrt{v_1^2 + v_2^2 + \cdots + v_d^2}\]

Note: The length of a zero vector is 0.

Direction of Vector:
The direction of a vector tells us where the vector points in space, independent of its length.

Direction of vector (a unit vector):

\[\hat{\mathbf{v}} = \frac{\mathbf{v}} {\|\mathbf{v}\|} \]
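
As a quick check of the definitions above, here is a minimal NumPy sketch (illustrative only, assuming NumPy is available) that computes the length and direction of a vector:

```python
import numpy as np

v = np.array([3.0, 4.0])          # a 2-D vector

length = np.linalg.norm(v)        # sqrt(3^2 + 4^2) = 5.0
direction = v / length            # unit vector [0.6, 0.8]

print(length)                     # 5.0
print(direction)                  # [0.6 0.8]
print(np.linalg.norm(direction))  # 1.0 -> a direction vector has unit length
```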

πŸ“˜

Vector Space:
It is a collection of vectors that can be added together and scaled by numbers (scalars), such that, the results are still in the same space.
Vector space or linear space is a non-empty set of vectors equipped with 2 operations:

  1. Vector addition: for any 2 vectors \(a, b\), \(a + b\) is also in the same vector space.
  2. Scalar multiplication: for a vector \(\mathbf{v}\), \(\alpha\mathbf{v}\) is also in the same vector space; where \(\alpha\) is a scalar.
    Note: These operations must satisfy certain rules (called axioms), such as, associativity, commutativity, distributivity, existence of a zero vector, and additive inverses.

e.g.: Set of all points in 2D is a vector space.



πŸ“˜ Linear Combination:
A vector \(\mathbf{v}\) is a linear combination of vectors \(\mathbf{u}_1, \mathbf{u}_2, \cdots, \mathbf{u}_k\) if:

\(\mathbf{v} = \alpha_1 \mathbf{u}_1 + \alpha_2 \mathbf{u}_2 + \cdots + \alpha_k \mathbf{u}_k\)
where \(\alpha_1, \alpha_2, \cdots, \alpha_k\) are scalars.

πŸ“˜

Linear Independence:
A set of vectors is linearly independent if NO vector in the set can be expressed as a linear combination of the other vectors in the set.

Equivalently, the only solution of
\(\alpha_1 \mathbf{u}_1 + \alpha_2 \mathbf{u}_2 + \cdots + \alpha_k \mathbf{u}_k = 0\)
is \(\alpha_1 = \alpha_2 = \cdots = \alpha_k = 0\).

e.g.:

  1. The below 3 vectors are linearly independent.
    \(\mathbf{u} = \begin{bmatrix} 1 \\ 1 \\ 1 \\ \end{bmatrix}_{\text{3Γ—1}}\), \(\mathbf{v} = \begin{bmatrix} 1 \\ 2 \\ 3 \\ \end{bmatrix}_{\text{3Γ—1}}\), \(\mathbf{w} = \begin{bmatrix} 1 \\ 3 \\ 6 \\ \end{bmatrix}_{\text{3Γ—1}}\)

  2. The below 3 vectors are linearly dependent.
    \(\mathbf{u} = \begin{bmatrix} 1 \\ 1 \\ 1 \\ \end{bmatrix}_{\text{3Γ—1}}\), \(\mathbf{v} = \begin{bmatrix} 1 \\ 2 \\ 3 \\ \end{bmatrix}_{\text{3Γ—1}}\), \(\mathbf{w} = \begin{bmatrix} 2 \\ 4 \\ 6 \\ \end{bmatrix}_{\text{3Γ—1}}\)

    because, \(\mathbf{w} = 2\mathbf{v}\), and we have a non-zero solution for the below equation:
    \(\alpha_1 \mathbf{u} + \alpha_2 \mathbf{v} + \alpha_3 \mathbf{w} = 0\);
    \(\alpha_1 = 0, \alpha_2 = -2, \alpha_3 = 1\) is a valid non-zero solution.

Note:

  1. A common method to check linear independence is to arrange the column vectors in a matrix and calculate its determinant; if the determinant \(\neq 0\), then the vectors are linearly independent (a quick check is sketched after this list).
  2. If the number of vectors > the number of dimensions, then the vectors are linearly dependent,
    since the \((n+1)^{th}\) vector can be expressed as a linear combination of the other 'n' vectors in n-dimensional space.
  3. In machine learning, if a feature can be expressed in terms of other features, then it is linearly dependent,
    and the feature is NOT bringing any new information.
    e.g.: In 2 dimensions, any 3 vectors \(\mathbf{x_1}, \mathbf{x_2}, \mathbf{x_3} \) are linearly dependent.
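
As referenced in note 1, here is a minimal NumPy sketch (an illustration, not part of the original notes) that checks the two example sets above by stacking the vectors as columns and inspecting the determinant or, more generally, the rank:

```python
import numpy as np

# Example 1: u, v, w from above -- expected to be independent
A = np.column_stack(([1, 1, 1], [1, 2, 3], [1, 3, 6]))
print(np.linalg.det(A))          # non-zero -> linearly independent
print(np.linalg.matrix_rank(A))  # 3

# Example 2: w = 2v -- expected to be dependent
B = np.column_stack(([1, 1, 1], [1, 2, 3], [2, 4, 6]))
print(np.linalg.det(B))          # ~0       -> linearly dependent
print(np.linalg.matrix_rank(B))  # 2
```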


πŸ“˜

Span:
The span of a set of vectors is the geometric object formed by all possible linear combinations of those vectors, such as a line, a plane, or a higher dimensional volume.
e.g.:

  1. Span of a single vector (1,0) is the entire X-axis.
  2. Span of 2 vectors (1,0) and (0,1) is the entire X-Y (2D) plane, as any vector in 2D plane can be expressed as a linear combination of the 2 vectors - (1,0) and (0,1).

πŸ“˜

Basis:
It is the minimal set of linearly independent vectors that spans or defines the entire vector space, providing a unique co-ordinate system for every vector within the space.
Every vector in the vector space can be represented as a unique linear combination of the basis vectors.
e.g.:

  1. X-axis (1,0) and Y-axis (0,1) are the standard basis vectors of 2D space and form the usual co-ordinate system.
  2. \(\mathbf{u} = (3,1)\) and \(\mathbf{v} = (-1, 2) \) also form a basis, giving a skewed or parallelogram co-ordinate system.

Note: Number of basis vectors = Dimension of the vector space



πŸ“˜

Orthogonal Vectors:
Two vectors are orthogonal if their dot product is 0.
A set of vectors \(\mathbf{x_1}, \mathbf{x_2}, \cdots ,\mathbf{x_n} \) is said to be orthogonal if:
\(\mathbf{x_i} \cdot \mathbf{x_j} = 0 \quad \forall ~ i \neq j\), i.e., every pair must be orthogonal.
e.g.:

  1. \(\mathbf{u} = (1,0)\) and \(\mathbf{v} = (0,1) \) are orthogonal vectors.
  2. \(\mathbf{u} = (1,0)\) and \(\mathbf{v} = (1,1) \) are NOT orthogonal vectors.

Note:

  1. Orthogonal vectors are linearly independent, but the converse may NOT be true,
    i.e., linear independence does NOT imply that vectors are orthogonal. e.g.:
    Vectors \(\mathbf{u} = (2,0)\) and \(\mathbf{v} = (1,3) \) are linearly independent but NOT orthogonal,
    since \(\mathbf{u} \cdot \mathbf{v} = 2 \times 1 + 0 \times 3 = 2 \neq 0\).

  2. Orthogonality is the most extreme case of linear independence, i.e., the vectors are \(90^{\circ}\) apart or perpendicular.


πŸ“˜

Orthonormal Vectors:
Orthonormal vectors are vectors that are orthogonal and have unit length.
A set of vectors \(\mathbf{x_1}, \mathbf{x_2}, \cdots ,\mathbf{x_n} \) is said to be orthonormal if:
\(\mathbf{x_i} \cdot \mathbf{x_j} = 0 \quad \forall ~ i \neq j\), and \(\|\mathbf{x_i}\| = 1\) for every \(i\), i.e., each vector is a unit vector.

e.g.:

  1. \(\mathbf{u} = (1,0)\) and \(\mathbf{v} = (0,1) \) are orthonormal vectors.
  2. \(\mathbf{u} = (1,0)\) and \(\mathbf{v} = (0,2) \) are NOT orthonormal vectors.

πŸ“˜

Orthonormal Basis:
It is a set of vectors that functions as a basis for a space while also being orthonormal,
meaning each vector is a unit vector (has a length of 1) and all vectors are mutually perpendicular (orthogonal) to each other.
A set of vectors \(\mathbf{x_1}, \mathbf{x_2}, \cdots ,\mathbf{x_n} \) is said to be an orthonormal basis of a vector space \(\mathbb{R}^{n \times 1}\) if every vector:

\[ \mathbf{y} = \sum_{k=1}^n \alpha_k \mathbf{x_k}, \quad \forall ~ \mathbf{y} \in \mathbb{R}^{n \times 1} \\ \text {and } \quad \mathbf{x_1}, \mathbf{x_2}, \cdots ,\mathbf{x_n} \text { are orthonormal vectors} \]

e.g.:

  1. \(\mathbf{u} = (1,0)\) and \(\mathbf{v} = (0,1) \) form an orthonormal basis for 2-D space.
  2. \(\mathbf{u} = (\frac{1}{\sqrt{2}},\frac{1}{\sqrt{2}})\) and \(\mathbf{v} = (-\frac{1}{\sqrt{2}},\frac{1}{\sqrt{2}}) \) also form an orthonormal basis for 2-D space.

Note: An n-dimensional space has infinitely many orthonormal bases (any rotation of the standard basis gives another one), but every orthonormal basis contains exactly \(n\) vectors.
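
To make example 2 concrete, here is a small NumPy sketch (illustrative only) that verifies orthonormality by checking that \(\mathbf{Q}^\mathrm{T}\mathbf{Q} = \mathbf{I}\), where the basis vectors are stacked as the columns of \(\mathbf{Q}\):

```python
import numpy as np

s = 1 / np.sqrt(2)
Q = np.column_stack(([s, s], [-s, s]))  # the rotated basis from example 2

# For an orthonormal set, the Gram matrix Q^T Q equals the identity
print(np.allclose(Q.T @ Q, np.eye(2)))  # True
```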




End of Section

2 - Matrix Operations

Matrix Operations

In this section we will understand various Matrix operations, such as addition, multiplication, determinant, inverse, etc.


πŸ’‘ Why study Matrices?

Matrices let us store, manipulate, and transform data efficiently.
e.g.:

  1. Represent a system of linear equations AX = B.
  2. Data representation, such as images, which are stored as a matrix of pixels.
  3. When multiplied with a vector, a matrix linearly transforms it, i.e., it can change the vector's direction and magnitude, making matrices useful for image rotation, scaling, etc.
πŸ“˜

Matrix
It is a two-dimensional array of numbers with a fixed number of rows(m) and columns(n).
e.g:
\( \mathbf{A} = \begin{bmatrix} a_{11} & a_{12} & \cdots & a_{1n} \\ a_{21} & a_{22} & \cdots & a_{2n} \\ \vdots & \vdots & \ddots & \vdots \\ a_{m1} & a_{m2} & \cdots & a_{mn} \end{bmatrix} _{\text{m x n}} \)

Transpose:
Swapping rows and columns of a matrix.
\( \mathbf{A}^T = \begin{bmatrix} a_{11} & a_{21} & \cdots & a_{m1} \\ a_{12} & a_{22} & \cdots & a_{m2} \\ \vdots & \vdots & \ddots & \vdots \\ a_{1n} & a_{2n} & \cdots & a_{mn} \end{bmatrix} _{\text{n x m}} \)

Important: \( (AB)^T = B^TA^T \)

Rank:
Rank of a matrix is the number of linearly independent rows or columns of the matrix.
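
As an illustration of the definitions above (a NumPy sketch, not part of the original notes), the transpose identity \( (AB)^T = B^TA^T \) and the rank can be checked numerically:

```python
import numpy as np

A = np.array([[1, 2], [3, 4], [5, 6]])      # 3 x 2
B = np.array([[1, 0, 2], [0, 1, 3]])        # 2 x 3

# (AB)^T == B^T A^T
print(np.allclose((A @ B).T, B.T @ A.T))    # True

# Rank: number of linearly independent rows/columns
print(np.linalg.matrix_rank(A))             # 2
```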

πŸ“˜

Orthogonal Matrix:
It is a square matrix whose rows and columns are orthonormal vectors, i.e., they are perpendicular to each other and have unit length.

\( \mathbf{A} \mathbf{A}^\mathrm{T} = \mathbf{A}^\mathrm{T} \mathbf{A} = \mathbf{A} \mathbf{A}^{-1} = \mathbf{I} \)
=> \( \mathbf{A}^\mathrm{T} = \mathbf{A}^{-1} \)

Note:

  1. Orthogonal matrix preserves the length and angles of vectors, acting as rotation or reflection in geometry.
  2. The determinant of an orthogonal matrix is always +1 or -1, because \( \mathbf{A} \mathbf{A}^\mathrm{T} = \mathbf{I} \) implies \( \det(\mathbf{A})^2 = \det(\mathbf{A})\det(\mathbf{A}^\mathrm{T}) = \det(\mathbf{I}) = 1 \).

Let's check out why the rows and columns of an orthogonal matrix are orthonormal.
\( \mathbf{A} = \begin{bmatrix} a_{11} & a_{12} \\ \\ a_{21} & a_{22} \end{bmatrix} _{\text{2 x 2}} \), => \( \mathbf{A}^\mathrm{T} = \begin{bmatrix} a_{11} & a_{21} \\ \\ a_{12} & a_{22} \end{bmatrix} _{\text{2 x 2}} \)

Since, \( \mathbf{A} \mathbf{A}^\mathrm{T} = \mathbf{I} \)
\( \begin{bmatrix} a_{11} & a_{12} \\ \\ a_{21} & a_{22} \end{bmatrix} \begin{bmatrix} a_{11} & a_{21} \\ \\ a_{12} & a_{22} \end{bmatrix} = \begin{bmatrix} a_{11}^2 + a_{12}^2 & a_{11}a_{21} + a_{12}a_{22} \\ \\ a_{21}a_{11} + a_{22}a_{12} & a_{21}^2 + a_{22}^2 \end{bmatrix} = \begin{bmatrix} 1 & 0 \\ \\ 0 & 1 \end{bmatrix} \)

Equating the terms, we get:
\( a_{11}^2 + a_{12}^2 = 1 \) => row 1 is a unit vector
\( a_{21}^2 + a_{22}^2 = 1 \) => row 2 is a unit vector
\( a_{11}a_{21} + a_{12}a_{22} = 0 \) => row 1 and row 2 are orthogonal to each other, since their dot product = 0
\( a_{21}a_{11} + a_{22}a_{12} = 0 \) => the same condition, confirming the two rows are orthogonal

Therefore, the rows of \( \mathbf{A} \) are orthonormal; applying the same argument to \( \mathbf{A}^\mathrm{T}\mathbf{A} = \mathbf{I} \) shows that the columns are orthonormal too.


πŸ’‘ Why does multiplying a vector by an orthogonal matrix NOT change its length?

Let \(\mathbf{Q}\) be an orthogonal matrix and \(\mathbf{v}\) a vector.
Let's calculate the squared length of \( \mathbf{Q} \mathbf{v} \):

\[ \|\mathbf{Qv}\|^2 = (\mathbf{Qv})^T\mathbf{Qv} = \mathbf{v}^T\mathbf{Q}^T\mathbf{Qv} = \mathbf{v}^T\mathbf{v} = \|\mathbf{v}\|^2, \quad \text{ since } \mathbf{Q}^T\mathbf{Q} = \mathbf{I} \text{ because Q is orthogonal} \]

Therefore, linear transformation of a vector by an orthogonal matrix does NOT change its length.
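
A quick numerical illustration of this property (a NumPy sketch, under the assumption that \(\mathbf{Q}\) is a 2-D rotation matrix, which is orthogonal):

```python
import numpy as np

theta = np.deg2rad(30)
Q = np.array([[np.cos(theta), -np.sin(theta)],
              [np.sin(theta),  np.cos(theta)]])   # rotation matrix (orthogonal)

v = np.array([3.0, 4.0])

print(np.allclose(Q.T @ Q, np.eye(2)))            # True: Q^T Q = I
print(np.linalg.norm(v), np.linalg.norm(Q @ v))   # both 5.0: length preserved
```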


πŸ’‘ Solve the following system of equations:
\( 2x + y = 5 \)
\( x + 2y = 4 \)

Let's solve the system of equations in matrix form:
\(\begin{bmatrix} 2 & 1 \\ \\ 1 & 2 \end{bmatrix} \) \( \begin{bmatrix} x \\ \\ y \end{bmatrix} = \begin{bmatrix} 5 \\ \\ 4 \end{bmatrix} \)

The above equation can be written as:
\(\mathbf{AX} = \mathbf{B} \)
=> \( \mathbf{X} = \mathbf{A}^{-1} \mathbf{B} \)

\( \mathbf{A} = \begin{bmatrix} 2 & 1 \\ \\ 1 & 2 \end{bmatrix}, \mathbf{B} = \begin{bmatrix} 5 \\ \\ 4 \end{bmatrix}, \mathbf{X} = \begin{bmatrix} x \\ \\ y \end{bmatrix} \)

Let's compute the inverse of matrix \(\mathbf{A}\):
\( \mathbf{A}^{-1} = \frac{1}{3}\begin{bmatrix} 2 & -1 \\ \\ -1 & 2 \end{bmatrix} \)

Since, \( \mathbf{X} = \mathbf{A}^{-1} \mathbf{B} \)
=> \( \mathbf{X} = \begin{bmatrix} x \\ \\ y \end{bmatrix} = \frac{1}{3}\begin{bmatrix} 2 & -1 \\ \\ -1 & 2 \end{bmatrix} \begin{bmatrix} 5 \\ \\ 4 \end{bmatrix} = \frac{1}{3}\begin{bmatrix} 6 \\ \\ 3 \end{bmatrix} = \begin{bmatrix} 2 \\ \\ 1 \end{bmatrix} \)

Therefore, \( x = 2 \) and \( y = 1 \).
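
The same system can be solved numerically; a minimal NumPy sketch (illustrative, not from the original notes):

```python
import numpy as np

A = np.array([[2, 1], [1, 2]])
B = np.array([5, 4])

X = np.linalg.solve(A, B)        # preferred over explicitly inverting A
print(X)                         # [2. 1.]  ->  x = 2, y = 1

print(np.linalg.inv(A) @ B)      # same result via A^{-1} B
```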




End of Section

3 - Eigen Value Decomposition

Eigen Values, Eigen Vectors, & Eigen Value Decomposition

In this section we will understand Eigen Values, Eigen Vectors, & Eigen Value Decomposition.


πŸ’‘ What is the meaning of the word "Eigen"?

Eigen is a German word that means "characteristic" or "proper".
It tells us about the characteristic properties of a matrix.

πŸ“˜ Linear Transformation
A linear transformation defined by a matrix, denoted as \(T(x)=A\mathbf{x}\), is a function that maps a vector \(\mathbf{x}\) to a new vector by multiplying it by a matrix \(A\).
Multiplying a vector by a matrix can change the direction or magnitude (or both) of the vector.
For example:
\( \mathbf{A} = \begin{bmatrix} 2 & 1 \\ \\ 1 & 2 \end{bmatrix} \), \(\mathbf{u} = \begin{bmatrix} 0 \\ \\ 1 \\ \end{bmatrix}\), \(\mathbf{v} = \begin{bmatrix} 1 \\ \\ 1 \\ \end{bmatrix}\)

\(\mathbf{Au} = \begin{bmatrix} 1 \\ \\ 2 \\ \end{bmatrix}\) , \(\quad\) \(\mathbf{Av} = \begin{bmatrix} 3 \\ \\ 3 \\ \end{bmatrix}\)

Notice that \(\mathbf{Au}\) points in a new direction, while \(\mathbf{Av} = 3\mathbf{v}\) keeps the direction of \(\mathbf{v}\); such special vectors are exactly the eigen vectors defined next.


πŸ“˜ Eigen Vector
A special non-zero vector whose direction remains unchanged after the transformation by a matrix is applied.
It might get scaled up or down but does not change its direction.
The result of the linear transformation, i.e., multiplying the vector by the matrix, is just a scalar multiple of the original vector: \(\mathbf{A}\mathbf{v} = \lambda\mathbf{v}\).

πŸ“˜ Eigen Value (\(\lambda\))
It is the scaling factor of the eigen vector, i.e., the scalar \(\lambda\) by which the vector is scaled when it is multiplied by the matrix.
\(|\lambda| > 1 \): Vector stretched
\(0 < |\lambda| < 1 \): Vector shrunk
\(|\lambda| = 1 \): Same size
\(\lambda < 0 \): Vector's direction is reversed

πŸ’‘ What are the eigen values and vectors of an identity matrix?

Eigen equation for the identity matrix:
\(\mathbf{Iv} = \lambda \mathbf{v}\) => \(\mathbf{v} = \lambda \mathbf{v}\) => \(\lambda = 1\)
Therefore, the identity matrix has only one eigen value \(\lambda = 1\), and every non-zero vector is an eigen vector.

πŸ’‘ Are the eigen values of a real matrix always real?

No, eigen values can be complex; for a real matrix, complex eigen values always occur in conjugate pairs. e.g.:
\(\mathbf{A} = \begin{bmatrix} 0 & 1 \\ \\ -1 & 0 \end{bmatrix} \), \( \quad det(\mathbf{A} - \lambda \mathbf{I}) = 0 \quad \) => det \(\begin{bmatrix} 0-\lambda & 1 \\ \\ -1 & 0-\lambda \end{bmatrix} = 0 \)

=> \(\lambda^2 + 1 = 0\) => \(\lambda = \pm i\)
So, eigen values are complex.

πŸ’‘ What are the eigen values of a diagonal matrix?

The eigen values of a diagonal matrix are the diagonal elements themselves.
e.g.:
\( \mathbf{D} = \begin{bmatrix} d_{11} & 0 & \cdots & 0 \\ 0 & d_{22} & \cdots & 0 \\ \vdots & \vdots & \ddots & \vdots \\ 0 & 0 & \cdots & d_{nn} \end{bmatrix} _{\text{n x n}} \), \( \quad det(\mathbf{D} - \lambda \mathbf{I}) = 0 \quad \) => det \( \begin{bmatrix} d_{11}-\lambda & 0 & \cdots & 0 \\ 0 & d_{22}-\lambda & \cdots & 0 \\ \vdots & \vdots & \ddots & \vdots \\ 0 & 0 & \cdots & d_{nn}-\lambda \end{bmatrix} _{\text{n x n}} = 0 \)

=> \((d_{11}-\lambda)(d_{22}-\lambda) \cdots (d_{nn}-\lambda) = 0\)
=> \(\lambda = d_{11}, d_{22}, \cdots, d_{nn}\)
So, eigen values are the diagonal elements of the matrix.

πŸ’‘ How will we calculate the 2nd power of a matrix, i.e., \(\mathbf{A}^2\)?

Let's calculate the 2nd power of a square matrix.

e.g.:
\(\mathbf{A} = \begin{bmatrix} 2 & 1 \\ \\ 1 & 2 \end{bmatrix} \), \(\quad \mathbf{A}^2 = \begin{bmatrix} 2 & 1 \\ \\ 1 & 2 \end{bmatrix} \begin{bmatrix} 2 & 1 \\ \\ 1 & 2 \end{bmatrix} = \begin{bmatrix} 5 & 4 \\ \\ 4 & 5 \end{bmatrix} \)


πŸ’‘ Now, how will we calculate higher powers of a matrix, i.e., \(\mathbf{A}^k\)?

If we follow the above method, then we will have to multiply the matrix \(\mathbf{A}\), \(k\) times, which will be very time consuming and cumbersome.
So, we need to find an easier way to calculate the power of a matrix.

πŸ’‘ How will we calculate the power of a diagonal matrix?

Let's calculate the 2nd power of a diagonal matrix.

e.g.:
\(\mathbf{A} = \begin{bmatrix} 3 & 0 \\ \\ 0 & 2 \end{bmatrix} \), \(\quad \mathbf{A}^2 = \begin{bmatrix} 3 & 0 \\ \\ 0 & 2 \end{bmatrix} \begin{bmatrix} 3 & 0 \\ \\ 0 & 2 \end{bmatrix} = \begin{bmatrix} 9 & 0 \\ \\ 0 & 4 \end{bmatrix} \)

Note that when we square a diagonal matrix, each diagonal element gets squared.
Similarly, if we want to calculate the kth power of a diagonal matrix, all we need to do is compute the kth power of each diagonal element, instead of performing repeated matrix multiplications.
\(\quad \mathbf{A}^k = \begin{bmatrix} 3^k & 0 \\ \\ 0 & 2^k \end{bmatrix} \)

Therefore, if we can diagonalize a square matrix as \(\mathbf{A} = \mathbf{V} \mathbf{\Lambda} \mathbf{V}^{-1}\), then \(\mathbf{A}^k = \mathbf{V} \mathbf{\Lambda}^k \mathbf{V}^{-1}\) (the inner \(\mathbf{V}^{-1}\mathbf{V}\) factors cancel), and computing powers of the matrix becomes very easy.
Next, let's see how to diagonalize a matrix.

For example:
Let's revisit the example given above:
\(\mathbf{A} = \begin{bmatrix} 2 & 1 \\ \\ 1 & 2 \end{bmatrix} \), \(\quad \lambda_1 = 3\) and \( \lambda_2 = 1\), \(\mathbf{v_1} = \begin{bmatrix} 1 \\ \\ 1 \\ \end{bmatrix}\), \(\mathbf{v_2} = \begin{bmatrix} 1 \\ \\ -1 \\ \end{bmatrix}\)

=> \( \mathbf{V} = \begin{bmatrix} 1 & 1 \\ \\ 1 & -1 \\ \end{bmatrix} \), \(\quad \mathbf{\Lambda} = \begin{bmatrix} 3 & 0 \\ \\ 0 & 1 \end{bmatrix} \)

\( \because \mathbf{A} = \mathbf{V} \mathbf{\Lambda} \mathbf{V}^{-1}\) (the eigendecomposition of \(\mathbf{A}\), with the eigen vectors as the columns of \(\mathbf{V}\) and the eigen values on the diagonal of \(\mathbf{\Lambda}\))
We know \( \mathbf{V} \) and \( \mathbf{\Lambda} \); we need to calculate \(\mathbf{V}^{-1}\).

\(\mathbf{V}^{-1} = \frac{1}{2}\begin{bmatrix} 1 & 1 \\ \\ 1 & -1 \\ \end{bmatrix} \)

\( \therefore \mathbf{V} \mathbf{\Lambda} \mathbf{V}^{-1} = \begin{bmatrix} 1 & 1 \\ \\ 1 & -1 \\ \end{bmatrix} \begin{bmatrix} 3 & 0 \\ \\ 0 & 1 \end{bmatrix} \frac{1}{2}\begin{bmatrix} 1 & 1 \\ \\ 1 & -1 \\ \end{bmatrix} \)

\( = \frac{1}{2} \begin{bmatrix} 3 & 1 \\ \\ 3 & -1 \\ \end{bmatrix} \begin{bmatrix} 1 & 1 \\ \\ 1 & -1 \\ \end{bmatrix} = \frac{1}{2}\begin{bmatrix} 4 & 2 \\ \\ 2 & 4 \\ \end{bmatrix} \)

\( = \begin{bmatrix} 2 & 1 \\ \\ 1 & 2 \end{bmatrix} = \mathbf{A} \)
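
The diagonalization above, and the resulting shortcut for matrix powers, can be reproduced numerically; a minimal NumPy sketch (illustrative only):

```python
import numpy as np

A = np.array([[2.0, 1.0], [1.0, 2.0]])

# Eigendecomposition: A = V @ diag(lam) @ inv(V)
lam, V = np.linalg.eig(A)                            # lam ~ [3, 1] (order may differ)
Lam = np.diag(lam)

print(np.allclose(V @ Lam @ np.linalg.inv(V), A))    # True: A is recovered

# Matrix power via the eigendecomposition: A^k = V @ Lam^k @ inv(V)
k = 5
A_k = V @ np.diag(lam**k) @ np.linalg.inv(V)
print(np.allclose(A_k, np.linalg.matrix_power(A, k)))  # True
```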



End of Section

4 - Principal Component Analysis

Principal Component Analysis

In this section we will understand Principal Component Analysis.


πŸ’‘ In the diagram below, if we need to reduce the dimensionality of the data to 1, which feature should be dropped?

Whenever we want to reduce the dimensionality of the data, we should aim to minimize information loss.
Since information = variance, we should drop the feature that brings the least information, i.e., has the least variance.
Therefore, drop feature 1.

πŸ’‘ What if the variance in both directions is the same?
What should be done in this case? Check the diagram below.
Here, since the variance in both directions is approximately the same, in order to capture maximum variance in the data,
we rotate the f1-axis in the direction of maximum spread/variance of the data, i.e., the f1'-axis, and then we can drop the f2'-axis, which is perpendicular to the f1'-axis.

πŸ“˜

Principal Component Analysis (PCA)
It is a dimensionality reduction technique that finds the direction of maximum variance in the data.
Note: Some loss of information will always be there in dimensionality reduction, because there will be some variability in data along the direction that is dropped, and that will be lost.

Goal:
Fundamental goal of PCA is to find the new set of orthogonal axes, called the principal components, onto which the data can be projected, such that, the variance of the projected data is maximum.

Say, we have data, \(D:X \in \mathbb{R}^{n \times d}\),
n is the number of samples
d is the number of features or dimensions of each data point.
In order to find the directions of maximum variance in data, we will use the covariance matrix of data.
Covariance matrix (C) summarizes the spread and relationship of the data in the original d-dimensional space.
\(C_{d \times d} = \frac{1}{n-1}X^TX \), where \(X\) is the (mean-centered) data matrix.
Note: The (n-1) in the denominator gives an unbiased estimate of the covariance matrix (Bessel's correction).
\(C_{ii}\) is the variance of the \(i^{th}\) feature.
\(C_{ij}\) is the co-variance between feature \(i\) and feature \(j\).
Trace(C) = Sum of diagonal elements of C = Total variance of data.

Algorithm:

  1. Data is first mean centered, i.e., the mean is subtracted from each data point so that every feature has mean = 0.
    \(X = X - \mu\)

  2. Compute the covariance matrix with the mean centered data.
    \(C = \frac{1}{n-1}X^TX = \begin{bmatrix} var(f_1) & cov(f_1f_2) & \cdots & cov(f_1f_d) \\ cov(f_2f_1) & var(f_2) & \cdots & cov(f_2f_d) \\ \vdots & \vdots & \ddots & \vdots \\ cov(f_df_1) & cov(f_df_2) & \cdots & var(f_d) \end{bmatrix} _{\text{d x d}} \)

  3. Perform the eigen value decomposition of the covariance matrix.
    \( C = Q \Lambda Q^T \)
    \(Q\): Orthogonal matrix of eigen vectors of the covariance matrix.
    Its columns are the new rotated axes or principal components of the data.
    \(\Lambda\): Diagonal matrix of eigen values of the covariance matrix.
    Each eigen value gives the variance along the corresponding new eigen basis direction.
    Note: Eigen values are sorted in descending order, i.e., \( \lambda_1 \geq \lambda_2 \geq \cdots \geq \lambda_d \).

  4. Project the data onto the new axes or principal components/directions (see the sketch after this list).
    \(X_{new} = Z = XQ_k\)
    \(Q_k\): the first \(k\) columns of \(Q\), where k < d is the reduced dimensionality.
    \(X_{new}\): Projected data, also called the principal component scores.
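
A minimal NumPy sketch of the algorithm above (illustrative only; the variable names and random data are my own):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))            # n = 100 samples, d = 3 features

# 1. Mean center
X = X - X.mean(axis=0)

# 2. Covariance matrix (d x d)
C = (X.T @ X) / (X.shape[0] - 1)

# 3. Eigen value decomposition; sort eigen values in descending order
eigvals, Q = np.linalg.eigh(C)           # eigh: C is symmetric
order = np.argsort(eigvals)[::-1]
eigvals, Q = eigvals[order], Q[:, order]

# 4. Project onto the top k principal components
k = 2
Z = X @ Q[:, :k]                         # principal component scores (n x k)

print(eigvals)                           # variance captured by each component
print(Z.shape)                           # (100, 2)
```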


πŸ’‘ What is the variance of the projected data?

The variance of the projected data is given by the eigen values of the covariance matrix.
Covariance of the projected data = \( \frac{1}{n-1}(XQ)^TXQ \)

\[ \begin{aligned} \frac{1}{n-1}(XQ)^TXQ &= Q^T \Big( \frac{1}{n-1} X^TX \Big) Q = Q^T C Q, \quad \text{ since } (AB)^T = B^TA^T \\ &= Q^TQ \Lambda Q^TQ, \quad \text{ since } C = Q \Lambda Q^T \\ &= \Lambda, \quad \text{ since Q is orthogonal, } Q^TQ = I \\ \end{aligned} \]

Therefore, the diagonal matrix \( \Lambda \) captures the variance along every principal component direction.




End of Section

5 - Singular Value Decomposition

Singular Value Decomposition

In this section we will understand Singular Value Decomposition.


πŸ“˜

Singular Value Decomposition (SVD):
It decomposes any matrix into a rotation, a scaling (based on singular values), and another rotation.
It is a generalization of the eigen value decomposition (which applies to square matrices) to rectangular matrices.
Any rectangular matrix can be decomposed into a product of three matrices using SVD, as follows:

\[\mathbf{A}_{m \times n} = \mathbf{U}_{m \times m} \mathbf{\Sigma}_{m \times n} \mathbf{V}^T_{n \times n} \]


\(\mathbf{U}\): Set of orthonormal eigen vectors of \(\mathbf{AA^T}_{m \times m} \)
\(\mathbf{V}\): Set of orthonormal eigen vectors of \(\mathbf{A^TA}_{n \times n} \)
\(\mathbf{\Sigma}\): Rectangular diagonal matrix, whose diagonal values are called singular values; square root of non-zero eigen values of \(\mathbf{AA^T}\).

Note: The number of non-zero diagonal entries in \(\mathbf{\Sigma}\) = rank of matrix \(\mathbf{A}\).
Rank (r): Number of linearly independent rows or columns of a matrix.

\( \Sigma = \begin{bmatrix} \sigma_{11} & 0 & \cdots & 0 \\ \\ 0 & \sigma_{22} & \cdots & 0 \\ \vdots & \vdots & \ddots & \vdots \\ 0 & 0 & \cdots & \sigma_{rr} \\ 0 & 0 & \cdots & 0 \end{bmatrix} \), such that \(\sigma_{11} \geq \sigma_{22} \geq \cdots \geq \sigma_{rr} > 0\), where r = rank of \(\mathbf{A}\).

The singular value decomposition thus refers to the set of scale factors in \(\mathbf{\Sigma}\), which are fundamentally linked to the matrix's singularity and rank.



πŸ’‘ Suppose a satellite takes pictures of objects in space and sends them to earth. Size of each picture = 1000x1000 pixels.
How can we compress the image size to save satellite bandwidth?

We can perform singular value decomposition on the image matrix and find the minimum number of top-ranked terms required
to successfully reconstruct a good approximation of the original image.
Say we performed the SVD on the image matrix and found that the top 20 singular values, out of 1000, are sufficient to tell what the picture is about.

\(\mathbf{A} = \mathbf{U} \mathbf{\Sigma} \mathbf{V}^T \) \( = u_1 \sigma_1 v_1^T + u_2 \sigma_2 v_2^T + \cdots + u_r \sigma_r v_r^T \), where \(r = 20\)
A = sum of 'r = 20' matrices of rank 1.
Now, we only need to send the \(u_i, \sigma_i , v_i\) values for \(i = 1, \cdots, 20\), i.e., the top 20 rank-1 terms, to earth,
and then do the calculation there to reconstruct the approximation of the original image.
Each \(\mathbf{u}_i\) and \(\mathbf{v}_i\) is a 1000-dimensional column vector: \(\mathbf{u}_i = \begin{bmatrix} u_1 \\ u_2 \\ \vdots \\ u_{1000} \end{bmatrix}\), \(\mathbf{v}_i = \begin{bmatrix} v_1 \\ v_2 \\ \vdots \\ v_{1000} \end{bmatrix}\)

So, we need to send data corresponding to 20 (rank = 1) terms, i.e., \(u_i, v_i ~and~ \sigma_i\): \((1000 + 1000 + 1) \times 20 \approx 2000 \times 20 = 40000\) values (approx.)

Therefore, compression ratio \(\approx (2000 \times 20)/10^6 = 1/25\)



πŸ“˜

Low Rank Approximation:
The process of approximating any matrix by a matrix of a lower rank, using singular value decomposition.
It is used for data compression.

Any matrix A of rank 'r' can be written as a sum of rank-1 outer products:

\[ \mathbf{A} = \mathbf{U} \mathbf{\Sigma} \mathbf{V}^T = \sum_{i=1}^{r} \sigma_i \mathbf{u}_i \mathbf{v}_i^T \]

\( \mathbf{u}_i : i^{th}\) column vector of \(\mathbf{U}\)
\( \mathbf{v}_i : i^{th}\) column vector of \(\mathbf{V}\) (so \(\mathbf{v}_i^T\) is the \(i^{th}\) row of \(\mathbf{V}^T\))
\( \sigma_i : i^{th}\) singular value, i.e., diagonal entry of \(\mathbf{\Sigma}\)

Since the singular values are arranged from largest to smallest, i.e., \(\sigma_1 \geq \sigma_2 \geq \cdots \geq \sigma_r\),
the top-ranked terms capture the vast majority of the information or variance in the matrix.

So, in order to get the best rank-k approximation of the matrix, we simply truncate the summation after the k-th term.

\[ \mathbf{A_k} = \mathbf{U} \mathbf{\Sigma_k} \mathbf{V}^T = \sum_{i=1}^{k} \sigma_i \mathbf{u}_i \mathbf{v}_i^T \]

The approximation \(\mathbf{A_k}\) is called the low rank approximation of \(\mathbf{A}\), which is achieved by keeping only the largest singular values and corresponding vectors.
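
A minimal NumPy sketch of rank-k approximation via SVD (illustrative only; a random matrix stands in for the image):

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.normal(size=(100, 80))                 # stand-in for an image matrix

U, s, Vt = np.linalg.svd(A, full_matrices=False)

k = 10
A_k = U[:, :k] @ np.diag(s[:k]) @ Vt[:k, :]    # best rank-k approximation

print(np.linalg.matrix_rank(A_k))              # 10
print(np.linalg.norm(A - A_k))                 # reconstruction error (Frobenius)
```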

Applications:

  1. Image compression
  2. Data compression, such as, LoRA (Low Rank Adaptation)



End of Section

6 - Vector & Matrix Calculus

Vector & Matrix Calculus

In this section we will understand Derivatives of Vector & Matrix, Jacobian and Hessian.



End of Section

7 - Vector Norms

Vector & Matrix Norms

In this section we will understand Vector & Matrix Norms.


πŸ“˜

Vector Norm:
It is a measure of the size of a vector or its distance from the origin.
A vector norm is a function that maps a vector to a non-negative real number, i.e., \({\| \cdot \|} : \mathbb{R}^n \rightarrow \mathbb{R}\).
It should satisfy the following 3 properties:

  1. Non-Negativity:
    Norm is always greater than or equal to zero,
    \( {\| x \|} \ge 0\), and \( {\| x \|} = 0\), if and only if \(\vec{x} = \vec{0}\).
  2. Homogeneity (or Scaling):
    \( {\| \alpha x \|} = |\alpha| {\| x \|} \).
  3. Triangle Inequality:
    \( {\| x + y \|} \le {\| x \|} + {\| y \|} \).

P-Norm:
It is a generalised form of the most common family of vector norms.
It is defined as:

\[ {\| x \|}_p = (\sum_{i=1}^n |x_i|^p)^{1/p} \]


We can change the value of \(p\) to get different norms.

L1 Norm:
It is the sum of absolute values of all the elements of a vector; also known as Manhattan distance.
p=1:

\[ {\| x \|_1} = \sum_{i=1}^n |x_i| = |x_1| + |x_2| + ... + |x_n| \]

L2 Norm:
It is the square root of the sum of squares of all the elements of a vector; also known as Euclidean distance.
p=2:

\[ {\| x \|_2} = (\sum_{i=1}^n x_i^2)^{1/2} = \sqrt{x_1^2 + x_2^2 + ... + x_n^2} \]

L-\(\infty\) Norm:
It is the maximum of the absolute values of all the elements of a vector; also known as Chebyshev distance.
p=\(\infty\):

\[ {\| x \|_\infty} = \max_i |x_i| = \lim_{p \to \infty} (\sum_{i=1}^n |x_i|^p)^{1/p}\]
For example:

  1. Let, vector \(\mathbf{x} = \begin{bmatrix} 3 \\ \\ -4 \end{bmatrix}\), then
    \({\| x \|_1} = |3| + |-4| = 7\)
    \({\| x \|_2} = \sqrt{3^2 + (-4)^2} = \sqrt{25} = 5\)
    \({\| x \|_\infty} = max(|3|, |-4|) = max(3, 4) = 4\)
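
The example above can be checked numerically; a minimal NumPy sketch (illustrative only):

```python
import numpy as np

x = np.array([3.0, -4.0])

print(np.linalg.norm(x, 1))       # 7.0  -> L1 norm
print(np.linalg.norm(x, 2))       # 5.0  -> L2 norm
print(np.linalg.norm(x, np.inf))  # 4.0  -> L-infinity norm
```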

πŸ“˜

Matrix Norm:
It is a function that assigns a non-negative size or magnitude to a matrix.
A matrix norm is a function that maps a matrix to a non-negative real number, i.e., \({\| \cdot \|} : \mathbb{R}^{m \times n} \rightarrow \mathbb{R}\).
It should satisfy the following 3 properties:

  1. Non-Negativity:
    The norm is always greater than or equal to zero,
    \( {\| A \|} \ge 0\), and \( {\| A \|} = 0\) if and only if \(A = 0\) (the zero matrix).
  2. Homogeneity (or Scaling):
    \( {\| \alpha A \|} = |\alpha| {\| A \|} \).
  3. Triangle Inequality:
    \( {\| A + B \|} \le {\| A \|} + {\| B \|} \).

There are 2 types of matrix norms:

  1. Element-wise norms, e.g., the Frobenius norm
  2. Vector induced norms


Frobenius Norm:
It is equivalent to the Euclidean norm of the matrix if it were flattened into a single vector.
If A is a matrix of size \(m \times n\), then, Frobenius norm is defined as:

\[ {\| A \|_F} = \sqrt{\sum_{i=1}^m \sum_{j=1}^n a_{ij}^2} = \sqrt{Trace(A^TA)} = \sqrt{\sum_i \sigma_i^2}\]


\(\sigma_i\) is the \(i\)th singular value of matrix A.

Vector Induced Norm:
It measures the maximum stretching a matrix can apply when multiplied with a vector,
where the vector has a unit length under the chosen vector norm.

Matrix Norm Induced by the Vector P-Norm:
P-Norm:

\[ {\| A \|_p} = \max_{x \neq 0} \frac{\| Ax \|_p}{\| x \|_p} = \max_{{\| x \|_p} = 1} {\| Ax \|_p} \]


P=1 Norm:

\[ {\| A \|_1} = \max_{1 \le j \le n } \sum_{i=1}^m |a_{ij}| = \text{ max absolute column sum } \]

P=\(\infty\) Norm:

\[ {\| A \|_\infty} = \max_{1 \le i \le m } \sum_{j=1}^n |a_{ij}| = \text{ max absolute row sum } \]

P=2 Norm:
Also called Spectral norm, i.e, maximum factor by which the matrix can stretch a unit vector in Euclidean norm.

\[ {\| A \|_2} = \sigma_{max}(A) = \text{ max singular value of matrix } \]

For example:

  1. Let matrix \(\mathbf{A} = \begin{bmatrix} a_{11} & a_{12} \\ \\ a_{21} & a_{22} \end{bmatrix}\), then find the Frobenius norm.

    \({\| A \|_F} = \sqrt{a_{11}^2 + a_{12}^2 + a_{21}^2 + a_{22}^2}\)

    \(\mathbf{A}^T = \begin{bmatrix} a_{11} & a_{21} \\ \\ a_{12} & a_{22} \end{bmatrix}\)

    => \(\mathbf{A^TA} = \begin{bmatrix} a_{11} & a_{21} \\ \\ a_{12} & a_{22} \end{bmatrix} \begin{bmatrix} a_{11} & a_{12} \\ \\ a_{21} & a_{22} \end{bmatrix} = \begin{bmatrix} a_{11}^2 + a_{21}^2 & a_{11}.a_{12} + a_{21}.a_{22} \\ \\ a_{12}.a_{11} + a_{22}.a_{21} & a_{12}^2 + a_{22}^2 \end{bmatrix} \)

    Therefore, \(Trace(\mathbf{A^TA}) = a_{11}^2 + a_{12}^2 + a_{21}^2 + a_{22}^2\)

    => \({\| A \|_F} = \sqrt{Trace(A^TA)} = \sqrt{a_{11}^2 + a_{12}^2 + a_{21}^2 + a_{22}^2}\)

  2. Let, matrix \(\mathbf{A} = \begin{bmatrix} 1 & -2 & 3 \\ \\ 4 & 5 & -6 \end{bmatrix}\), then

Column 1 absolute value sum = |1|+|4| = 5
Column 2 absolute value sum = |-2|+|5|= 7
Column 3 absolute value sum = |3|+|-6|= 9

Row 1 absolute value sum = |1|+|-2|+|3| = 6
Row 2 absolute value sum = |4|+|5|+|-6| = 15

\({\| A \|_1} = max(5,7,9) = 9\) = max column absolute value sum.
\({\| A \|_\infty} = max(6,15) = 15\) = max row absolute value sum.

  3. Let matrix \(\mathbf{A} = \begin{bmatrix} 2 & 1 \\ \\ 1 & 2 \end{bmatrix}\), then find the spectral norm.
    The spectral norm can be found using the singular value decomposition, in order to get the largest singular value.
    \({\| A \|_2} = \sigma_{max}(A) \)

\(\mathbf{A} = \mathbf{U \Sigma V^T} \), where the columns of \(\mathbf{U}\) are the eigen vectors of \(\mathbf{AA^T} \) and the columns of \(\mathbf{V}\) are the eigen vectors of \(\mathbf{A^TA} \)

Let's find the largest eigen value of \(\mathbf{A^TA} \), the square root of which will give the largest singular value of \(\mathbf{A}\).

\( \mathbf{A^TA} = \begin{bmatrix} 2 & 1 \\ \\ 1 & 2 \end{bmatrix} \begin{bmatrix} 2 & 1 \\ \\ 1 & 2 \end{bmatrix} = \begin{bmatrix} 5 & 4 \\ \\ 4 & 5 \end{bmatrix} \)

Now, let's find the eigen values of \(\mathbf{A^TA}\):
\(det(\mathbf{A^TA}-\lambda I) = 0 \)

=> \( det\begin{bmatrix} 5 - \lambda & 4 \\ \\ 4 & 5- \lambda \end{bmatrix} = 0 \)

=> \( (5 - \lambda)^2 - 16 = 0 \)
=> \( (5 - \lambda) = \pm 4 \)
=> \( (5 - \lambda) = 4 \) or \( (5 - \lambda) = -4 \)
=> \( \lambda = 1 \) or \( \lambda = 9 \)
=> Largest Singular Value = Square root of largest eigen value = \(\sqrt{9} = 3\)
Therefore, \({\| A \|_2} = \sigma_{max}(A) = 3\)
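
These worked examples can be checked numerically; a minimal NumPy sketch (illustrative only):

```python
import numpy as np

A = np.array([[1, -2, 3], [4, 5, -6]])
print(np.linalg.norm(A, 1))       # 9.0   -> max absolute column sum
print(np.linalg.norm(A, np.inf))  # 15.0  -> max absolute row sum
print(np.linalg.norm(A, 'fro'))   # Frobenius norm

B = np.array([[2, 1], [1, 2]])
print(np.linalg.norm(B, 2))       # 3.0   -> spectral norm (largest singular value)
```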



End of Section

8 - Hyperplane

Equation of a Hyperplane

In this section we will understand the Equation of a Hyperplane.


πŸ’‘ What is the equation of a line?

Equation of a line is of the form \(y = mx + c\).
To represent a line in 2D space, we need 2 things:

  1. m = slope or direction of the line
  2. c = y-intercept or distance from the origin


πŸ’‘ Consider a line as a hyperplane in 2D space. Let the unit normal vector \(\mathbf{w}\) point towards the positive x-axis direction.
Where is the hyperplane located w.r.t. the origin and the direction of this unit normal vector?

The equation of a hyperplane is: \(\pi_d : \mathbf{w}^\top \mathbf{x} + w_0 = 0\)
Let the equation of the line/hyperplane in 2D be:
\(\pi_d : 1 \cdot x + 0 \cdot y + w_0 = x + w_0 = 0\)

Case 1: \(w_0 < 0\), say \(w_0 = -5\)
Therefore, equation of hyperplane: \( x - 5 = 0 => x = 5\)
Here, the hyperplane (line) is located in the same direction as the unit normal vector w.r.t. the origin,
i.e, towards the +ve x-axis direction.

Case 2: \(w_0 > 0\), say \(w_0 = 5\)
Therefore, equation of hyperplane: \( x + 5 = 0 => x = -5\)
Here, the hyperplane (line) is located in the opposite direction to the unit normal vector w.r.t. the origin,
i.e, towards the -ve x-axis direction.
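
To illustrate the two cases, here is a minimal NumPy sketch (illustrative only); the sign of \(\mathbf{w}^\top \mathbf{x} + w_0\) tells which side of the hyperplane a point lies on:

```python
import numpy as np

w = np.array([1.0, 0.0])   # unit normal pointing along +x

def side(x, w0):
    """Sign of w.x + w0: 0 means on the hyperplane, +/- give the two sides."""
    return np.sign(w @ x + w0)

# Case 1: w0 = -5  ->  hyperplane x = 5 (same side as w w.r.t. the origin)
print(side(np.array([5.0, 2.0]), -5))   # 0.0, point lies on the hyperplane
print(side(np.array([0.0, 0.0]), -5))   # -1.0, origin is on the negative side

# Case 2: w0 = +5  ->  hyperplane x = -5 (opposite side of w w.r.t. the origin)
print(side(np.array([0.0, 0.0]), 5))    # 1.0, origin is on the positive side
```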




End of Section