
Anomaly Detection

Anomaly/Outlier/Novelty Detection

1 - Anomaly Detection

Anomaly Detection Introduction

What is an Anomaly?

🦄 An anomaly is a rare item, event, or observation that deviates significantly from the majority of the data and does not conform to a well-defined notion of normal behavior.

Note: Such examples may arouse suspicions of being generated by a different mechanism, or appear inconsistent with the remainder of that set of data.

Anomaly Detection
🐙 Anomaly detection (Outlier detection or Novelty detection) is the identification of unusual patterns or anomalies or outliers in a given dataset.
What to do with Outliers?

Remove Outliers:

  • Reject or omit outliers from the data to aid statistical analysis, for example when computing the mean or standard deviation of the dataset.
  • Remove outliers to obtain better predictions from models such as linear regression.

🔦Focus on Outliers:

  • Fraud detection in banking and financial services.
  • Cyber-security: intrusion detection, malware, or unusual user access patterns.
Anomaly Detection Methods 🐉
  • Supervised
  • Semi-Supervised
  • Unsupervised (most common) ✅

Note: Labeled anomaly data is often unavailable in real-world scenarios.

Known Methods 🐈
  • Statistical Methods: Z-Score (a large absolute value indicates an outlier) and IQR (a point beyond the fences Q1 - 1.5*IQR or Q3 + 1.5*IQR is flagged as an outlier); see the sketch after this list.
  • Distance Based: KNN; points far from their neighbors are flagged as potential anomalies.
  • Density Based: DBSCAN; points in low-density regions are considered outliers.
  • Clustering Based: K-Means; points far from all cluster centroids (i.e., that do not fit any cluster well) are anomalies.
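
A minimal sketch of the two statistical rules above on a toy 1-D sample with one injected outlier (the data values and thresholds are illustrative assumptions, not recommendations):

```python
import numpy as np

# Toy 1-D sample: 200 'normal' values plus one injected outlier (assumed data).
rng = np.random.default_rng(0)
x = np.append(rng.normal(loc=10, scale=1, size=200), 25.0)

# Z-score rule: flag points more than 3 standard deviations from the mean.
z = (x - x.mean()) / x.std()
print(x[np.abs(z) > 3])                                  # the injected 25.0 is flagged

# IQR rule: flag points beyond the fences Q1 - 1.5*IQR and Q3 + 1.5*IQR.
q1, q3 = np.percentile(x, [25, 75])
iqr = q3 - q1
print(x[(x < q1 - 1.5 * iqr) | (x > q3 + 1.5 * iqr)])    # 25.0 again (plus any borderline points)
```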
Unsupervised Methods 🦅
  • Elliptic Envelope (MCD - Minimum Covariance Determinant)
  • One-Class SVM (OC-SVM)
  • Local Outlier Factor (LOF)
  • Isolation Forest (iForest)
  • RANSAC (Random Sample Consensus)



End of Section

2 - Elliptic Envelope

Elliptic Envelope

Use Case 🐝

Detect anomalies in multivariate Gaussian data, such as biometric data (height/weight), where features are normally distributed and correlated.

Assumption: The data can be modeled by a Gaussian distribution.

Intuition 💡

In a normal distribution, most data points cluster around the mean, and the probability density decreases as we move farther away from the center.

images/machine_learning/unsupervised/anomaly_detection/elliptic_envelope/slide_03_01.png
Issue with Euclidean Distance 🐲

🌍 Euclidean distance measures the simple straight-line distance from the center of the cloud.

👉If the data is spherical, this works fine.

🦕 However, real-world data is often stretched or skewed (e.g., taller people are generally heavier), due to correlations between variables, forming an elliptical shape.

images/machine_learning/unsupervised/anomaly_detection/elliptic_envelope/slide_05_01.png
Mahalanobis Distance (Solution)

⭐️Mahalanobis distance essentially re-scales the data so that the elliptical distribution appears spherical, and then measures the Euclidean distance in that transformed space.

👉This way, it measures how many standard deviations (\(\sigma\)) away a point is from the mean, considering the data’s spread and correlation (covariance).

\[D_M(x) = \sqrt{(x - \mu)^T \Sigma^{-1} (x - \mu)}\]
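
A minimal NumPy sketch of this formula: compute \(D_M\) for every point of a toy correlated dataset. It uses the plain sample mean and covariance rather than the robust MCD estimate motivated in the next sections; the data values are illustrative assumptions.

```python
import numpy as np

# Toy correlated 2-D data (e.g. height/weight); the values are assumptions.
rng = np.random.default_rng(0)
X = rng.multivariate_normal(mean=[170, 70], cov=[[80, 60], [60, 120]], size=500)

mu = X.mean(axis=0)                                   # sample mean
Sigma_inv = np.linalg.inv(np.cov(X, rowvar=False))    # inverse sample covariance

diff = X - mu
d_m = np.sqrt(np.einsum('ij,jk,ik->i', diff, Sigma_inv, diff))  # D_M for every point
print(d_m.max())    # points with a large D_M (e.g. > 3) are candidate outliers
```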
Problem 🦀
Standard methods (like Z-score \(z = \frac{x-\mu}{\sigma}\)) fail because they are easily skewed by the outliers they are trying to find.
Solution 🦉
💡Instead of using all data, we find a ‘clean’ subset of the data that is most tightly packed and use only that subset to define the ‘normal’ ellipse.
Goal 🎯

👉Find the most dense core of the data.

\[D_M(x) = \sqrt{(x - \mu)^T \Sigma^{-1} (x - \mu)}\]

🦣 The determinant of the covariance matrix \(\Sigma\) measures the volume of the ellipsoid (the volume is proportional to \(\sqrt{|\Sigma|}\)).

⏲️ Therefore, minimize \(|\Sigma|\) to find the tight core.

👉 \(\text{Small} ~ |\Sigma| \rightarrow \text{Large} ~ \Sigma^{-1} \rightarrow \text{Large} ~ D_{M} ~\text{for outliers}\)

Minimum Covariance Determinant (MCD) Algorithm ⚙️

MCD algorithm is used to find the covariance matrix \(\Sigma\) with minimum determinant, so that the volume of the ellipsoid is minimized.

  1. Initialization: Select several random subsets of size h < n (default h = \(\frac{n+d+1}{2}\), d = # dimensions), each representing a ‘robust’ majority of the data.
  2. Calculate preliminary mean (\(\mu\)) and covariance (\(\Sigma\)) for each random subset.
  3. Concentration Step: Iterative core of the algorithm designed to ‘tighten’ the ellipsoid.
    • Calculate Distances: Compute the Mahalanobis distance of all ’n’ points in the dataset from the current subset’s mean (\(\mu\)) and covariance (\(\Sigma\)).
    • Select New Subset: Identify the ‘h’ points with the smallest Mahalanobis distances.
      • These are the points most centrally located relative to the current ellipsoid.
    • Update Estimates: Calculate a new \(\mu\) and \(\Sigma\) based only on these ‘h’ most central points.
    • Repeat 🔁: The steps repeat until the determinant stops shrinking.

Note: Select the best subset that achieved the absolute minimum determinant.
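
In practice, a robust MCD fit is usually obtained through scikit-learn's EllipticEnvelope; a minimal sketch, assuming toy correlated Gaussian data with two injected outliers and an illustrative contamination value:

```python
import numpy as np
from sklearn.covariance import EllipticEnvelope

# Correlated Gaussian data with two injected outliers (assumed values).
rng = np.random.default_rng(42)
X = rng.multivariate_normal(mean=[170, 70], cov=[[80, 60], [60, 120]], size=300)
X = np.vstack([X, [[120, 120], [220, 30]]])

# EllipticEnvelope fits a robust (MCD) mean and covariance, then flags points
# with a large robust Mahalanobis distance; contamination is an assumption.
ee = EllipticEnvelope(contamination=0.01, random_state=42)
labels = ee.fit_predict(X)                 # +1 = inlier, -1 = outlier
print(np.where(labels == -1)[0])           # indices flagged as outliers
print(ee.mahalanobis(X[-2:]))              # squared robust distances of the injected points
```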

Limitations
  • Assumptions
    • Gaussian data.
    • Unimodal data (single center).
  • Cost 💰 of inverting the covariance matrix \(\Sigma\) is \(O(d^{3})\).



End of Section

3 - One Class SVM

One Class SVM

Use Case (Novelty Detection)🐝

⭐️Only one class of data (normal, non-outlier) is available for training, making standard supervised classification impossible.

e.g., only normal observations are available for fraud detection, cyber-attack detection, fault detection, etc.

Intuition
images/machine_learning/unsupervised/anomaly_detection/one_class_svm/slide_02_01.png
Problem 🦀

🦂 The core problem is to build a model that can distinguish between ‘normal’ and ‘anomalous’ data when we only have examples of the ‘normal’ class during training.

🦖 We need to find a decision boundary that is as compact as possible while still encompassing the bulk of the training data.

Solution 🦉
💡Instead of finding a hyperplane that separates two different classes, we find a hyperplane that best separates the normal data points from the origin (0,0) in the feature space 🚀.
Goal 🎯

🦍 Define a boundary for a single class in high-dimensional space where data might be non-linearly distributed (e.g., a ‘U’ shape).

🦧 Use the Kernel Trick to project data into a higher-dimensional space and find a hyperplane that separates the data from the origin with the maximum margin.

One Class SVM

⭐️OC-SVM, as introduced by Bernhard Schölkopf et al., uses a hyperplane ‘H’ defined by a weight vector \(\mathbf{w}\) and a bias term \(\rho\).

👉Solve the following optimization problem:

\[\min _{\mathbf{w},\xi _{i},\rho }\frac{1}{2}||\mathbf{w}||^{2}+\frac{1}{\nu N}\sum _{i=1}^{N}\xi _{i}-\rho \]

Subject to constraints:

\[\mathbf{w}\cdot \phi (\mathbf{x}_{i})\ge \rho -\xi _{i}\quad \text{and}\quad \xi _{i}\ge 0,\quad \text{for\ }i=1,\dots ,N\]
Explanation of Terms
  • \(\mathbf{x}_{i}\): i-th training data point.
  • \(\phi (\mathbf{x}_{i})\): feature map that sends the data into a higher-dimensional feature space, defined implicitly through a kernel such as the RBF kernel \(K(x, y) = \exp(-\gamma \|x-y\|^2)\), making it easier to separate the data from the origin.
  • \(\mathbf{w}\): normal vector to the separating hyperplane.
  • \(\rho\): scalar bias term that determines the offset of the hyperplane from the origin.
  • \(\xi_i\): Slack variables that allow some data points to fall on the ‘wrong’ side of the hyperplane (inside the anomalous region) to prevent overfitting.
  • N: total number of training points.
  • \(\nu\): hyper-parameter between 0 and 1. It acts as an upper bound on the fraction of outliers (training data points outside the boundary) and a lower bound on the fraction of support vectors.
Working 🦇
  • \(\frac{1}{2}\|\mathbf{w}\|^{2}\): aims to maximize the margin/compactness of the region.
  • \(\frac{1}{\nu N}\sum _{i=1}^{N}\xi _{i}\): penalizes points (outliers) that violate the boundary constraints.
  • \(-\rho\): maximizing \(\rho\) pushes the hyperplane as far from the origin as possible.

After solving the optimization problem using standard quadratic programming techniques, we obtain the optimal \(\mathbf{w}^{*}\) and \(\rho ^{*}\).

For a new data point \(\mathbf{x}_{\text{new}}\), the decision function is:

\[f(\mathbf{x}_{\text{new}})=\text{sign}(\mathbf{w}^{*}\cdot \phi (\mathbf{x}_{\text{new}})-\rho ^{*})\]
  • \(f(\mathbf{x}_{\text{new}})\ge 0\): normal point.

  • \(f(\mathbf{x}_{\text{new}})< 0\): anomalous point (outlier).

    images/machine_learning/unsupervised/anomaly_detection/one_class_svm/oc_svm_plot.png
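
A minimal scikit-learn sketch of this workflow: train on normal observations only, then evaluate the decision function on new points. The data, nu, and gamma values below are illustrative assumptions:

```python
import numpy as np
from sklearn.svm import OneClassSVM

# Train on 'normal' observations only (assumed toy data).
rng = np.random.default_rng(0)
X_train = rng.normal(loc=0.0, scale=1.0, size=(500, 2))

# nu bounds the fraction of training outliers; nu and gamma here are assumptions.
oc_svm = OneClassSVM(kernel="rbf", nu=0.05, gamma="scale")
oc_svm.fit(X_train)

X_new = np.array([[0.1, -0.2], [4.0, 4.0]])       # one typical point, one far-away point
print(oc_svm.predict(X_new))                      # +1 = normal, -1 = anomaly
print(oc_svm.decision_function(X_new))            # sign corresponds to w* . phi(x) - rho*
```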



End of Section

4 - Local Outlier Factor

Local Outlier Factor

Use Case 🐝
⭐️Geographic fraud detection:
A $100 transaction might be ‘normal’ in New York but an ‘outlier’ in a small rural village.
Intuition 💡

‘Local context matters.’

Global distance metrics fail when density is non-uniform.

🦄 An outlier is a point that is ‘unusual’ relative to its immediate neighbors, regardless of how far it is from the center of the entire dataset.

Problem 🦀

💡Traditional distance-based outlier detection methods, such as KNN, often struggle with datasets where data is clustered at varying densities.

  • A point in a sparse region might be considered an outlier by a global method, even if it is a normal part of that sparse cluster.
  • Conversely, a point lying just outside a dense cluster might be an outlier relative to that specific neighborhood, even though a global method would not flag it.
Solution 🦉

👉Calculate the relative density of a point compared to its immediate neighborhood.

e.g. If the neighbors are in a dense crowd and the point is not, it is an outlier.

Goal 🎯
📌Compare the density of a point to the density of its neighbors.
Local Outlier Factor (LOF)

Local Outlier Factor (LOF) is a density-based algorithm designed to detect anomalies by measuring the local deviation of a data point relative to its neighbors.

👉Size of the red circle represents the LOF score.

images/machine_learning/unsupervised/anomaly_detection/local_outlier_factor/slide_08_01.png
LOF Score Calculation 🔢
  1. K-Distance (\(k\text{-dist}(p)\)):
    • The distance from point ‘p’ to its k-th nearest neighbor.
  2. Reachability Distance (\(\text{reach-dist}_{k}(p,o)\)): \[\text{reach-dist}_{k}(p,o)=\max \{k\text{-dist}(o),\text{dist}(p,o)\}\]
    • \(\text{dist}(p,o)\): actual Euclidean distance between ‘p’ and ‘o’.
    • This acts as a ‘smoothing’ factor.
    • If point ‘p’ is very close to ‘o’ (inside o’s k-neighborhood), the distance is rounded up to \(k\text{-dist}(o)\).
  3. Local Reachability Density (\(\text{lrd}_{k}(p)\)):
    • The inverse of the average reachability distance from ‘p’ to its k-neighbors (\(N_{k}(p)\)). \[\text{lrd}_{k}(p)=\left[\frac{1}{|N_{k}(p)|}\sum _{o\in N_{k}(p)}\text{reach-dist}_{k}(p,o)\right]^{-1}\]
      • High LRD: Neighbors are very close; the point is in a dense region.
      • Low LRD: Neighbors are far away; the point is in a sparse region.
  4. Local Outlier Factor (\(\text{LOF}_{k}(p)\)):
    • The ratio of the average ‘lrd’ of p’s neighbors to p’s own ‘lrd’. \[\text{LOF}_{k}(p)=\frac{1}{|N_{k}(p)|}\sum _{o\in N_{k}(p)}\frac{\text{lrd}_{k}(o)}{\text{lrd}_{k}(p)}\]
images/machine_learning/unsupervised/anomaly_detection/local_outlier_factor/slide_10_01.png
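
The four steps above can be sketched directly in NumPy; a minimal example, assuming a toy dataset (a tight cluster plus one isolated point) and an illustrative choice of k:

```python
import numpy as np

# Tight cluster of four points plus one isolated point; k is an assumption.
X = np.array([[0.0, 0.0], [0.1, 0.0], [0.0, 0.1], [0.1, 0.1], [3.0, 3.0]])
k = 2

D = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)   # pairwise distances
np.fill_diagonal(D, np.inf)                                  # a point is not its own neighbor

knn_idx = np.argsort(D, axis=1)[:, :k]       # indices of each point's k nearest neighbors
k_dist = np.sort(D, axis=1)[:, k - 1]        # step 1: k-distance of every point

def lrd(p):
    # Steps 2-3: reach-dist_k(p, o) = max(k-dist(o), dist(p, o)); lrd is its inverse mean.
    neighbors = knn_idx[p]
    reach = np.maximum(k_dist[neighbors], D[p, neighbors])
    return 1.0 / reach.mean()

lrds = np.array([lrd(p) for p in range(len(X))])

# Step 4: LOF_k(p) = average lrd of p's neighbors divided by p's own lrd.
lof = np.array([lrds[knn_idx[p]].mean() / lrds[p] for p in range(len(X))])
print(np.round(lof, 2))      # the isolated point (3, 3) gets a score well above 1
```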
LOF Score 🔢 Interpretation
  • LOF ≈ 1: Point ‘p’ has similar density to its neighbors (inlier).
  • LOF > 1: Point p’s density is lower than its neighbors’ density; values well above 1 indicate an outlier.
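
In practice, scikit-learn's LocalOutlierFactor performs this computation; a minimal sketch with a dense cluster, a sparse cluster, and one point on the fringe of the dense cluster (all values are illustrative assumptions):

```python
import numpy as np
from sklearn.neighbors import LocalOutlierFactor

# A dense cluster, a sparse cluster, and one point just outside the dense cluster.
rng = np.random.default_rng(0)
dense = rng.normal(loc=0.0, scale=0.2, size=(100, 2))
sparse = rng.normal(loc=5.0, scale=1.5, size=(100, 2))
X = np.vstack([dense, sparse, [[1.0, 1.0]]])

lof = LocalOutlierFactor(n_neighbors=20, contamination="auto")
labels = lof.fit_predict(X)                 # +1 = inlier, -1 = outlier
scores = -lof.negative_outlier_factor_      # back on the LOF scale (approx. 1 for inliers)
print(labels[-1], scores[-1])               # the fringe point is expected to score well above 1
```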



End of Section

5 - Isolation Forest

Isolation Forest

Use Case 🐝

‘Large scale tabular data.’

Credit card fraud detection in datasets with millions of rows and hundreds of features.

Note: Supervised learning requires balanced, labeled datasets (normal vs. anomaly), which are rarely available in real-world scenarios like fraud or cyber-attacks.

Intuition 💡

🦥 ‘Flip the logic.’

🦄 ‘Anomalies’ are few and different, so they are much easier to isolate from the rest of the data than normal points.

Problem 🦀

🐦‍🔥 ‘Curse of dimensionality.’

🐎 Distance-based (K-NN) and density-based (LOF) algorithms require calculating distances between all pairs of points.

🐙 As the number of dimensions and data points grows, these calculations become prohibitively expensive 💰 and less effective.

Solution 🦉
Use a tree-based 🌲 approach with better time ⏰ complexity, \(O(n \log n)\), making it highly scalable for massive datasets and robust in high-dimensional spaces without needing expensive distance metrics.
Goal 🎯

‘Randomly partition the data.’

🦄 If a point is an outlier, it will take fewer partitions (splits) to isolate it into a leaf 🍃 node compared to a point that is buried deep within a dense cluster of normal data.

Isolation Forest (iForest) 🌲🌳

🌳Isolation Forest uses an ensemble of ‘Isolation Trees’ (iTrees) 🌲.

👉iTree (Isolation Tree) 🌲 is a proper binary tree structure specifically designed to separate individual data points through random recursive partitioning.

Algorithm ⚙️
  1. Sub-sampling:
    • Select a random subset of data (typically 256 instances) to build an iTree.
  2. Tree Construction: Randomly select a feature.
    • Randomly select a split value between the minimum and maximum values of that feature.
    • Divide the data into two branches based on this split.
    • Repeat recursively until the point is isolated or a height limit is reached.
  3. Forest Creation:
    • Repeat 🔁 the process to create ‘N’ trees (typically 100).
  4. Inference:
    • Pass a new data point through all trees, calculate the average path length, and compute the anomaly score.
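
A minimal scikit-learn sketch of these steps, using IsolationForest with the typical settings mentioned above (100 trees, 256-point sub-samples); the dataset and contamination value are illustrative assumptions:

```python
import numpy as np
from sklearn.ensemble import IsolationForest

# 5-D Gaussian data with one injected anomaly (assumed values).
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0.0, 1.0, size=(1000, 5)), [[8, 8, 8, 8, 8]]])

# 100 trees built on 256-point sub-samples, as described above;
# the contamination level is an illustrative assumption.
iforest = IsolationForest(n_estimators=100, max_samples=256,
                          contamination=0.01, random_state=42)
labels = iforest.fit_predict(X)        # +1 = normal, -1 = anomaly
scores = iforest.score_samples(X)      # lower scores = more anomalous
print(labels[-1], scores[-1])          # the injected point is expected to be flagged
```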
Scoring Function 📟

⭐️🦄 Assign an anomaly score based on the path length h(x) required to isolate a point ‘x’.

  • Path Length (h(x)): The number of edges ‘x’ traverses from the root node to a leaf node.
  • Average Path Length (c(n)): Since iTrees are structurally similar to Binary Search Trees (BST), the average path length for a dataset of size ’n’ is given by: \[c(n)=2H(n-1)-\frac{2(n-1)}{n}\]

where H(i) is the i-th harmonic number, estimated as \(\ln (i)+0.5772156649\) (the Euler–Mascheroni constant).

🦄 Anomaly Score

To normalize the score between 0 and 1, we define it as:

\[s(x,n)=2^{-\frac{E(h(x))}{c(n)}}\]

👉\(E(h(x))\): the average path length of ‘x’ across a forest of trees 🌲.

  • \(s\rightarrow 1\): Point is an anomaly; Path length is very short.
  • \(s\approx 0.5\): Point is normal; path length is approximately equal to c(n).
  • \(s\rightarrow 0\): Point is normal; deeply buried point, path length is much larger than c(n).
images/machine_learning/unsupervised/anomaly_detection/isolation_forest/slide_12_01.tif
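
A minimal sketch of the scoring formula, implementing c(n) and s(x, n) as defined above; the path lengths plugged in are illustrative assumptions:

```python
import numpy as np

EULER_GAMMA = 0.5772156649

def c(n: int) -> float:
    """Average path length c(n) = 2*H(n-1) - 2*(n-1)/n, with H(i) ~ ln(i) + gamma."""
    if n <= 1:
        return 0.0
    return 2.0 * (np.log(n - 1) + EULER_GAMMA) - 2.0 * (n - 1) / n

def anomaly_score(avg_path_length: float, n: int) -> float:
    """s(x, n) = 2^(-E(h(x)) / c(n))."""
    return 2.0 ** (-avg_path_length / c(n))

n = 256                                   # sub-sample size used to build each iTree
print(round(c(n), 2))                     # ~10.2: expected depth for n = 256
print(round(anomaly_score(4.0, n), 2))    # ~0.76: short path, leaning anomalous
print(round(anomaly_score(12.0, n), 2))   # ~0.44: path longer than c(n), normal
```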
Drawbacks
  • Axis-Parallel Splits:
    • Standard iTrees 🌲 split only on one feature at a time, so:
      • We do not get a smooth decision boundary.
      • Anything off-axis has a higher probability of being marked as an outlier.
      • Note: Extended Isolation Forest fixes this by using random slopes.
  • Score Sensitivity: The threshold for what constitutes an ‘anomaly’ often requires manual tuning or domain knowledge.



End of Section

6 - RANSAC

RANSAC

Use Case 🐝
⭐️Estimate the parameters of a model from a set of observed data that contains a significant number of outliers.
Intuition 💡

👉Ordinary Least Squares uses all data points to find a fit.

  • However, a single outlier can ‘pull’ the resulting line significantly, leading to a poor representative model.

💡If we pick a very small random subset of points, there is a higher probability that this small subset contains only good data (inliers) compared to a large set.

Problem 🦀
  • Ordinary Least Squares (OLS) minimizes the sum of squared errors.

    • A huge outlier therefore has a disproportionately large (quadratic) impact on the final line.
    images/machine_learning/unsupervised/anomaly_detection/ransac/slide_03_01.png
Solution 🦉

💡Instead of using all points, iteratively pick the smallest possible random subset to fit a model, then check (votes) how many other points in the dataset ‘agree’ with that model.

This gives the algorithm its name (RANdom SAmple Consensus):

  • Random: random subsets.
  • Sample: small (minimal) subsets.
  • Consensus: agreement (votes) from the other points.
RANSAC Algorithm ⚙️
  1. Random Sampling:
    • Randomly select a Minimal Sample Set (MSS) of ’n’ points from the input data ‘D’.
    • e.g. n=2 for a line, or n=3 for a plane in 3D.
  2. Model Fitting:
    • Compute the model parameters using only these ’n’ points.
  3. Test:
    • For all other points in ‘D’, calculate the error relative to the model.
    • 👉 Points with error < \(\tau\) (threshold) are added to the ‘Consensus Set’.
  4. Evaluate:
    • If the consensus set is larger than the previous best, save this model and set.
  5. Repeat 🔁:
    • Iterate ‘k’ times.
  6. Refine (Optional):
    • Once the best model is found, re-estimate it using all points in the final consensus set (usually via Least Squares) for a more precise fit.

👉Example:

images/machine_learning/unsupervised/anomaly_detection/ransac/slide_09_01.png
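
A minimal scikit-learn sketch of this procedure using RANSACRegressor (which wraps a LinearRegression by default; with one feature, the minimal sample size is n = 2). The toy data and the residual threshold \(\tau\) are illustrative assumptions:

```python
import numpy as np
from sklearn.linear_model import RANSACRegressor

# Line y = 2x + 1 with Gaussian noise and 10 gross outliers (assumed values).
rng = np.random.default_rng(0)
x = np.linspace(0, 10, 100).reshape(-1, 1)
y = 2.0 * x.ravel() + 1.0 + rng.normal(0, 0.3, size=100)
y[::10] += rng.uniform(15, 30, size=10)

# The residual threshold tau = 1.0 is an illustrative assumption.
ransac = RANSACRegressor(residual_threshold=1.0, random_state=42)
ransac.fit(x, y)
print(ransac.estimator_.coef_, ransac.estimator_.intercept_)   # close to [2.0] and 1.0
print((~ransac.inlier_mask_).sum())                            # ~10 points rejected as outliers
```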
How Many Iterations ‘k’?

👉To ensure the algorithm finds a ‘clean’ sample set (no outliers) with a desired probability (often 99%), we use the following formula:

\[k=\frac{\log (1-P)}{\log (1-w^{n})}\]
  • k: Number of iterations required.
  • P: Desired probability that at least one random sample contains only inliers.
  • w: Ratio of inliers in the dataset (number of inliers / total points).
  • n: Number of points in the Minimal Sample Set.
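
A minimal sketch evaluating this formula for a line fit (n = 2) at a few assumed inlier ratios:

```python
import numpy as np

def ransac_iterations(P: float, w: float, n: int) -> int:
    """k = log(1 - P) / log(1 - w^n), rounded up."""
    return int(np.ceil(np.log(1.0 - P) / np.log(1.0 - w ** n)))

# Fitting a line (n = 2) with 99% confidence at assumed inlier ratios:
for w in (0.9, 0.7, 0.5, 0.3):
    print(w, ransac_iterations(P=0.99, w=w, n=2))
# -> 3, 7, 17 and 49 iterations: k grows rapidly as the inlier ratio drops.
```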



End of Section