1 - K Means

K Means Clustering

Unsupervised Learning

🌍In real-world systems, labeled data is scarce and expensive 💰.

💡Unsupervised learning discovers inherent structure without human annotation.

👉Clustering answers: “Given a set of points, what natural groupings exist?”

Real-World 🌍 Motivations for Clustering
  • Customer Segmentation: Group users by behavior without predefined categories.
  • Image Compression: Reduce color palette by clustering pixel colors.
  • Anomaly Detection: Points far from any cluster are outliers.
  • Data Exploration: Understand structure before building supervised models.
  • Preprocessing: Create features from cluster assignments.
Key Insight 💡

💡Clustering assumes that ‘similar’ points should be grouped together.

👉But what is ‘similar’? This assumption drives everything.

Problem Statement ✍️

Given:

  • Dataset X = {x₁, x₂, …, xₙ} where xᵢ ∈ ℝᵈ.
  • Desired number of clusters ‘k’.

Find:

  • Cluster assignments C = {C₁, C₂, …, Cₖ}.

  • Such that points within clusters are ‘similar’.

  • And points across clusters are ‘dissimilar’.

    images/machine_learning/unsupervised/k_means/k_means_clustering/slide_07_01.png
Optimization Perspective

This is fundamentally an optimization problem, i.e., find parameters that minimize (or maximize) an objective value. We need:

  • An objective function
    • what makes a clustering ‘good’?
  • An algorithm to optimize it
    • how do we find good clusters?
  • Evaluation metrics
    • how do we measure quality?
Optimization

Objective function:
👉Minimize the within-cluster sum of squares (WCSS).

\[J(C, \mu) = \sum_{j=1}^k \sum_{x_i \in C_j} \underbrace{\|x_i -\mu_j\|^2}_{\text{distance from mean}} \]
  • Where:
  • C = {C₁, …, Cₖ} are cluster assignments.
  • μⱼ is the centroid (mean) of cluster Cⱼ.
  • ||·||² is squared Euclidean distance.

Note: Every point belongs to one and only one cluster.
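
A minimal NumPy sketch of this objective (the function name wcss and its arguments are illustrative, not from the source):

import numpy as np

def wcss(X, labels, centroids):
    # J(C, mu): sum over clusters of squared distances from points to their centroid.
    total = 0.0
    for j, mu in enumerate(centroids):
        members = X[labels == j]              # points assigned to cluster j
        total += np.sum((members - mu) ** 2)  # squared Euclidean distances to the mean
    return total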

Variance Decomposition

💡Within-Cluster Sum of Squares (WCSS) is essentially the (unnormalized) within-cluster variance.

⭐️ Total Variance = Within-Cluster Variance + Between-Cluster Variance

👉K-Means minimizes within-cluster variance, which implicitly maximizes between-cluster separation.
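
A tiny numeric check of this decomposition (the toy data below is made up purely for illustration):

import numpy as np

X = np.array([[1.0, 1.0], [1.5, 2.0], [8.0, 8.0], [9.0, 9.5]])
labels = np.array([0, 0, 1, 1])
grand_mean = X.mean(axis=0)

tss = np.sum((X - grand_mean) ** 2)  # total sum of squares
wss = sum(np.sum((X[labels == j] - X[labels == j].mean(axis=0)) ** 2) for j in (0, 1))
bss = sum(len(X[labels == j]) * np.sum((X[labels == j].mean(axis=0) - grand_mean) ** 2)
          for j in (0, 1))

print(np.isclose(tss, wss + bss))  # True: Total = Within + Between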

Geometric Interpretation:

  • Each point is ‘pulled’ toward its cluster center.
  • The objective measures total squared distance of all points to their centers.
  • Lower J(C, μ) means tighter, more compact clusters.

Note: K-Means works best when clusters are roughly spherical, similarly sized, and well-separated.

images/machine_learning/unsupervised/k_means/k_means_clustering/slide_11_01.png
Combinatorial Explosion 💥

⭐️The problem requires partitioning ’n’ observations into ‘k’ distinct, non-overlapping clusters. The number of such partitions is given by the Stirling number of the second kind, which grows roughly like \(k^n/k!\).

\[S(n,k)=\frac{1}{k!}\sum _{j=0}^{k}(-1)^{k-j}{k \choose j}j^{n}\]

\[S(100,2)=2^{100-1}-1=2^{99}-1\]

\[2^{99}\approx 6.338\times 10^{29}\]

👉This combinatorial explosion rules out exhaustive search; in fact, finding the globally optimal clustering is NP-Hard.

🦉The k-means optimization problem is NP-hard because it belongs to a class of problems for which no efficient (polynomial-time ⏰) algorithm is known to exist.
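
A short Python sketch that evaluates the formula above and confirms the k = 2 special case (the helper name stirling2 is made up):

from math import comb, factorial

def stirling2(n, k):
    # Stirling number of the second kind via the inclusion-exclusion formula above.
    return sum((-1) ** (k - j) * comb(k, j) * j ** n for j in range(k + 1)) // factorial(k)

print(stirling2(100, 2) == 2 ** 99 - 1)  # True: matches the closed form for k = 2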



End of Section

2 - Lloyds Algorithm

Lloyds Algorithm

Idea 💡
Since we cannot enumerate all partitions (i.e., all ways of splitting ’n’ observations into ‘k’ distinct, non-overlapping clusters), Lloyd’s algorithm provides a local-search heuristic (an approximate algorithm).
Lloyd's Algorithm ⚙️

Iterative method for partitioning ’n’ data points into ‘k’ groups by repeatedly assigning data points to the nearest centroid (mean) and then recalculating centroids until assignments stabilize, aiming to minimize within-cluster variance.

📥Input: X = {x₁, …, xₙ}, ‘k’ (number of clusters)

📤Output: ‘C’ (clusters), ‘μ’ (centroids)

👉Steps:

  1. Initialize: Randomly choose ‘k’ cluster centroids μ₁, …, μₖ.
  2. Repeat until convergence, i.e., until cluster assignments and centroids no longer change significantly.
  • a) Assignment: Assign each data point to the cluster whose centroid is closest (usually using Euclidean distance).
    • For each point xᵢ: cᵢ = argminⱼ ||xᵢ - μⱼ||²
  • b) Update: Recalculate the centroid (mean) of each cluster.
    • For each cluster j: μⱼ = (1/|Cⱼ|) Σₓᵢ∈Cⱼ xᵢ
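
A compact NumPy sketch of these steps (the function name lloyd_kmeans is illustrative; empty clusters are not handled):

import numpy as np

def lloyd_kmeans(X, k, n_iter=100, seed=None):
    rng = np.random.default_rng(seed)
    centroids = X[rng.choice(len(X), size=k, replace=False)]  # 1. random initialization
    for _ in range(n_iter):
        # 2a. Assignment: each point goes to the nearest centroid (squared Euclidean distance)
        dists = ((X[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
        labels = dists.argmin(axis=1)
        # 2b. Update: recompute each centroid as the mean of its assigned points
        new_centroids = np.array([X[labels == j].mean(axis=0) for j in range(k)])
        if np.allclose(new_centroids, centroids):  # converged: assignments have stabilized
            break
        centroids = new_centroids
    return labels, centroids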
Issues 🚨
  • Sensitive to initialization; different initializations may lead to different clusters.
  • Tends to produce clusters of roughly equal size, which may not hold in real-world data.
  • Tends to assume each cluster has similar density (variance).
  • Does not work well with non-globular (non-spherical) clusters.

👉See how two different runs of the K-Means algorithm give totally different clusters.

images/machine_learning/unsupervised/k_means/lloyds_algorithm/slide_06_01.png
images/machine_learning/unsupervised/k_means/lloyds_algorithm/slide_06_02.png

👉Also, K-Means does not work well with non-spherical clusters, or clusters with different densities and sizes.

images/machine_learning/unsupervised/k_means/lloyds_algorithm/slide_07_01.png
Solutions
✅ Do multiple runs 🏃‍♂️ and choose the clustering with minimum error.
✅ Do not select initial points purely at random; use a smarter seeding strategy such as the K-Means++ algorithm.
✅ Use hierarchical clustering or density-based clustering (DBSCAN).
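
For reference, scikit-learn's KMeans applies the first two fixes out of the box; a minimal usage sketch, assuming X is a NumPy array of shape (n_samples, n_features) and k = 3 is just an example:

from sklearn.cluster import KMeans

# init='k-means++' spreads out the initial centroids; n_init=10 runs the algorithm
# 10 times and keeps the run with the lowest within-cluster sum of squares (inertia_).
km = KMeans(n_clusters=3, init='k-means++', n_init=10, random_state=42).fit(X)
labels, centroids, error = km.labels_, km.cluster_centers_, km.inertia_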



End of Section

3 - K Means++

K Means++ Algorithm

Issues with Random Initialization
  • If two initial centroids belong to the same natural cluster, the algorithm will likely split that natural cluster in half and be forced to merge two other distinct clusters elsewhere to compensate.
  • Inconsistent; different runs may lead to different clusters.
  • Slow convergence; centroids may need to travel much farther across the feature space, requiring more iterations.

👉Example of how different K-Means runs give different clusters:

images/machine_learning/unsupervised/k_means/k_means_plus_plus/slide_02_01.png
images/machine_learning/unsupervised/k_means/k_means_plus_plus/slide_02_02.png
K-Means++ Algorithm

💡Addresses the random-initialization issue by spreading the initial centroids out across the data.

Steps:

  1. Select the first centroid: Choose one data point randomly from the dataset to be the first centroid.
  2. Calculate distances: For every data point x not yet selected as a centroid, calculate the distance, D(x), between x and the nearest centroid chosen so far.
  3. Select the next centroid: Choose the next centroid from the remaining data points with a probability proportional to D(x)^2.
    This makes it more likely that a point far from existing centroids is selected, ensuring the initial centroids are well-dispersed.
  4. Repeat: Repeat steps 2 and 3 until ‘k’ number of centroids are selected.
  5. Run standard K-means: Once the initial centroids are chosen, the standard k-means algorithm proceeds with assigning data points to the nearest centroid and iteratively updating the centroids until convergence.
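
A minimal NumPy sketch of this seeding procedure (the function name kmeans_pp_init is illustrative):

import numpy as np

def kmeans_pp_init(X, k, seed=None):
    rng = np.random.default_rng(seed)
    centroids = [X[rng.integers(len(X))]]  # step 1: pick the first centroid at random
    for _ in range(k - 1):
        # step 2: squared distance D(x)^2 from each point to its nearest chosen centroid
        d2 = ((X[:, None, :] - np.array(centroids)[None, :, :]) ** 2).sum(axis=2).min(axis=1)
        # step 3: sample the next centroid with probability proportional to D(x)^2
        centroids.append(X[rng.choice(len(X), p=d2 / d2.sum())])
    return np.array(centroids)  # step 5: hand these off to standard K-Means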
Problem 🚨
🦀 If our data is extremely noisy (outliers), the probabilistic logic (\(\propto D(x)^2\)) might accidentally pick an outlier as a cluster center.
Solution ✅
Do robust preprocessing to remove outliers, or use the K-Medoids algorithm.



End of Section

4 - K Medoid

K Medoid

Issues with K-Means
  • In K-Means, the centroid is the arithmetic mean of the cluster. The mean is very sensitive to outliers.
  • Not interpretable; centroid is the mean of cluster data points and may not be an actual data point, hence not representative.
Medoid

Medoid is a specific data point from a dataset that acts as the ‘center’ or most representative member of its cluster.

👉It is defined as the object within a cluster whose average dissimilarity (distance) to all other members in that same cluster is the smallest.

K-Medoids (PAM) Algorithm

💡Selects actual data points from the dataset as cluster representatives, called medoids (most centrally located).

👉a.k.a. Partitioning Around Medoids (PAM).

Steps:

  1. Initialization: Select ‘k’ data points from the dataset as the initial medoids (e.g., using the K-Means++ seeding strategy).
  2. Assignment: Calculate the distance (e.g., Euclidean or Manhattan) from each non-medoid point to all medoids and assign each point to the cluster of its nearest medoid.
  3. Update (Swap):
  • For each cluster, try swapping the current medoid with a non-medoid point from the same cluster.
  • For each candidate swap, calculate the total cost 💰 (sum of distances from each point to its nearest medoid).
  • Keep the swap with the minimum total cost 💰.
  4. Repeat 🔁: Repeat the assignment and update steps until convergence, i.e., medoids no longer change or a maximum number of iterations is reached.

Note: This is essentially a brute-force search, so it is computationally expensive for large datasets.
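
A brute-force sketch of the swap step, assuming a precomputed pairwise distance matrix D (n x n NumPy array); unlike the per-cluster description above, this sketch tries every non-medoid point, as in classic PAM:

import numpy as np

def total_cost(D, medoid_idx):
    # Sum over all points of the distance to their nearest medoid.
    return D[:, medoid_idx].min(axis=1).sum()

def pam_swap_pass(D, medoid_idx):
    # One greedy pass: try swapping each medoid with each non-medoid point
    # and keep the configuration with the lowest total cost.
    best_idx, best_cost = medoid_idx.copy(), total_cost(D, medoid_idx)
    for pos in range(len(medoid_idx)):
        for h in range(D.shape[0]):
            if h in medoid_idx:
                continue
            candidate = medoid_idx.copy()
            candidate[pos] = h
            cost = total_cost(D, candidate)
            if cost < best_cost:
                best_idx, best_cost = candidate, cost
    return best_idx, best_cost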

Advantages
Robust to Outliers: Since medoids are actual data points rather than averages, extreme values (outliers) do not skew the center of the cluster as they do in K-Means.
Flexible Distance Metrics: It can work with any dissimilarity measure (Manhattan, Cosine similarity), making it ideal for categorical or non-Euclidean data.
Interpretable Results: Cluster centers are real observations (e.g., a ‘typical’ customer profile), which makes the output easier to explain to stakeholders.


End of Section

5 - Clustering Quality Metrics

Clustering Quality Metrics

How to Evaluate Quality of Clustering?
  • 👉 Elbow Method: Quickest to compute; good for initial EDA (Exploratory Data Analysis).
  • 👉 Dunn Index: Focuses on the ‘gap’ between the closest clusters.
  • 👉 Silhouette Score: Balances compactness and separation.
  • 👉 Domain specific knowledge and system constraints.
Elbow Method

Heuristic used to determine the optimal number of clusters (k) by visualizing how the quality of clustering improves as ‘k’ increases.

🎯The goal is to find a value of ‘k’ where adding more clusters provides a diminishing return in terms of variance reduction.

images/machine_learning/unsupervised/k_means/clustering_quality_metrics/slide_02_01.png
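
A minimal sketch of the Elbow Method with scikit-learn and matplotlib, assuming X is the data matrix (inertia_ is scikit-learn's name for the WCSS):

import matplotlib.pyplot as plt
from sklearn.cluster import KMeans

ks = range(1, 11)
wcss_values = [KMeans(n_clusters=k, n_init=10, random_state=0).fit(X).inertia_ for k in ks]

plt.plot(ks, wcss_values, marker='o')  # look for the 'elbow' where the curve flattens
plt.xlabel('Number of clusters k')
plt.ylabel('WCSS (inertia)')
plt.show()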
Dunn Index [0, \(\infty\))

⭐️Clustering quality evaluation metric that measures: separation (between clusters) and compactness (within clusters)

Note: A higher Dunn Index value indicates better clustering, meaning clusters are well-separated from each other and compact.

👉Dunn Index Formula:

\[DI = \frac{\text{Minimum Inter-Cluster Distance(between different clusters)}}{\text{Maximum Intra-Cluster Distance(within a cluster)}}\]

\[DI = \frac{\min_{1 \le i < j \le k} \delta(C_i, C_j)}{\max_{1 \le l \le k} \Delta(C_l)}\]
images/machine_learning/unsupervised/k_means/clustering_quality_metrics/slide_06_01.png

👉Let’s understand the terms in the above formula:

  • \(\delta(C_i, C_j)\) (Inter-Cluster Distance):

    • Measures how ‘far apart’ the clusters are.
    • Distance between the two closest points of different clusters (Single-Linkage distance). \[\delta(C_i, C_j) = \min_{x \in C_i, y \in C_j} d(x, y)\]
  • \(\Delta(C_l)\) (Intra-Cluster Diameter):

    • Measures how ‘spread out’ a cluster is.
    • Distance between the two furthest points within the same cluster (Complete-Linkage distance). \[\Delta(C_l) = \max_{x, y \in C_l} d(x, y)\]
Measure of Closeness
  • Single Linkage (MIN): Uses the minimum distance between any two points in different clusters.
  • Complete Linkage (MAX): Uses the maximum distance between any two points in the same cluster.
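
A small NumPy/SciPy sketch of the Dunn Index using exactly these single-linkage and complete-linkage definitions (the function name dunn_index is illustrative):

import numpy as np
from scipy.spatial.distance import cdist

def dunn_index(X, labels):
    clusters = [X[labels == c] for c in np.unique(labels)]
    # delta(C_i, C_j): minimum distance between points of two different clusters (single linkage)
    min_separation = min(cdist(ci, cj).min()
                         for i, ci in enumerate(clusters)
                         for cj in clusters[i + 1:])
    # Delta(C_l): maximum distance between two points of the same cluster (complete linkage)
    max_diameter = max(cdist(c, c).max() for c in clusters)
    return min_separation / max_diameter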



End of Section

6 - Silhouette Score

Silhouette Score

How to Evaluate Quality of Clustering?
  • Elbow Method: Quickest to compute; good for initial EDA.
  • Dunn Index: Focuses on the ‘gap’ between the closest clusters.
    ----- We have seen the above two methods in the previous section -----
  • 👉 Silhouette Score: Balances compactness and separation.
  • 👉 Domain specific knowledge and system constraints.
Silhouette Score [-1, 1]

⭐️Clustering quality evaluation metric that measures how similar a data point is to its own cluster (cohesion) compared to other clusters (separation).

Note: Higher scores (closer to 1) indicate better-defined, distinct clusters, while scores near 0 suggest overlapping clusters, and negative scores mean points might be in the wrong cluster.

Silhouette Score Formula

Silhouette score for point ‘i’ is the difference between separation b(i) and cohesion a(i), normalized by the larger of the two.

\[ s(i) = \frac{b(i) - a(i)}{\max(a(i), b(i))} \]

Note: The Global Silhouette Score is simply the mean of s(i) for all points in the dataset.

👉Example for Silhouette Score:

images/machine_learning/unsupervised/k_means/silhouette_score/slide_04_01.png

👉Example of a Silhouette Score of 0 (border point) and a negative score (wrong cluster):

images/machine_learning/unsupervised/k_means/silhouette_score/slide_05_01.png

🦉Now let’s understand the terms in Silhouette Score in detail.

Cohesion a(i)

Average distance between point ‘i’ and all other points in the same cluster.

\[a(i) = \frac{1}{|C_A| - 1} \sum_{j \in C_A, i \neq j} d(i, j)\]

Note: Lower a(i) means the point is well-matched to its own cluster.

Separation b(i)

Average distance between point ‘i’ and all points in the nearest neighboring cluster (the cluster that ‘i’ is not a part of, but is closest to).

\[b(i) = \min_{C_B \neq C_A} \frac{1}{|C_B|} \sum_{j \in C_B} d(i, j)\]

Note: Higher b(i) means the point is very far from the next closest cluster.
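
In practice these quantities rarely need to be computed by hand; a minimal scikit-learn sketch, assuming X (data matrix) and labels (cluster assignments) already exist:

import numpy as np
from sklearn.metrics import silhouette_score, silhouette_samples

global_score = silhouette_score(X, labels)  # mean of s(i) over all points
per_point = silhouette_samples(X, labels)   # s(i) for each point (what silhouette plots visualize)

print(f"Global silhouette score: {global_score:.3f}")
for c in np.unique(labels):
    print(f"Cluster {c}: mean s(i) = {per_point[labels == c].mean():.3f}")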

Silhouette Plot

⭐️A silhouette plot is a graphical tool used to evaluate the quality of clustering algorithms (like K-Means), showing how well each data point fits within its cluster.

👉Each bar gives the average silhouette score of the points assigned to that cluster.

images/machine_learning/unsupervised/k_means/silhouette_score/slide_09_01.png
Geometric Interpretation
  • ⛳️ Like K-Means, the Silhouette Score (when using Euclidean distance) assumes convex clusters.

  • 🌘 If we use it on ‘Moon’ shaped clusters, it will give a low score even when the two moons are perfectly separated: points at the tip of one moon are far from much of their own cluster (large a(i)) yet close to the other moon (small b(i)), because the clusters curve around each other.

    images/machine_learning/unsupervised/k_means/silhouette_score/slide_11_01.png
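
A small demonstration of this limitation using scikit-learn's make_moons (the exact score will vary with the noise level and random seed):

from sklearn.cluster import KMeans
from sklearn.datasets import make_moons
from sklearn.metrics import silhouette_score

# Two interleaving half-moons: clearly separated, but not convex.
X, _ = make_moons(n_samples=500, noise=0.05, random_state=0)
labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)

# The score tends to be well below 1 here, even though the moons are cleanly separated,
# because both K-Means and the Euclidean silhouette assume convex, blob-like clusters.
print(silhouette_score(X, labels))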
Silhouette Score Vs Dunn Index
images/machine_learning/unsupervised/k_means/silhouette_score/silhouette_score_vs_dunn_index.png

Choose Silhouette Score if:
✅ Need a human-interpretable metric to present to stakeholders.
✅ Dealing with real-world noise and overlapping ‘fuzzy’ boundaries.
✅ Want to see which specific clusters are weak (using the plot).

Choose Dunn Index if:
✅ Performing ‘Hard Clustering’ where separation is a safety or business requirement.
✅ Data is pre-cleaned of outliers (e.g., in a curated embedding space).
✅ Need to compare different clustering algorithms (e.g., K-Means vs. DBSCAN) on a high-integrity task.



End of Section