Clustering Quality Metrics


How to Evaluate the Quality of Clustering?
  • 👉 Elbow Method: Quickest to compute; good for initial EDA (Exploratory Data Analysis).
  • 👉 Dunn Index: Focuses on the ‘gap’ between the closest clusters.
  • 👉 Silhouette Score: Balances compactness and separation.
  • 👉 Domain-specific knowledge and system constraints.
Elbow Method

A heuristic for choosing the number of clusters (k): plot a clustering-quality measure, typically the within-cluster sum of squares (WCSS), against ‘k’ and observe how it improves as ‘k’ increases.

🎯The goal is to find the value of ‘k’ beyond which adding more clusters yields diminishing returns in variance reduction; this bend in the curve is the ‘elbow’.

images/machine_learning/unsupervised/k_means/clustering_quality_metrics/slide_02_01.png
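The elbow heuristic can be sketched in a few lines of plain Python. This is only a minimal illustration under stated assumptions: the toy points, the helper names (`sq_dist`, `kmeans_wcss`), the deterministic farthest-point initialisation, and the "largest second difference" rule for locating the elbow are all choices made for this sketch, not part of the method's definition. In practice the elbow is usually read off the plot by eye.

```python
def sq_dist(p, q):
    """Squared Euclidean distance between two points."""
    return sum((a - b) ** 2 for a, b in zip(p, q))


def kmeans_wcss(points, k, n_iter=100):
    """Run a basic Lloyd's k-means and return the within-cluster sum of squares."""
    # Assumption: deterministic farthest-point initialisation (not k-means++).
    centroids = [points[0]]
    while len(centroids) < k:
        centroids.append(
            max(points, key=lambda p: min(sq_dist(p, c) for c in centroids))
        )
    for _ in range(n_iter):
        # Assignment step: attach each point to its nearest centroid.
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda i: sq_dist(p, centroids[i]))
            clusters[nearest].append(p)
        # Update step: move each centroid to the mean of its cluster
        # (an empty cluster keeps its previous centroid).
        updated = [
            tuple(sum(axis) / len(c) for axis in zip(*c)) if c else centroids[i]
            for i, c in enumerate(clusters)
        ]
        if updated == centroids:
            break
        centroids = updated
    return sum(min(sq_dist(p, c) for c in centroids) for p in points)


# Hypothetical toy data: three tight 2x2 blobs.
points = [
    (0, 0), (0, 1), (1, 0), (1, 1),
    (10, 10), (10, 11), (11, 10), (11, 11),
    (0, 10), (0, 11), (1, 10), (1, 11),
]

# WCSS for k = 1..6; this is the curve one would plot.
wcss_by_k = {k: kmeans_wcss(points, k) for k in range(1, 7)}

# One possible elbow rule (an assumption): the k with the largest
# discrete second difference of the WCSS curve.
curvature = {
    k: wcss_by_k[k - 1] - 2 * wcss_by_k[k] + wcss_by_k[k + 1]
    for k in range(2, 6)
}
elbow_k = max(curvature, key=curvature.get)

for k, wcss in wcss_by_k.items():
    print(f"k={k}: WCSS={wcss:.2f}")
print("elbow at k =", elbow_k)
```

On this toy data the WCSS falls steeply up to k = 3 and barely changes afterwards, which is the diminishing-returns pattern the heuristic looks for.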
Dunn Index \([0, \infty)\)

⭐️A clustering quality metric that compares separation (distance between clusters) with compactness (spread within clusters).

Note: A higher Dunn Index value indicates better clustering, meaning clusters are well-separated from each other and compact.

👉Dunn Index Formula:

\[DI = \frac{\text{Minimum Inter-Cluster Distance (between different clusters)}}{\text{Maximum Intra-Cluster Distance (within a cluster)}}\]

\[DI = \frac{\min_{1 \le i < j \le k} \delta(C_i, C_j)}{\max_{1 \le l \le k} \Delta(C_l)}\]
images/machine_learning/unsupervised/k_means/clustering_quality_metrics/slide_06_01.png

👉Let’s understand the terms in the above formula:

  • \(\delta(C_i, C_j)\) (Inter-Cluster Distance):

    • Measures how ‘far apart’ the clusters are.
    • Distance between the two closest points of different clusters (Single-Linkage distance). \[\delta(C_i, C_j) = \min_{x \in C_i, y \in C_j} d(x, y)\]
  • \(\Delta(C_l)\) (Intra-Cluster Diameter):

    • Measures how ‘spread out’ a cluster is.
    • Distance between the two furthest points within the same cluster (Complete-Linkage distance). \[\Delta(C_l) = \max_{x, y \in C_l} d(x, y)\]
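The two terms above translate almost directly into code. The following is a minimal sketch, assuming Euclidean distance for \(d(x, y)\) and clusters given as lists of coordinate tuples; the function name `dunn_index` and the toy clusters are hypothetical. It compares every pair of points, so it is only suitable for small datasets.

```python
import math
from itertools import combinations


def dunn_index(clusters):
    """Dunn Index for a list of clusters, each a list of points (tuples)."""
    if len(clusters) < 2:
        raise ValueError("need at least two clusters")
    # Numerator: smallest single-linkage distance delta(C_i, C_j)
    # over all pairs of different clusters.
    min_inter = min(
        math.dist(x, y)
        for ci, cj in combinations(clusters, 2)
        for x in ci
        for y in cj
    )
    # Denominator: largest diameter Delta(C_l) over all clusters.
    max_intra = max(
        (math.dist(x, y) for c in clusters for x, y in combinations(c, 2)),
        default=0.0,
    )
    if max_intra == 0.0:
        # Every cluster is a single point (or duplicates): treat as infinite.
        return math.inf
    return min_inter / max_intra


# Hypothetical toy data: three compact, well-separated blobs.
good_clusters = [
    [(0, 0), (0, 1), (1, 0), (1, 1)],
    [(10, 10), (10, 11), (11, 10), (11, 11)],
    [(0, 10), (0, 11), (1, 10), (1, 11)],
]
# Same points, but (0, 10) is wrongly attached to the first blob.
bad_clusters = [
    [(0, 0), (0, 1), (1, 0), (1, 1), (0, 10)],
    [(10, 10), (10, 11), (11, 10), (11, 11)],
    [(0, 11), (1, 10), (1, 11)],
]

di_good = dunn_index(good_clusters)
di_bad = dunn_index(bad_clusters)
print(f"good partition: {di_good:.3f}, bad partition: {di_bad:.3f}")
```

Moving a single point into the wrong cluster both shrinks the smallest gap and stretches the largest diameter, so the index drops sharply. This sensitivity to individual points is a direct consequence of using min and max in the formula.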
Measure of Closeness
  • Single Linkage (MIN): Uses the minimum distance between any two points in different clusters.
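  • Complete Linkage (MAX): Uses the maximum distance between any two points. In the Dunn Index it is applied to points in the same cluster, giving the cluster diameter (in hierarchical clustering the term usually refers to the maximum distance between points of two different clusters).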
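
As a small worked illustration (the numbers are made up for this example), take two one-dimensional clusters \(C_1 = \{0, 1\}\) and \(C_2 = \{4, 6\}\) with \(d(x, y) = |x - y|\):

\[\delta(C_1, C_2) = \min\{|0-4|,\ |0-6|,\ |1-4|,\ |1-6|\} = 3\]

\[\Delta(C_1) = |0-1| = 1, \qquad \Delta(C_2) = |4-6| = 2\]

\[DI = \frac{\delta(C_1, C_2)}{\max\{\Delta(C_1), \Delta(C_2)\}} = \frac{3}{2} = 1.5\]

The MIN distance is taken across the two clusters, while each MAX distance is taken inside a single cluster.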


