K Nearest Neighbors

K Nearest Neighbors (KNN)

1 - KNN Introduction

K Nearest Neighbors Introduction

Issues with Linear/Logistic Regression
  • Parametric models:
    • Rely on the assumption that the relationship between the features and the target is linear.
    • For polynomial regression, we must also choose the degree of the polynomial.
  • Training:
    • We need to train 🏃‍♂️ the model before making predictions.
K Nearest Neighbors
  • Simple: Intuitive way to classify data or predict values by finding similar existing data points (neighbors).

  • Non-Parametric: Makes no assumptions about the underlying data distribution.

  • No Training Required: KNN is a ‘lazy learner’; it does not require a formal training 🏃‍♂️ phase.

KNN Algorithm

Given a query point \(x_q\) and a dataset \(D = \{(x_i, y_i)\}_{i=1}^{n}\), where \(x_i \in \mathbb{R}^d\), the algorithm finds the set of ‘k’ nearest neighbors \(\mathcal{N}_k(x_q) \subseteq D\).

Inference:

  1. Choose a value of ‘k’ (hyper-parameter); an odd number avoids ties in binary classification.
  2. Calculate the distance (Euclidean, Cosine, etc.) between \(x_q\) and every point in the dataset and store the results in a distance list.
  3. Sort the distance list in ascending order; choose the top ‘k’ data points.
  4. Make prediction:
    • Classification: Take majority vote of ‘k’ nearest neighbors and assign label.
    • Regression: Take the mean/median of ‘k’ nearest neighbors.

Note: KNN must store the entire dataset, since all computation is deferred to inference time. A minimal brute-force sketch is shown below.
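A minimal brute-force sketch of the steps above, assuming NumPy arrays (the function and variable names here are illustrative, not from the original notes):

```python
# Brute-force KNN classification sketch: distances -> sort -> majority vote.
import numpy as np
from collections import Counter

def knn_predict(X_train, y_train, x_query, k=5):
    # Step 2: Euclidean distance from the query to every training point -> O(nd)
    distances = np.linalg.norm(X_train - x_query, axis=1)
    # Step 3: indices of the k smallest distances (sorting is the O(n log n) part)
    nearest = np.argsort(distances)[:k]
    # Step 4 (classification): majority vote among the k nearest labels
    return Counter(y_train[nearest]).most_common(1)[0][0]
```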

Time & Space Complexity
  • Storing Data: Space Complexity: O(nd)
  • Inference: Time Complexity ⏰: O(nd + n log n)

Explanation:

  • Distance to all ‘n’ points in ‘d’ dimensions: O(nd)
  • Sorting all ‘n’ data points: O(n log n)

Note: Brute force 🔨 KNN is unacceptable when ‘n’ is very large, say billions.



End of Section

2 - KNN Optimizations

KNN Optimizations

Optimizations

Naive KNN has several drawbacks that the following optimizations address:

  • Standardization
  • Distance-Weighted KNN
  • Mahalanobis Distance
Standardization

⭐️Say one feature is ‘Annual Income’ (0-1M), and another feature is ‘Years of Experience’ (0-40).

👉The Euclidean distance will be almost entirely dominated by income 💵.

💡So, we standardize each feature so that it has mean \(\mu = 0\) and standard deviation \(\sigma = 1\).

\[z=\frac{x-\mu}{\sigma}\]
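A minimal standardization sketch, assuming NumPy arrays (the statistics must come from the training data and then be reused for queries; names are illustrative):

```python
# Z-score standardization: per-feature mean 0 and standard deviation 1.
import numpy as np

def standardize(X_train, X_query):
    mu = X_train.mean(axis=0)             # per-feature mean from training data
    sigma = X_train.std(axis=0) + 1e-12   # per-feature std; epsilon avoids division by zero
    return (X_train - mu) / sigma, (X_query - mu) / sigma
```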
Distance-Weighted KNN

⭐️Vanilla KNN treats the 1st nearest neighbor and the k-th nearest neighbor as equally important.

💡A neighbor that is 0.1 units away should have more influence than a neighbor that is 10 units away.

👉We assign weight 🏋️‍♀️ to each neighbor; most common strategy is inverse of squared distance.

\[w_i = \frac{1}{d(x_q, x_i)^2 + \epsilon}\]

Improvements:

  • Noise/Outlier: Reduces the impact of noisy or outlier points (distant neighbors).
  • Imbalanced Data: Closer points dominate, mitigating the impact of imbalanced data.
    • e.g: If a query point is surrounded by 2 very close ‘Class A’ points and 3 distant ‘Class B’ points, weighted 🏋️‍♀️ KNN will correctly pick ‘Class A’.
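A weighted-vote sketch of the formula above, assuming NumPy arrays (names are illustrative; `eps` is the small constant \(\epsilon\) guarding against division by zero):

```python
# Distance-weighted KNN classification: closer neighbors get larger votes.
import numpy as np

def weighted_knn_predict(X_train, y_train, x_query, k=5, eps=1e-8):
    distances = np.linalg.norm(X_train - x_query, axis=1)
    nearest = np.argsort(distances)[:k]
    weights = 1.0 / (distances[nearest] ** 2 + eps)   # w_i = 1 / (d^2 + eps)
    # Sum the weights per class; the class with the largest total wins.
    labels = y_train[nearest]
    return max(np.unique(labels), key=lambda c: weights[labels == c].sum())
```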
Mahalanobis Distance

⭐️Euclidean distance assumes that all features are independent and each provides unique information.

💡In reality, features like ‘Height’ and ‘Weight’ are highly correlated.

👉If we use Euclidean distance, we are effectively ‘double-counting’ the size of the person.

🏇Mahalanobis distance measures distance in terms of standard deviations from the mean, accounting for the covariance between features.

\[d(x, y) = \sqrt{(x - y)^T \Sigma^{-1} (x - y)}\]

\(\Sigma\): Covariance matrix of the data

  • If \(\Sigma\) is the identity matrix, Mahalanobis distance reduces to Euclidean distance.
  • If \(\Sigma\) is a diagonal matrix, Mahalanobis distance reduces to the normalized (standardized) Euclidean distance.
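A minimal Mahalanobis distance sketch, assuming NumPy arrays (the covariance is estimated from the training data; a pseudo-inverse is used in case \(\Sigma\) is singular):

```python
# Mahalanobis distance between two points, using the data's covariance matrix.
import numpy as np

def mahalanobis(x, y, X_train):
    cov = np.cov(X_train, rowvar=False)   # d x d feature covariance matrix
    cov_inv = np.linalg.pinv(cov)         # pseudo-inverse handles singular covariance
    diff = x - y
    return np.sqrt(diff @ cov_inv @ diff)
```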
Runtime Issue

🦀Naive KNN shifts all computation 💻 to inference time ⏰, and it is very slow.

  • To find the neighbors for one query, we must touch every entry of the \(n \times d\) matrix.
  • If \(n = 10^9\), a single query would take seconds, but we need milliseconds.
Advanced Optimizations
  • Space-Partitioning Trees (Exact Search)
    • K-D Trees (d < 20): Recursively partition space into axis-aligned hyper-rectangles. O(log n) average search.
    • Ball Trees: Better suited to higher-dimensional data; support metrics such as Haversine distance for geospatial 🌎 data.
  • Approximate Nearest Neighbors (ANN)
    • Locality Sensitive Hashing (LSH): Uses ‘bucketizing’ 🗑️ hashes. Points that are close have a high probability of having the same hash.
    • Hierarchical Navigable Small World (HNSW): Graph of vectors; search is a ‘greedy walk’ across levels.
  • Product Quantization (Reduce memory 🧠 footprint 👣 of high dimensional vectors)
    • ScaNN (Google)
    • FAISS (Meta)
  • Dimensionality Reduction (Mitigate ‘Curse of Dimensionality’)
    • PCA
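An illustrative way to try these ideas without building the index yourself, assuming scikit-learn is available (the data here is synthetic, purely for demonstration):

```python
# KNN with a K-D tree index and distance weighting via scikit-learn.
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

rng = np.random.default_rng(0)
X_train = rng.random((1000, 5))                   # toy data: 1000 points, 5 features
y_train = (X_train.sum(axis=1) > 2.5).astype(int)

clf = KNeighborsClassifier(n_neighbors=11, weights="distance", algorithm="kd_tree")
clf.fit(X_train, y_train)                         # 'lazy' fit: just builds the tree index
print(clf.predict(rng.random((3, 5))))            # query three new points
```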



End of Section

3 - Curse Of Dimensionality

Curse Of Dimensionality

Euclidean Distance

While Euclidean distance (\(L_2\) norm) is the most frequently discussed, the ‘Curse of Dimensionality’ impacts all Minkowski norms (\(L_p\)).

\[L_p = (\sum |x_i|^p)^{\frac{1}{p}} \]

Note: ‘Curse of Dimensionality’ is largely a function of the exponent (p) in the distance calculation.

Issues with High Dimensional Data

Coined 🪙 by mathematician Richard Bellman in the late 1950s while studying dynamic programming.

High-dimensional data creates the following challenges:

  • Distance Concentration
  • Data Sparsity
  • Exponential Sample Requirement
Distance Concentration

💡Consider a hypercube in d-dimensions of side length = 1; Volume = \(1^d\) = 1
🧊 A smaller inner cube with side length = 1 - \(\epsilon\) ; Volume = \((1 -\epsilon)^d\)

\[\lim_{d \rightarrow \infty} (1 - \epsilon)^d = 0\]

🧐 This implies that almost all the volume of the high-dimensional cube lies near the ‘crust’.
👉e.g: if \(\epsilon = 0.01\), d = 500; Volume of inner cube = \((1 - 0.01)^{500} = 0.99^{500} \approx 0.0066 \approx 0.7\%\)
🤔Consequently, all points become nearly equidistant, and the concept of ‘nearest’ or ‘neighborhood’ loses its meaning.
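A quick empirical check of this effect, assuming NumPy (exact numbers will vary with the random seed):

```python
# Distance concentration: as d grows, the gap between the nearest and farthest
# point shrinks relative to the nearest distance.
import numpy as np

rng = np.random.default_rng(0)
for d in (2, 10, 100, 1000):
    X = rng.random((1000, d))                      # 1000 uniform points in [0, 1]^d
    dists = np.linalg.norm(X[1:] - X[0], axis=1)   # distances from one point to the rest
    print(d, (dists.max() - dists.min()) / dists.min())
```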

Data Sparsity

⭐️The volume of the feature space increases exponentially with each added dimension.

👉To maintain the same data density found in a 1D space with 10 points, we would need \(10^{10}\) (10 billion) points in a 10D space.

💡Because real-world datasets are rarely this large, the data becomes “sparse,” making it difficult to find truly similar neighbors.

Exponential Sample Requirement

⭐️To maintain a reliable result, the amount of training data needed must grow exponentially with the number of dimensions.

👉Without this growth, the model is highly prone to overfitting, where it learns from noise in the ‘sparse’ data rather than actual underlying patterns.

Note: For modern embeddings (often 768 or 1536 dimensions), it is mathematically impossible to collect enough data to ‘fill’ the space.

Solution
  • Cosine Similarity
  • Normalization
Cosine Similarity

Cosine similarity measures the cosine of the angle between 2 vectors.

\[\text{cos}(\theta) = \frac{A \cdot B}{\|A\|\|B\|} = \frac{\sum_{i=1}^{D} A_i B_i}{\sqrt{\sum_{i=1}^{D} A_i^2} \sqrt{\sum_{i=1}^{D} B_i^2}}\]

Note: Cosine similarity mitigates the ‘curse of dimensionality’ problem.
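A direct translation of the formula above, assuming NumPy vectors:

```python
# Cosine similarity of two vectors: dot product over the product of their norms.
import numpy as np

def cosine_similarity(a, b):
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))
```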

Normalization

⭐️Normalize the vector, i.e., make its length = 1 (a unit vector).

💡By normalizing, we project all points onto the surface of a unit hypersphere.

  • We are no longer searching in the ‘empty’ high-dimensional volume of a hypercube.
  • Now, we are searching on a constrained manifold (the shell).

Note: By normalizing, we move the data from the volume of the D-dimensional space onto the surface of a (D-1)-dimensional hypersphere.

Euclidean Distance Squared of Normalized Vectors:

\[ \begin{align*} \|A - B\|^2 &= (A - B) \cdot (A - B) \\ &= \|A\|^2 + \|B\|^2 - 2(A \cdot B)\\ \because \|A\| &= \|B\| = 1 \\ \|A - B\|^2 &= 1 + 1 - 2\cos(\theta) \\ \therefore \|A - B\|^2 &= 2(1 - \cos(\theta))\\ \end{align*} \]

Note: This formula proves that maximizing ‘Cosine similarity’ is identical to minimizing ‘Euclidean distance’ on the hypersphere.
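A quick numerical check of this identity, assuming NumPy (random vectors, normalized to unit length):

```python
# For unit vectors, the squared Euclidean distance equals 2 * (1 - cosine similarity).
import numpy as np

rng = np.random.default_rng(0)
A, B = rng.standard_normal(128), rng.standard_normal(128)
A, B = A / np.linalg.norm(A), B / np.linalg.norm(B)   # project onto the unit hypersphere

lhs = np.sum((A - B) ** 2)          # squared Euclidean distance
rhs = 2 * (1 - np.dot(A, B))        # 2 * (1 - cos(theta))
print(np.isclose(lhs, rhs))         # True
```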



End of Section

4 - Bias Variance Tradeoff

Bias Variance Tradeoff

KNN Dataset

Let’s use this dataset to understand the impact of the number of neighbors ‘k’.

[Figure: example dataset used to illustrate the effect of ‘k’]
High Bias, Low Variance

👉If ‘k’ is very large, say k = n,

  • the model simply predicts the majority class of the entire dataset for every query point, i.e., under-fitting.
High Variance, Low Bias

👉If ‘k’ is very small, say k = 1,

  • the model is highly sensitive to noise and outliers, as it looks at only 1 nearest neighbor, i.e., over-fitting.
‘K’ Hyper-Parameter Tuning

Let’s plot Error vs ‘K’ neighbors:

[Figure: Error vs. number of neighbors ‘k’]
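A sketch of how such a curve could be produced, assuming scikit-learn (the data here is synthetic and purely illustrative):

```python
# Tune 'k' by cross-validation: report the mean validation error for each odd k.
import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

rng = np.random.default_rng(0)
X = rng.random((300, 2))
y = (X[:, 0] + X[:, 1] > 1).astype(int)      # simple synthetic labels

for k in range(1, 31, 2):                    # odd values of k only
    scores = cross_val_score(KNeighborsClassifier(n_neighbors=k), X, y, cv=5)
    print(k, round(1 - scores.mean(), 3))    # mean validation error for this k
```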
Over-Fitting Vs Under-Fitting
  • Figure 1: k=1, Over-fitting

  • Figure 2: k=n, Under-fitting

  • Figure 3: k=11, Lowest Error (Optimum)

    [Figure: decision boundaries for k=1 (over-fitting), k=n (under-fitting), and k=11 (optimum)]



End of Section