K Nearest Neighbors (KNN) Introduction

Issues with Linear/Logistic Regression
  • Parametric models:
    • Rely on the assumption that the relationship between features and target is linear.
    • For polynomial regression we must choose the degree of the polynomial.
  • Training:
    • We need to train 🏃‍♂️ the model before it can make predictions.
K Nearest Neighbors
  • Simple: An intuitive way to classify data or predict values by finding similar existing data points (neighbors).

  • Non-Parametric: Makes no assumptions about the underlying data distribution.

  • No Training Required: KNN is a ‘lazy learner’: it does not require a formal training 🏃‍♂️ phase.

KNN Algorithm

Given a query point \(x_q\) and a dataset \(D = \{(x_i, y_i)\}_{i=1}^{n}\), where \(x_i \in \mathbb{R}^d\), the algorithm finds the set of ‘k’ nearest neighbors \(\mathcal{N}_k(x_q) \subseteq D\).

Inference:

  1. Choose a value of ‘k’ (hyper-parameter); an odd ‘k’ is preferred in binary classification to avoid ties.
  2. Calculate the distance (Euclidean, cosine, etc.) between \(x_q\) and every point in the dataset, and store the results in a distance list.
  3. Sort the distance list in ascending order and choose the top ‘k’ data points.
  4. Make prediction:
    • Classification: Take a majority vote of the ‘k’ nearest neighbors’ labels and assign the winning label.
    • Regression: Take the mean/median of the ‘k’ nearest neighbors’ target values (formalized just below).
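
Written in the notation above, the two step-4 prediction rules (using the mean for regression) are:

\[ \hat{y}_q = \operatorname{mode}\{\, y_i : (x_i, y_i) \in \mathcal{N}_k(x_q) \,\} \qquad \text{(classification)} \]

\[ \hat{y}_q = \frac{1}{k} \sum_{(x_i, y_i) \in \mathcal{N}_k(x_q)} y_i \qquad \text{(regression)} \]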

Note: The entire dataset must be stored, since every point is needed at inference time.
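
A minimal Python/NumPy sketch of the brute-force procedure above; the function name knn_predict and the toy data are illustrative choices, not from the original notes.

import numpy as np
from collections import Counter

def knn_predict(X, y, x_q, k=3, task="classification"):
    """Brute-force KNN following the four steps above.
    X: (n, d) array of points, y: (n,) targets, x_q: (d,) query point."""
    # Step 2: distance from the query to every stored point -- O(nd).
    distances = np.linalg.norm(X - x_q, axis=1)
    # Step 3: sort and keep the indices of the k nearest points -- O(n log n).
    nearest = np.argsort(distances)[:k]
    # Step 4: majority vote (classification) or mean (regression).
    if task == "classification":
        return Counter(y[nearest]).most_common(1)[0][0]
    return y[nearest].mean()

# Toy example: two classes in 2 dimensions.
X = np.array([[0.0, 0.0], [0.1, 0.2], [1.0, 1.0], [0.9, 1.1]])
y = np.array([0, 0, 1, 1])
print(knn_predict(X, y, np.array([0.05, 0.1]), k=3))  # -> 0

For regression, call it with task="regression" to return the mean of the neighbors’ target values instead of a vote.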

Time & Space Complexity
  • Storing Data: Space Complexity: O(nd)
  • Inference: Time Complexity ⏰: O(nd + n log n)

Explanation:

  • Distance to all ‘n’ points in ‘d’ dimensions: O(nd)
  • Sorting all ‘n’ data points: O(n log n)

Note: Brute-force 🔨 KNN is impractical when ‘n’ is very large, say billions of points.
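
One partial remedy, shown as a hedged sketch: since only the ‘k’ smallest distances matter, the full O(n log n) sort can be replaced with a partial selection using NumPy’s argpartition (the helper name k_nearest_indices is my own).

import numpy as np

def k_nearest_indices(X, x_q, k):
    """Indices of the k nearest points to x_q, without a full sort (assumes k < n)."""
    distances = np.linalg.norm(X - x_q, axis=1)     # O(nd)
    candidates = np.argpartition(distances, k)[:k]  # O(n) average, unordered
    # Order only the k candidates if ranked neighbors are needed -- O(k log k).
    return candidates[np.argsort(distances[candidates])]

This cuts the sorting term to roughly O(n + k log k), but the O(nd) distance pass remains; at billion-point scale, approximate nearest-neighbor indexes are the usual fix.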



End of Section