1 - Data Pre Processing

Data Pre Processing
Real World 🌎 Data

Messy and Incomplete.
We need to pre-process the data to make it:

  • Clean
  • Consistent
  • Mathematically valid
  • Computationally stable

👉 So that the machine learning algorithm can safely consume the data.

Missing Values
  • Missing Completely At Random (MCAR)
    • The missingness occurs entirely by chance, such as due to a technical glitch during data collection or a random human error in data entry.
  • Missing At Random (MAR)
    • The probability of missingness depends on the observed data and not on the missing value itself.
    • e.g. In a survey, the age of many female respondents is missing because they may prefer not to disclose it; the missingness depends on the observed gender.
  • Missing Not At Random (MNAR)
    • The probability of missingness is directly related to the unobserved missing value itself.
    • e.g. Individuals with very high incomes 💰 may intentionally refuse to report their salary due to privacy concerns, making the missing data directly dependent on the high income 💰 value itself.
Handle Missing Values (Imputation)
  • Simple Imputation:
    • Mean: Normally distributed numerical features.
    • Median: Skewed numerical features.
    • Mode: Categorical features, most frequent.
  • KNN Imputation:
    • Replace the missing value with the mean/median/mode of the ‘k’ nearest (most similar) neighbors of the row containing the missing value.
  • Predictive Imputation:
    • Use another ML model to estimate missing values.
  • Multivariate Imputation by Chained Equations (MICE):
    • Iteratively models each variable with missing values as a function of other variables using flexible regression models (linear regression, logistic regression, etc.) in a ‘chained’ or sequential process.
    • Creates multiple imputed datasets, each from a slightly different random starting point.
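
The simple imputation strategies above can be sketched with the standard library alone. This is a minimal illustration, assuming missing entries are represented as `None`; the helper name `impute` and the sample values are hypothetical.

```python
from statistics import mean, median, mode

def impute(values, strategy):
    """Replace None entries with a statistic of the observed entries."""
    observed = [v for v in values if v is not None]
    fill = {"mean": mean, "median": median, "mode": mode}[strategy](observed)
    return [fill if v is None else v for v in values]

ages = [22, 25, None, 31, 90]               # skewed by the value 90
cities = ["Paris", None, "Tokyo", "Paris"]  # categorical feature

ages_mean = impute(ages, "mean")       # gap filled with 42
ages_median = impute(ages, "median")   # gap filled with 28.0
cities_mode = impute(cities, "mode")   # gap filled with "Paris"
```

Note how the single large age pulls the mean well above the median, which is why the median is suggested for skewed features.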
Handle Outliers 🦄

🦄 Outliers are extreme or unusual data points that can mislead models and cause inaccurate predictions.

  • Remove invalid or corrupted data.
  • Replace (Impute): Median or capped value to reduce impact.
  • Transform: Apply log or square root to reduce skew.

👉 For example: Log and Square Root Transformed Data

images/machine_learning/feature_engineering/data_pre_processing/slide_05_01.png
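
As a rough sketch of the capping and transform options, the snippet below clips values using the common 1.5 × IQR rule and applies a log transform. The 1.5 multiplier, the helper name `cap_outliers`, and the sample incomes are assumptions for illustration, not part of the original material.

```python
import math
from statistics import quantiles

def cap_outliers(values, k=1.5):
    """Clip values to the range [Q1 - k*IQR, Q3 + k*IQR]."""
    q1, _, q3 = quantiles(values, n=4)
    iqr = q3 - q1
    low, high = q1 - k * iqr, q3 + k * iqr
    return [min(max(v, low), high) for v in values]

incomes = [30, 32, 35, 38, 40, 42, 45, 1000]   # one extreme value

capped = cap_outliers(incomes)                 # the extreme value is pulled to the upper fence
logged = [math.log1p(v) for v in incomes]      # log(1 + x) compresses the long right tail
```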
Scaling and Normalization

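💡 If one feature ranges from 0-1 and another from 0-1000, the larger-scale feature can dominate the model (especially distance-based and gradient-based models).

  • Standardization (Z-score):
    • Rescales to mean μ = 0 and standard deviation σ = 1; less sensitive to outliers than min-max normalization, though μ and σ are still affected by them.
    • \(x_{std} = (x - \mu) / \sigma\)
  • Min-Max Normalization:
    • Maps data to a specific range, typically [0,1]; sensitive to outliers.
    • \(x_{minmax} = (x - \min) / (\max - \min)\)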
  • Robust Scaling:
    • Transforms features using median and IQR; resilient to outliers.
    • \(x_{scaled}=(x-\text{median})/\text{IQR}\)

👉 Standardization Example

images/machine_learning/feature_engineering/data_pre_processing/slide_07_01.png
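
The three scaling formulas can be compared on a small sample with one outlier. This is a standard-library sketch; the function names are hypothetical, the population standard deviation is used for σ, and the quartiles use the "inclusive" method, which is one of several conventions.

```python
from statistics import mean, median, pstdev, quantiles

def standardize(xs):
    mu, sigma = mean(xs), pstdev(xs)
    return [(x - mu) / sigma for x in xs]

def min_max(xs):
    lo, hi = min(xs), max(xs)
    return [(x - lo) / (hi - lo) for x in xs]

def robust_scale(xs):
    q1, _, q3 = quantiles(xs, n=4, method="inclusive")
    med = median(xs)
    return [(x - med) / (q3 - q1) for x in xs]

data = [1, 2, 3, 4, 100]          # 100 is an outlier

z = standardize(data)             # mean 0, standard deviation 1
mm = min_max(data)                # non-outliers are squeezed close to 0
rs = robust_scale(data)           # non-outliers keep an even spacing around 0
```

With min-max scaling the four ordinary points all land below 0.04, while robust scaling keeps them spread between -1 and 0.5, which illustrates the difference in outlier sensitivity.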

End of Section

2 - Categorical Variables

Categorical Variables

💡 ML models operate on numerical vectors.

👉 Categorical variables must be transformed (encoded) while preserving information and semantics.

  • One Hot Encoding (OHE)
  • Label Encoding
  • Ordinal Encoding
  • Frequency/Count Encoding
  • Target Encoding
  • Hash Encoding
One Hot 🔥 Encoding (OHE)

⭐️ Use when the categorical data is nominal, i.e., has no inherent ordering.

  • Create binary columns per category.
    • e.g.: Colors: Red, Blue, Green.
    • Colors: [1,0,0], [0,1,0], [0,0,1]

Note: Use when cardinality is low, i.e., the number of unique values is small (roughly <20), since each category adds a column.
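
A minimal sketch of one-hot encoding for the colors example, using plain Python. The helper name `one_hot` is hypothetical, and columns are ordered by first appearance so the output matches the vectors listed above.

```python
def one_hot(values):
    """Return the category order and one binary row per input value."""
    categories = list(dict.fromkeys(values))       # unique values, first-seen order
    index = {c: i for i, c in enumerate(categories)}
    rows = []
    for v in values:
        row = [0] * len(categories)
        row[index[v]] = 1
        rows.append(row)
    return categories, rows

categories, encoded = one_hot(["Red", "Blue", "Green", "Blue"])
# categories == ["Red", "Blue", "Green"]
# encoded == [[1, 0, 0], [0, 1, 0], [0, 0, 1], [0, 1, 0]]
```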

Label 🏷️ Encoding

⭐️ Assigns a unique integer (e.g., 0, 1, 2) to each category.

  • When to use ?
    • For the target variable in classification problems, where the classes are unordered (nominal) and the integers serve only as class IDs.
    • e.g. encoding a city [“Paris”, “Tokyo”, “Amsterdam”] -> [1, 2, 0], (Alphabetical: Amsterdam=0, Paris=1, Tokyo=2).
  • When to avoid ?
    • For nominal data in linear models, because it can mislead the model to assume an order/hierarchy, when there is none.
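
The city example can be reproduced with a few lines of plain Python. This sketch assumes alphabetical ordering of the classes, as in the example above; the helper name `label_encode` is hypothetical.

```python
def label_encode(values):
    """Assign integers to categories in alphabetical order."""
    classes = sorted(set(values))
    mapping = {c: i for i, c in enumerate(classes)}
    return [mapping[v] for v in values], mapping

codes, mapping = label_encode(["Paris", "Tokyo", "Amsterdam"])
# codes == [1, 2, 0]
# mapping == {"Amsterdam": 0, "Paris": 1, "Tokyo": 2}
```

The integers carry no meaning beyond identity: Tokyo (2) is not "greater than" Paris (1), which is why this encoding can mislead linear models when used on input features.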
Ordinal Encoding

⭐️ When categorical data has logical ordering.

  • Best for: Ordered (ordinal) input features.

    images/machine_learning/feature_engineering/categorical_variables/slide_04_01.png
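
In contrast to label encoding, an ordinal encoding fixes the integer order by hand so that it reflects the real ranking. The sketch below uses a hypothetical size feature; the category names and their order are assumptions for illustration.

```python
# Explicit, domain-defined order (hypothetical example feature).
SIZE_ORDER = ["small", "medium", "large"]
rank = {level: i for i, level in enumerate(SIZE_ORDER)}

sizes = ["medium", "small", "large", "medium"]
encoded = [rank[s] for s in sizes]
# encoded == [1, 0, 2, 1]
```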
Frequency/Count 📟 Encoding

⭐️ Replace categories with their frequency or count in the dataset.

  • Useful for high-cardinality features where many unique values exist.

👉 Example

images/machine_learning/feature_engineering/categorical_variables/slide_06_01.png

👉 Frequency of Country

images/machine_learning/feature_engineering/categorical_variables/slide_06_03.png

👉 Country replaced with Frequency

images/machine_learning/feature_engineering/categorical_variables/slide_06_02.png
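
The country example can be sketched with `collections.Counter`. The country values below are made up for illustration and are not taken from the slides.

```python
from collections import Counter

countries = ["India", "USA", "India", "UK", "India", "USA"]  # hypothetical data

counts = Counter(countries)
count_encoded = [counts[c] for c in countries]                    # raw counts
freq_encoded = [counts[c] / len(countries) for c in countries]    # relative frequencies
# count_encoded == [3, 2, 3, 1, 3, 2]
```

One caveat of this encoding: two categories with the same count become indistinguishable after encoding.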
Target 🎯 Encoding

⭐️ Replace a category with the mean of the target variable for that specific category.

  • When to use ?
    • For high-cardinality nominal features, where one hot encoding is inefficient, e.g., zip code, product id, etc.
    • Strong correlation between the category and the target variable.
  • When to avoid ?
    • With small datasets, because the category averages (encodings) are based on too few samples, making them unrepresentative.
    • Also, it can lead to target leakage and overfitting unless proper smoothing or cross-validation techniques (like K-fold or Leave-One-Out) are used.
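
One way to sketch target encoding with smoothing is shown below. The additive smoothing formula, the `smoothing` parameter, and the helper name `target_encode` are assumptions chosen for illustration; other smoothing schemes exist.

```python
from collections import defaultdict

def target_encode(categories, targets, smoothing=5.0):
    """Per-category target mean, shrunk toward the global mean.

    encoding = (sum_of_targets + smoothing * global_mean) / (count + smoothing)
    """
    global_mean = sum(targets) / len(targets)
    sums, counts = defaultdict(float), defaultdict(int)
    for c, y in zip(categories, targets):
        sums[c] += y
        counts[c] += 1
    encoding = {
        c: (sums[c] + smoothing * global_mean) / (counts[c] + smoothing)
        for c in counts
    }
    return encoding, global_mean

train_zip = ["A", "A", "A", "B"]   # hypothetical zip-code feature (training rows only)
train_y = [1, 1, 0, 1]

encoding, global_mean = target_encode(train_zip, train_y)
unsmoothed, _ = target_encode(train_zip, train_y, smoothing=0.0)
```

Category "B" has a single sample, so its unsmoothed encoding is exactly 1.0; smoothing pulls it toward the global mean of 0.75. To limit leakage, the encoding should be fitted on training rows only, and unseen categories can fall back to `global_mean`.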
Hash 🌿 Encoding

⭐️ Maps categories to a fixed number of features using a hash function.

  • Useful for high-cardinality features where we want to limit the dimensionality.

    images/machine_learning/feature_engineering/categorical_variables/slide_09_01.png
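
A minimal sketch of hash encoding, assuming a one-hot style output of fixed length. SHA-256 is used here only because it is deterministic across runs (Python's built-in `hash` is salted for strings); the function names and `n_features=8` are illustrative choices.

```python
import hashlib

def hash_bucket(value, n_features=8):
    """Deterministically map a category string to a bucket in [0, n_features)."""
    digest = hashlib.sha256(value.encode("utf-8")).hexdigest()
    return int(digest, 16) % n_features

def hash_encode(value, n_features=8):
    """Fixed-length vector with a 1 in the hashed bucket."""
    vector = [0] * n_features
    vector[hash_bucket(value, n_features)] = 1
    return vector

encoded = hash_encode("product_12345")   # hypothetical product id
```

The output length stays at `n_features` no matter how many distinct categories appear, which is the point of the technique.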

End of Section

3 - Feature Engineering

Feature Engineering
Use domain knowledge 📕 to create new features or transform existing ones to improve model performance.
Polynomial Features

Create polynomial features such as \(x^2\), \(x^3\), etc., so that the model can learn non-linear relationships.

images/machine_learning/feature_engineering/feature_engineering/slide_04_01.png
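
For a single numeric feature, polynomial expansion can be sketched as follows. The helper name `polynomial_features` and the default degree are illustrative assumptions.

```python
def polynomial_features(x, degree=3):
    """Return [x, x^2, ..., x^degree] for one numeric value."""
    return [x ** p for p in range(1, degree + 1)]

rows = [polynomial_features(x) for x in [1.0, 2.0, 3.0]]
# rows == [[1.0, 1.0, 1.0], [2.0, 4.0, 8.0], [3.0, 9.0, 27.0]]
```

A linear model fitted on these columns can represent a cubic curve in the original `x`.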
Feature Crossing 🦓

⭐️ Combine 2 or more features to capture non-linear relationships (interactions) between them.

  • e.g. combine latitude and longitude into one location feature ‘lat-long’.
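
One possible way to build such a crossed feature is to bucket each coordinate into grid cells and join the two bucket ids. The grid approach, the `cell_size` parameter, and the helper name `lat_long_cross` are assumptions for illustration.

```python
import math

def lat_long_cross(lat, lon, cell_size=1.0):
    """Bucket each coordinate, then join the buckets into one categorical key."""
    lat_bin = math.floor(lat / cell_size)
    lon_bin = math.floor(lon / cell_size)
    return f"{lat_bin}_{lon_bin}"

paris = lat_long_cross(48.85, 2.35)      # "48_2"
nearby = lat_long_cross(48.20, 2.90)     # "48_2", same grid cell
tokyo = lat_long_cross(35.68, 139.69)    # "35_139"
```

Points in the same cell share a key, so a model can learn a separate effect per location, which neither latitude nor longitude alone can express.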
Hash 🌿 Encoding

⭐️ Memory-efficient 🧠 technique to convert categorical (string) data into a fixed-size numerical feature vector.

  • Pros:
    • Useful for high-cardinality features where we want to limit the dimensionality.
  • Cons:
    • Hash collisions.
    • Reduced interpretability.

👉 Hash Encoding (Example)

images/machine_learning/feature_engineering/feature_engineering/slide_08_01.png
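
The collision drawback can be made concrete with a small count. In this sketch, 100 hypothetical user ids are hashed into 16 buckets, so at least 84 ids must share a bucket with another id; SHA-256 and the names used are illustrative choices.

```python
import hashlib
from collections import Counter

def bucket(value, n_buckets):
    """Deterministic bucket index for a category string."""
    digest = hashlib.sha256(value.encode("utf-8")).hexdigest()
    return int(digest, 16) % n_buckets

ids = [f"user_{i}" for i in range(100)]          # hypothetical high-cardinality feature
per_bucket = Counter(bucket(v, 16) for v in ids)

# Each occupied bucket holds one "first" id; every other id collides with it.
collisions = len(ids) - len(per_bucket)
```

Increasing the number of buckets reduces collisions at the cost of a wider feature vector.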
Binning (Discretization)

⭐️ Group continuous numerical values into discrete categories or ‘bins’.

  • e.g. divide age into groups 18-24, 25-34, 35-44, 45-54, 55+ years, etc.
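
Binning can be sketched with `bisect`, assuming non-overlapping age groups 18-24, 25-34, 35-44, 45-54, and 55+ and ages of at least 18. The names `EDGES`, `LABELS`, and `age_bin` are hypothetical.

```python
from bisect import bisect_right

EDGES = [25, 35, 45, 55]                              # lower bound of each bin after the first
LABELS = ["18-24", "25-34", "35-44", "45-54", "55+"]

def age_bin(age):
    """Return the label of the bin that contains age (assumes age >= 18)."""
    return LABELS[bisect_right(EDGES, age)]

binned = [age_bin(a) for a in [18, 24, 25, 44, 55, 80]]
# binned == ["18-24", "18-24", "25-34", "35-44", "55+", "55+"]
```

Keeping the bins non-overlapping ensures every age maps to exactly one group.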

End of Section

4 - Data Leakage

Data Leakage

⭐️ Occurs when a model is trained using data that would not be available during real-world predictions, leading to deceptively good training and validation performance but poor real‑world 🌎 performance.
It is essentially the model ‘cheating’ by inadvertently accessing information about the target variable.

👉 Any information from the validation/test set must NOT influence training, directly or indirectly.
❓ So, how do we prevent information from the validation or test set from leaking into training?

Train-Test Contamination
  • ❌ Wrong: Applying preprocessing (like a global StandardScaler, mean imputation, target encoding, etc.) on the entire dataset before splitting.
  • ✅ Right: Compute the mean, variance, etc. only on the training data and reuse those values for the validation and test data.
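
The "right" approach can be sketched without any library: statistics are fitted on the training split and then reused unchanged on the test split. The helper names and sample values are hypothetical.

```python
from statistics import mean, pstdev

def fit_scaler(train):
    """Learn scaling parameters from the training data only."""
    return mean(train), pstdev(train)

def transform(values, mu, sigma):
    """Apply previously learned parameters to any split."""
    return [(v - mu) / sigma for v in values]

train, test = [10.0, 12.0, 14.0, 16.0], [18.0, 40.0]

mu, sigma = fit_scaler(train)               # test values never influence mu or sigma
train_scaled = transform(train, mu, sigma)
test_scaled = transform(test, mu, sigma)
```

The scaled test values are not centered on zero, and that is expected: the test set is treated exactly like unseen production data.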

Preventing Leakage in Cross-Validation:

  • ❌ Wrong: Performing preprocessing (e.g., scaling, normalization, missing value imputation) on the entire dataset before passing it to cross_val_score.
  • ✅ Right: Use sklearn.pipeline.Pipeline; the Pipeline fits each transformation on the training folds only and then applies it to the held-out ‘validation fold’, so that fold stays unseen during fitting.
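
A short sketch of the Pipeline approach, assuming scikit-learn is installed. The synthetic dataset, the choice of steps, and the logistic regression model are illustrative assumptions.

```python
from sklearn.datasets import make_classification
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic data stands in for a real dataset.
X, y = make_classification(n_samples=200, n_features=5, random_state=0)

pipe = Pipeline([
    ("impute", SimpleImputer(strategy="median")),  # fitted on training folds only
    ("scale", StandardScaler()),                   # fitted on training folds only
    ("model", LogisticRegression()),
])

# cross_val_score refits the whole pipeline inside every fold.
scores = cross_val_score(pipe, X, y, cv=5)
```

Because the preprocessing steps live inside the estimator, there is no point at which they can see the validation fold during fitting.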
Temporal Leakage

This happens in Time Series ⏰ data.

  • ❌ Wrong: Using standard random CV; it allows the model to ‘peek into the future’.
  • ✅ Right: Use Time-Series Nested Cross-Validation (Forward Chaining), where every training window precedes its validation window, instead of random shuffling.
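
Forward chaining can be sketched as an expanding training window followed by the next block of samples. This hand-written splitter is only an illustration, assuming rows are already sorted by time; the function name and fold sizing are assumptions, and scikit-learn's `TimeSeriesSplit` offers a comparable ready-made splitter.

```python
def forward_chaining_splits(n_samples, n_splits):
    """Yield (train_idx, test_idx) pairs where training always precedes testing."""
    fold = n_samples // (n_splits + 1)
    for k in range(1, n_splits + 1):
        train_end = k * fold
        # The last split absorbs any leftover samples.
        test_end = n_samples if k == n_splits else train_end + fold
        yield list(range(train_end)), list(range(train_end, test_end))

splits = list(forward_chaining_splits(n_samples=10, n_splits=3))
# splits[0] == ([0, 1], [2, 3])
# splits[1] == ([0, 1, 2, 3], [4, 5])
# splits[2] == ([0, 1, 2, 3, 4, 5], [6, 7, 8, 9])
```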
Target Leakage
  • ❌ Wrong: Including features that only become available after the event we are trying to predict and that act as a proxy for the target.
    • e.g. including number_of_late_payments in a model that predicts whether a person applying for a bank loan will default (late payments on that loan can only be observed after it is granted).
  • ✅ Right: Do not include such features during training.

Group Leakage:

  • ❌ Wrong: Splitting multiple correlated rows (e.g., from the same user) across sets.
    • For the same patient or user, some rows end up in Train and others in Test.
  • ✅ Right: Use GroupKFold to ensure all data from a specific group stays together in one fold.
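
A small sketch of `GroupKFold`, assuming scikit-learn and NumPy are installed. The patient ids and the toy arrays are hypothetical.

```python
import numpy as np
from sklearn.model_selection import GroupKFold

X = np.arange(12).reshape(6, 2)
y = np.array([0, 1, 0, 1, 0, 1])
groups = np.array(["p1", "p1", "p2", "p2", "p3", "p3"])  # hypothetical patient ids

# Each fold holds out whole patients; no patient appears on both sides.
folds = list(GroupKFold(n_splits=3).split(X, y, groups=groups))

held_out = [sorted(set(groups[test_idx])) for _, test_idx in folds]
```

Every entry of `held_out` contains complete groups only, so a model is always evaluated on patients it has never seen during training.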

End of Section

5 - Model Interpretability

Model Interpretability
House Price Prediction
images/machine_learning/feature_engineering/model_interpretability/slide_01_01.png
Can we explain why the model made a certain prediction ?

👉 Without this capability, the machine learning model is a black box to us.

👉 We should be able to answer which features had the most influence on the output.

⭐️ Let’s understand ‘Feature Importance’ and why interpretability of an ML model’s output is important.

Feature Importance
\[\hat{y}_i = w_0 + w_1x_{i_1} + w_2x_{i_2} + \dots + w_dx_{i_d}\]
\[|w_1| > |w_2| : f_1 \text{ is a more important feature than } f_2\]
\[ \begin{aligned} w_j &> 0: f_j \text{ is positively related to the target variable} \\ w_j &= 0: f_j \text{ has no linear relation to the target variable} \\ w_j &< 0: f_j \text{ is negatively related to the target variable} \end{aligned} \]

Note: Weights 🏋️‍♀️ represent the importance of features only when the data is standardized, so that all features are on the same scale.
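
The note about standardization can be illustrated with a small least-squares fit, assuming NumPy is installed. The data, the true coefficients (3.0 and 0.02), and the helper name `fit_weights` are made up for this sketch.

```python
import numpy as np

x1 = np.array([0.0, 1.0, 2.0, 3.0, 4.0, 5.0])                  # small-scale feature
x2 = np.array([100.0, 3000.0, 500.0, 4000.0, 200.0, 2500.0])   # large-scale feature
y = 3.0 * x1 + 0.02 * x2                                       # noise-free toy target

def fit_weights(features, target):
    """Least-squares weights of a linear model, with the bias term dropped."""
    design = np.column_stack([np.ones(len(target)), features])
    weights, *_ = np.linalg.lstsq(design, target, rcond=None)
    return weights[1:]

raw = np.column_stack([x1, x2])
standardized = (raw - raw.mean(axis=0)) / raw.std(axis=0)

w_raw = fit_weights(raw, y)            # close to [3.0, 0.02]
w_std = fit_weights(standardized, y)   # each raw weight multiplied by its feature's std
```

On raw data the first weight looks far larger, yet after standardization the second feature has the larger weight, because its values vary over a much wider range. Comparing magnitudes is only meaningful in the standardized case.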

Why Model Interpretability Matters ?

💡 Overall model behavior + Why this prediction?

  • Trust: Stakeholders must trust predictions.
  • Model Debuggability: Detect leakage, spurious correlations.
  • Feature engineering: Feedback loop.
  • Regulatory compliance: Data privacy, GDPR.
Trust

⭐️ Stakeholders Must Trust Predictions.

  • Users, executives, and clients are more likely to trust and adopt an AI system if they understand its reasoning.
  • This transparency is fundamental, especially in high-stakes applications like healthcare, finance, or law, where decisions can have a significant impact.
Model Debuggability
⭐️ By examining which features influence predictions, developers can identify if the model is using misleading or spurious correlations, or if there is data leakage (where information not available in a real-world scenario is used during training).
Feature Engineering
⭐️ Insights gained from an interpretable model can provide a valuable feedback loop for domain experts and engineers.
Regulatory Compliance

⭐️ In many industries, regulations mandate the ability to explain decisions made by automated systems.

  • For instance, the General Data Protection Regulation (GDPR) in Europe is widely interpreted as giving individuals affected by algorithmic decisions a “right to explanation”.
  • Interpretability ensures that organizations can meet these legal and ethical requirements.

End of Section