Data Pre-Processing

Real World 🌎 Data

Messy and Incomplete.
We need to pre-process the data to make it:

  • Clean
  • Consistent
  • Mathematically valid
  • Computationally stable

👉 So that the machine learning algorithm can safely consume the data.

Missing Values
  • Missing Completely At Random (MCAR)
    • The missingness occurs entirely by chance, such as due to a technical glitch during data collection or a random human error in data entry.
  • Missing At Random (MAR)
    • The probability of missingness depends on the observed data and not on the missing value itself.
    • e.g. In a survey, the ages of many female respondents are missing because they may prefer not to disclose them; the missingness depends on gender (observed), not on the age value itself.
  • Missing Not At Random (MNAR)
    • The probability of missingness is directly related to the unobserved missing value itself.
    • e.g. Individuals with very high incomes 💰 may intentionally refuse to report their salary due to privacy concerns, making the missing data directly dependent on the high income value itself.
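
👉 A minimal sketch that simulates the three mechanisms on a hypothetical income column, to make the definitions concrete; the column names, probabilities, and thresholds are illustrative, not from the original.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
n = 1_000
df = pd.DataFrame({
    "gender": rng.choice(["F", "M"], size=n),
    "income": rng.lognormal(mean=10.5, sigma=0.5, size=n),
})

# MCAR: every value has the same 10% chance of being missing.
mcar = df["income"].mask(rng.random(n) < 0.10)

# MAR: missingness depends on an observed column (gender), not on income.
p_mar = np.where(df["gender"] == "F", 0.30, 0.05)
mar = df["income"].mask(rng.random(n) < p_mar)

# MNAR: missingness depends on the unobserved value itself (high incomes hide).
p_mnar = np.where(df["income"] > df["income"].quantile(0.90), 0.60, 0.05)
mnar = df["income"].mask(rng.random(n) < p_mnar)
```
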
Handle Missing Values (Imputation)
  • Simple Imputation:
    • Mean: Normally distributed numerical features.
    • Median: Skewed numerical features.
    • Mode: Categorical features; use the most frequent value.
  • KNN Imputation:
    • Replace the missing value with the mean/median/mode of its ‘k’ nearest (most similar) neighbors.
  • Predictive Imputation:
    • Use another ML model to estimate missing values.
  • Multivariate Imputation by Chained Equations (MICE):
    • Iteratively models each variable with missing values as a function of other variables using flexible regression models (linear regression, logistic regression, etc.) in a ‘chained’ or sequential process.
    • Creates multiple imputed datasets using slightly different random starting points, so the uncertainty of the imputations can be assessed (see the sketch below).
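
👉 A minimal sketch of these imputation strategies using scikit-learn; the tiny DataFrame, column names, and values are hypothetical.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer, KNNImputer
# IterativeImputer (a MICE-style imputer) is still experimental in scikit-learn:
from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.impute import IterativeImputer

# Hypothetical numeric data with gaps.
df = pd.DataFrame({
    "age":    [25.0, 32.0, np.nan, 41.0, 38.0],
    "income": [40_000.0, np.nan, 52_000.0, 75_000.0, 61_000.0],
})

# Simple imputation: "mean" for roughly normal features,
# "median" for skewed ones, "most_frequent" (mode) for categoricals.
simple = SimpleImputer(strategy="median").fit_transform(df)

# KNN imputation: fill a gap using the k most similar rows.
knn = KNNImputer(n_neighbors=2).fit_transform(df)

# MICE-style: each incomplete column is iteratively regressed on the others.
# This returns one completed dataset; rerun with different random_state values
# (and sample_posterior=True) to mimic true multiple imputation.
mice = IterativeImputer(max_iter=10, random_state=0).fit_transform(df)
```
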
Handle Outliers 🦄

🦄 Outliers are extreme or unusual data points that can mislead models, causing inaccurate predictions.

  • Remove invalid or corrupted data.
  • Replace (Impute): Use the median or a capped value to reduce their impact.
  • Transform: Apply a log or square-root transform to reduce skew (see the sketch below).
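
👉 A minimal sketch of these three options on a single numeric column, assuming pandas; the sample values and the 1.5 × IQR fence are illustrative conventions, not from the original.

```python
import numpy as np
import pandas as pd

# Hypothetical data: one extreme value (250) among typical readings.
s = pd.Series([12, 15, 14, 13, 16, 250])

# IQR fences (1.5 * IQR is a common rule of thumb).
q1, q3 = s.quantile(0.25), s.quantile(0.75)
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr

# Remove: drop points outside the fences.
removed = s[(s >= lower) & (s <= upper)]

# Replace (cap / winsorize): clip values to the fences instead of dropping.
capped = s.clip(lower=lower, upper=upper)

# Transform: compress the right tail; log1p handles zeros safely.
logged = np.log1p(s)
sqrted = np.sqrt(s)
```
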

👉 For example: Log and Square Root Transformed Data

images/machine_learning/feature_engineering/data_pre_processing/slide_05_01.png
Scaling and Normalization

💡 If one feature ranges from 0–1 and another from 0–1000, the feature with the larger scale will dominate the model.

  • Standardization (Z-score):
    • Rescales to mean μ = 0 and standard deviation σ = 1; less sensitive to outliers than min-max.
    • \(x_{\text{std}} = \frac{x - \mu}{\sigma}\)
  • Min-Max Normalization:
    • Maps data to a specific range, typically [0, 1]; sensitive to outliers.
    • \(x_{\text{minmax}} = \frac{x - x_{\min}}{x_{\max} - x_{\min}}\)
  • Robust Scaling:
    • Transforms features using the median and IQR; resilient to outliers (see the sketch below).
    • \(x_{\text{scaled}} = \frac{x - \text{median}}{\text{IQR}}\)
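
👉 A minimal sketch comparing the three scalers via scikit-learn; the sample column (with 1000 as an outlier) is hypothetical.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler, MinMaxScaler, RobustScaler

# One feature, five samples; 1000 is an outlier.
X = np.array([[1.0], [2.0], [3.0], [4.0], [1000.0]])

x_std    = StandardScaler().fit_transform(X)   # mean 0, std 1; the outlier inflates both mu and sigma
x_minmax = MinMaxScaler().fit_transform(X)     # maps to [0, 1]; the outlier squashes the rest near 0
x_robust = RobustScaler().fit_transform(X)     # (x - median) / IQR; resilient to the outlier
```
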

👉 Standardization Example

images/machine_learning/feature_engineering/data_pre_processing/slide_07_01.png



End of Section