Data Pre Processing
Real World Data
Real-world data is usually messy and incomplete.
We need to pre-process the data to make it:
- Clean
- Consistent
- Mathematically valid
- Computationally stable
👉 So that the machine learning algorithm can safely consume the data.
Missing Values
- Missing Completely At Random (MCAR)
- The missingness occurs entirely by chance, such as due to a technical glitch during data collection or a random human error in data entry.
- Missing At Random (MAR)
- The probability of missingness depends on the observed data and not on the missing value itself.
- e.g. In a survey, age is missing for many female respondents because they may prefer not to disclose it; the missingness depends on the observed gender, not on the age value itself.
- Missing Not At Random (MNAR)
- The probability of missingness is directly related to the unobserved missing value itself.
- e.g. Individuals with very high incomes may intentionally refuse to report their salary due to privacy concerns, making the missing data directly dependent on the high income value itself.
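To make the three mechanisms concrete, here is a minimal sketch that masks an income column under each of them. The rows, probabilities, threshold, and function names are all made up for illustration, and only the standard library is used.

```python
import random

random.seed(0)  # fixed seed so the random masks are repeatable

# Hypothetical survey rows: (gender, income). All values are invented.
rows = [("F" if i % 2 == 0 else "M", 20_000 + 1_000 * i) for i in range(100)]

def mcar(rows, p=0.3):
    """MCAR: every income is dropped with the same probability p."""
    return [(g, None if random.random() < p else inc) for g, inc in rows]

def mar(rows, p_female=0.6, p_male=0.1):
    """MAR: the drop probability depends only on the observed gender column."""
    return [
        (g, None if random.random() < (p_female if g == "F" else p_male) else inc)
        for g, inc in rows
    ]

def mnar(rows, threshold=100_000):
    """MNAR: an income is dropped because of its own (unobserved) value."""
    return [(g, None if inc > threshold else inc) for g, inc in rows]

def observed_mean(rows):
    """Mean of the incomes that are still present."""
    values = [inc for _, inc in rows if inc is not None]
    return sum(values) / len(values)

true_mean = observed_mean(rows)
mnar_mean = observed_mean(mnar(rows))
print(true_mean, mnar_mean)
```

In this toy data the MNAR mask pulls the observed mean from 69,500 down to 60,000, because the high incomes are exactly the ones that disappear.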
Handle Missing Values (Imputation)
- Simple Imputation:
- Mean: Normally distributed numerical features.
- Median: Skewed numerical features.
- Mode: Categorical features, most frequent.
- KNN Imputation:
- Replace the missing value with the mean/median/mode of that feature across the ‘k’ nearest (most similar) rows.
- Predictive Imputation:
- Use another ML model to estimate missing values.
- Multivariate Imputation by Chained Equations (MICE):
- Iteratively models each variable with missing values as a function of other variables using flexible regression models (linear regression, logistic regression, etc.) in a ‘chained’ or sequential process.
- Creates multiple imputed datasets, each from a slightly different random starting point, so the uncertainty in the imputed values can be assessed.
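The simple and KNN strategies above can be sketched in a few lines of plain Python. This is only an illustration of the mechanics: the helper names `impute_simple` and `impute_knn` are hypothetical, missing values are represented as `None`, and the KNN variant assumes numeric features, Euclidean distance, and a mean over the neighbours.

```python
from statistics import mean, median, mode

def impute_simple(values, strategy="median"):
    """Fill None entries with the mean, median, or mode of the observed values."""
    observed = [v for v in values if v is not None]
    fill = {"mean": mean, "median": median, "mode": mode}[strategy](observed)
    return [fill if v is None else v for v in values]

def impute_knn(rows, target, k=2):
    """Fill None in column `target` with the mean of that column over the
    k nearest complete rows (Euclidean distance on the other columns)."""
    complete = [r for r in rows if r[target] is not None]
    filled = []
    for r in rows:
        if r[target] is not None:
            filled.append(list(r))
            continue

        def dist(other):
            # Compare only the columns that are not being imputed.
            return sum(
                (a - b) ** 2
                for i, (a, b) in enumerate(zip(r, other))
                if i != target
            ) ** 0.5

        neighbours = sorted(complete, key=dist)[:k]
        new_row = list(r)
        new_row[target] = mean(n[target] for n in neighbours)
        filled.append(new_row)
    return filled

skewed = [1, 2, 2, 3, None, 100]               # made-up skewed feature
print(impute_simple(skewed, "mean"))           # fill value 21.6, pulled up by 100
print(impute_simple(skewed, "median"))         # fill value 2
print(impute_simple(["red", "blue", None, "red"], "mode"))  # fill value "red"

table = [[1.0, 10.0], [2.0, 20.0], [3.0, 30.0], [10.0, 100.0], [2.5, None]]
print(impute_knn(table, target=1, k=2)[-1])    # [2.5, 25.0]
```

The skewed example shows why the median is preferred for skewed features: a single extreme value drags the mean fill to 21.6. For real work, scikit-learn offers `SimpleImputer` and `KNNImputer`; distance-based imputation generally works better when features are scaled first.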
Handle Outliers
🦄 Outliers are extreme or unusual data points that can mislead models and cause inaccurate predictions.
- Remove: Drop invalid or corrupted data.
- Replace (Impute): Median or capped value to reduce impact.
- Transform: Apply log or square root to reduce skew.
👉 For example: Log and Square Root Transformed Data
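As a rough sketch of the capping and transform options, the snippet below flags outliers, caps them, and applies log and square-root transforms. The 1.5 × IQR fence is a common convention assumed here, not something the notes above prescribe, and the sample values and helper names are invented.

```python
import math
from statistics import quantiles

def iqr_bounds(values, k=1.5):
    """Return (lower, upper) fences: Q1 - k*IQR and Q3 + k*IQR."""
    q1, _, q3 = quantiles(values, n=4)
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr

def cap(values, k=1.5):
    """Clip every value to the IQR fences (capping)."""
    lo, hi = iqr_bounds(values, k)
    return [min(max(v, lo), hi) for v in values]

data = [10, 12, 11, 13, 12, 11, 300]  # made-up sample; 300 is the outlier

lo, hi = iqr_bounds(data)
outliers = [v for v in data if v < lo or v > hi]
capped = cap(data)

# Transforms compress the right tail instead of editing individual points.
logged = [math.log1p(v) for v in data]  # log(1 + x), defined for x >= 0
rooted = [math.sqrt(v) for v in data]   # square root, defined for x >= 0

print(outliers)  # [300]
print(capped)    # the 300 is pulled down to the upper fence, 16.0
```

Capping changes only the flagged points, while the log and square-root transforms rescale every value and keep their order, so the choice depends on whether the extreme values are errors or genuine but skewed measurements.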

Scaling and Normalization
💡 If one feature ranges from 0-1 and another from 0-1000, the feature with the larger range can dominate the model (for example, in distance computations or gradient updates).
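- Standardization (Z-score):
- Rescales to μ=0, σ=1; less sensitive to outliers than min-max, although the mean and standard deviation are still affected by them.
- \(x_{std} = (x - \mu) / \sigma\)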
- Min-Max Normalization:
- Maps data to a specific range, typically [0,1]; sensitive to outliers.
- \(x_{minmax} = (x - \min) / (\max - \min)\)
- Robust Scaling:
- Transforms features using median and IQR; resilient to outliers.
- \(x_{scaled}=(x-\text{median})/\text{IQR}\)
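The three formulas translate almost directly into code. The sketch below uses only the standard library; the function names and the sample feature are hypothetical, and the population standard deviation (`pstdev`) is an assumption about which σ is meant.

```python
from statistics import mean, median, pstdev, quantiles

def standardize(xs):
    """Z-score: (x - mean) / std, giving mean 0 and std 1."""
    mu, sigma = mean(xs), pstdev(xs)
    return [(x - mu) / sigma for x in xs]

def min_max(xs):
    """Min-max: (x - min) / (max - min), giving values in [0, 1]."""
    lo, hi = min(xs), max(xs)
    return [(x - lo) / (hi - lo) for x in xs]

def robust(xs):
    """Robust: (x - median) / IQR, driven by the middle of the data."""
    q1, _, q3 = quantiles(xs, n=4)
    med = median(xs)
    return [(x - med) / (q3 - q1) for x in xs]

# Made-up feature: nine ordinary values and one extreme outlier.
feature = [1, 2, 3, 4, 5, 6, 7, 8, 9, 1000]

z = standardize(feature)
mm = min_max(feature)
rb = robust(feature)

# Spread of the nine ordinary values after each scaling.
print(z[8] - z[0])    # small: the outlier inflates sigma
print(mm[8] - mm[0])  # 8 / 999: squeezed towards 0
print(rb[8] - rb[0])  # 8 / 5.5: ordinary values keep a usable spread
```

With one outlier at 1000, min-max squeezes the nine ordinary values into less than 1% of the [0,1] range, while robust scaling leaves them spread over more than one unit. scikit-learn provides `StandardScaler`, `MinMaxScaler`, and `RobustScaler` for the same transformations.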
👉 Standardization Example
