ML System
1 - Data Distribution Shift
- P(X | Y): Likelihood of X given Y (the class-conditional distribution of the input).
- P(Y | X): Model (posterior probability of the output Y given the input X).
- P(Y): Prior probability of the output Y.
- P(X): Evidence (marginal probability of the input X).
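All four quantities are tied together by the product rule; each type of shift below corresponds to a different factor of the joint distribution changing while the others stay fixed:

P(X, Y) = P(Y | X) · P(X) = P(X | Y) · P(Y)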
⭐️ Covariate shift: the input data distribution seen during training is different from the distribution seen during inference.
👉 P(X) (input) changes, but P(Y|X) (model) remains the same.
- e.g. A self-driving car 🚗 trained on bright, sunny days is used during a foggy winter.
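One common check for covariate shift on a single numeric feature is a two-sample Kolmogorov–Smirnov test between training data and recent serving data. A minimal sketch (the ‘brightness’ feature, the sample sizes, and the 0.01 threshold are illustrative assumptions):

```python
import numpy as np
from scipy.stats import ks_2samp

# Illustrative data: an image-brightness feature seen in training vs. in production.
rng = np.random.default_rng(0)
train_brightness = rng.normal(loc=0.8, scale=0.1, size=5_000)    # sunny training days
serving_brightness = rng.normal(loc=0.3, scale=0.1, size=5_000)  # foggy winter traffic

# Two-sample K-S test compares the two empirical distributions of P(X).
statistic, p_value = ks_2samp(train_brightness, serving_brightness)

# A small p-value suggests the input distribution has shifted (covariate shift).
if p_value < 0.01:  # the threshold is a policy choice, not a universal constant
    print(f"Covariate shift detected: KS statistic={statistic:.3f}, p={p_value:.2e}")
```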
⭐️ Label shift: the output distribution changes, but for a given output, the input distribution remains the same.
👉 P(Y) (output) changes, but P(X|Y) remains the same.
- 😷 e.g. Flu-detection model is trained during summer, when only 1% of patients have flu.
- The same model is used during winter when 40% of patients have flu.
- 🍎 The prior probability of having the flu, P(Y), has changed from 1% to 40%, but the symptoms given that a person has the flu, P(X|Y), remain the same.
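To see why the changed prior matters, here is the flu example worked through Bayes’ rule. The symptom likelihoods below are made-up numbers; the point is that P(X|Y) is held fixed while P(Y) moves from 1% to 40%, and the posterior P(Y|X) the model should output changes dramatically.

```python
# Hypothetical likelihoods of showing symptoms (X) given flu status (Y).
# Under label shift these stay fixed: P(X|Y) does not change.
p_sym_given_flu = 0.90      # P(symptoms | flu)
p_sym_given_no_flu = 0.05   # P(symptoms | no flu)

def posterior_flu(prior_flu: float) -> float:
    """P(flu | symptoms) via Bayes' rule for a given prior P(flu)."""
    evidence = p_sym_given_flu * prior_flu + p_sym_given_no_flu * (1 - prior_flu)
    return p_sym_given_flu * prior_flu / evidence

print(posterior_flu(0.01))  # summer prior (1%):  P(flu | symptoms) ≈ 0.15
print(posterior_flu(0.40))  # winter prior (40%): P(flu | symptoms) ≈ 0.92
```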
⭐️ Concept drift: the relationship between inputs and outputs changes,
i.e., the very definition of what you are trying to predict changes.
👉 P(Y|X) changes, but P(X) may remain the same. Concept drifts are often cyclic or seasonal.
- e.g. ‘Normal’ spending behavior in 2019 became ‘Abnormal’ during 2020 lockdowns 🔐.
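A toy simulation of concept drift using a made-up ‘spending amount’ feature: P(X) is identical in both periods, but the labeling rule (what counts as abnormal) moves, so a model trained on 2019 data degrades in 2020 even though the inputs look the same.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(42)
X = rng.normal(size=(10_000, 1))        # spending feature; P(X) is the same in both periods

y_2019 = (X[:, 0] > 1.0).astype(int)    # 2019: only very large spend counts as 'abnormal'
y_2020 = (X[:, 0] > -0.5).astype(int)   # 2020: the definition of 'abnormal' has moved

model = LogisticRegression().fit(X, y_2019)
print("accuracy on 2019 labels:", model.score(X, y_2019))  # high
print("accuracy on 2020 labels:", model.score(X, y_2020))  # drops, because P(Y|X) changed
```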
2 - Retraining Strategies
⭐️In a production ML environment, retraining is the ‘maintenance engine’ ⚙️ that keeps our models from becoming obsolete.
❌ Don’t ask: “When do we retrain?”
✅ Ask: “How do we automate the decision to retrain while balancing compute cost 💰, model risk, and data freshness?”
👉 Scheduled (periodic) retraining: the model is retrained on a regular schedule (e.g., daily, weekly, or monthly).
- Best for:
- Stable environments where data changes slowly (e.g., long-term demand forecasting or a credit scoring model).
- Pros:
- Highly predictable; easy to schedule compute resources; simple to implement via a cron job or Airflow DAG (see the sketch after this list).
- Cons:
- Inefficient. You might retrain when not needed (wasting money 💵) or fail to retrain during a sudden market shift (losing accuracy).
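A minimal sketch of the ‘cron job or Airflow DAG’ approach mentioned above, assuming Airflow 2.x and hypothetical retrain_model / validate_model helpers from your own codebase:

```python
from datetime import datetime

from airflow import DAG
from airflow.operators.python import PythonOperator

# Hypothetical project-specific helpers; not part of Airflow.
from my_project.training import retrain_model, validate_model

with DAG(
    dag_id="weekly_model_retrain",
    schedule="@weekly",               # the calendar-based trigger (Airflow 2.4+ `schedule` arg)
    start_date=datetime(2024, 1, 1),
    catchup=False,
) as dag:
    retrain = PythonOperator(task_id="retrain", python_callable=retrain_model)
    validate = PythonOperator(task_id="validate", python_callable=validate_model)
    retrain >> validate               # validation runs after retraining; a promotion task could follow
```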
👉 Trigger-based retraining: retraining is initiated only when a specific performance or data metric crosses a pre-defined threshold.
- Metric Triggers:
- Performance Decay: A drop in Precision, Recall, or RMSE (requires ground-truth labels).
- Drift Detection: A high PSI (Population Stability Index) or K-S test score indicating covariate shift (see the PSI sketch after this list).
- Pros:
- Cost-effective; reacts to the ‘reality’ of the data rather than the calendar.
- Cons:
- Requires a robust monitoring stack 📺. If the ‘trigger’ logic is buggy, the model may never update.
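A minimal sketch of a PSI-based drift trigger. The binning scheme, the simulated data, and the 0.2 threshold (a commonly cited rule of thumb) are illustrative assumptions:

```python
import numpy as np

def psi(expected: np.ndarray, actual: np.ndarray, bins: int = 10) -> float:
    """Population Stability Index between training ('expected') and serving ('actual') samples."""
    # Quantile cut points come from the training distribution.
    cuts = np.quantile(expected, np.linspace(0, 1, bins + 1))[1:-1]
    e_frac = np.bincount(np.digitize(expected, cuts), minlength=bins) / len(expected)
    a_frac = np.bincount(np.digitize(actual, cuts), minlength=bins) / len(actual)
    # Clip to avoid log(0) when a bin is empty on one side.
    e_frac = np.clip(e_frac, 1e-6, None)
    a_frac = np.clip(a_frac, 1e-6, None)
    return float(np.sum((a_frac - e_frac) * np.log(a_frac / e_frac)))

rng = np.random.default_rng(0)
train_feature = rng.normal(0.0, 1.0, 50_000)
serve_feature = rng.normal(0.8, 1.3, 50_000)   # clearly drifted serving data

score = psi(train_feature, serve_feature)
print(f"PSI = {score:.3f}")
if score > 0.2:                                # rule-of-thumb threshold for a major shift
    print("PSI above threshold -> trigger the retraining pipeline")
```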
👉 Online (incremental) retraining: instead of retraining from scratch on a massive batch, the model is updated incrementally as new data ‘streams’ into the system.
- Mechanism: ‘warm starts’, where the model weights from the previous version are used as the starting point for the next few gradient descent steps (see the sketch after this list).
- Best for:
- Recommendation engines (Netflix/TikTok) or high-frequency trading 💰, where patterns change by the minute.
- Pros:
- Extreme ‘freshness’; low latency between data arrival and model update.
- Cons:
- High risk of ‘Catastrophic Forgetting’ (the model forgets old patterns) and high infrastructure complexity.
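A minimal sketch of the warm-start idea using scikit-learn’s partial_fit: the weights learned so far are the starting point for each new mini-batch, instead of refitting on the full history. The streamed batches here are simulated random data.

```python
import numpy as np
from sklearn.linear_model import SGDClassifier

rng = np.random.default_rng(0)
model = SGDClassifier()   # weights persist across partial_fit calls (an implicit warm start)

# The first call must declare the full set of classes.
X0, y0 = rng.normal(size=(1_000, 5)), rng.integers(0, 2, size=1_000)
model.partial_fit(X0, y0, classes=np.array([0, 1]))

# As new data streams in, take a few gradient steps from the previous weights
# instead of retraining from scratch on all historical data.
for _ in range(10):
    X_new, y_new = rng.normal(size=(200, 5)), rng.integers(0, 2, size=200)
    model.partial_fit(X_new, y_new)
```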
3 - Deployment Patterns
⭐️ In a production ML environment, retraining is only half the battle; we must also safely deploy the new version.
Types of deployment (most common):
- Shadow ❏ Deployment
- A/B Testing 🧪
- Canary 🦜 Deployment
👉 Shadow deployment is the safest way to deploy our model or any software update.
- Deploy the candidate model in parallel with the existing model.
- For each incoming request, route it to both models to make predictions, but only serve the existing model’s prediction to the user.
- Log the predictions from the new model for analysis purposes.
Note: When the new model’s predictions are satisfactory, we replace the existing model with the new model.
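A minimal sketch of the request path. live_model, candidate_model, and log_shadow_prediction are hypothetical stand-ins for your serving objects and logging pipeline:

```python
def handle_request(features):
    # The existing (live) model's prediction is the only one served to the user.
    live_pred = live_model.predict(features)

    # The candidate scores the same request in the shadow; in practice this is often
    # done asynchronously so it adds no latency to the user-facing response.
    try:
        shadow_pred = candidate_model.predict(features)
        log_shadow_prediction(features, live_pred, shadow_pred)  # for offline comparison
    except Exception:
        pass  # a failing shadow path must never affect the user

    return live_pred
```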

👉 A/B testing is a way to compare two variants of a model.
- Deploy the candidate model in parallel with the existing model.
- A percentage of traffic 🚦 is routed to the candidate for predictions; the rest is routed to the existing model for predictions.
- Monitor 📺 and analyze the predictions from both models to determine whether the difference in the two models’ performance is statistically significant.
Note: Say we run a two-sample test and conclude that model A is better than model B with a p-value of 0.05 (5%): if model A were not actually better, there would only be a 5% chance of observing a difference at least this large.
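A minimal sketch of the significance check, assuming the metric is a per-request success/failure count for each arm (the counts below are made up):

```python
import numpy as np
from scipy.stats import chi2_contingency

# Made-up outcome counts after the experiment window: [successes, failures].
model_a = [530, 4_470]   # existing model, 5,000 requests
model_b = [585, 4_415]   # candidate model, 5,000 requests

# Chi-squared test on the 2x2 contingency table (a two-sample proportion test).
chi2, p_value, dof, _ = chi2_contingency(np.array([model_a, model_b]))

if p_value < 0.05:   # the conventional 5% significance level; choose it before the test
    print(f"Difference is statistically significant (p = {p_value:.3f})")
else:
    print(f"No statistically significant difference detected (p = {p_value:.3f})")
```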

👉 Canary deployment mitigates deployment risk by incrementally shifting traffic 🚦 from the existing model version to a new version, allowing for real-world validation on a subset of users before a full-scale rollout.
- Deploy the candidate model in parallel with the existing model.
- A percentage of traffic 🚦 is routed to the candidate for predictions.
- If its performance is satisfactory, increase the traffic to the candidate model. If not, abort the canary and route all the traffic 🚦 back to the existing model.
- Stop when either the canary serves all the traffic 🚦 (the candidate model has replaced the existing model) or when the canary is aborted.
Note: Canary releases can be used to implement A/B testing due to the similarities in their setups. However, we can do canary analysis without A/B testing.
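A minimal sketch of the traffic-splitting router. existing_model and candidate_model are hypothetical serving objects, and canary_fraction is the knob an operator (or automated canary analysis) ramps up or drops to zero:

```python
import random

canary_fraction = 0.05   # start small; ramp up on good metrics, set to 0.0 to abort

def route_request(features):
    # Send a small, adjustable slice of traffic to the candidate ('canary') model.
    if random.random() < canary_fraction:
        return candidate_model.predict(features)   # canary path
    return existing_model.predict(features)        # existing model serves the rest
```

In practice the routing decision is usually keyed on a hash of the user or request ID rather than a fresh random draw, so each user consistently sees the same model version during the rollout.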
