Machine Learning System

1 - Data Distribution Shift

Distribution Shift or Data Drift 🦣
⭐️ The data a model works with changes over time ⏰, which causes the model’s predictions to become less accurate as time passes ⏳.
Bayes' Theorem
\[P(Y|X)=\frac{P(X|Y)\cdot P(Y)}{P(X)}\]
  • P(X|Y): Likelihood of the input X given the output Y.
  • P(Y|X): Posterior probability of the output Y given the input X (what the model learns).
  • P(Y): Prior probability of the output Y.
  • P(X): Evidence (marginal probability of the input X).
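As a quick worked example (the numbers are assumptions for illustration): if the prior is P(Y) = 0.01, the likelihood is P(X|Y) = 0.9, and the evidence is P(X) = 0.1, then
\[P(Y|X)=\frac{0.9\times 0.01}{0.1}=0.09\]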
Covariate Shift (P(X) Changes)

⭐️ The input data distribution seen during training is different from the distribution seen during inference.

👉 P(X) (input) changes, but P(Y|X) (model) remains the same (see the detection sketch below).

  • e.g. Self-driving car 🚗 trained on a bright, sunny day is used during foggy winter.
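A minimal sketch of one way to detect this on a single feature, assuming we kept a sample of the training inputs (train_feature_values and serving_feature_values are hypothetical arrays):

```python
from scipy.stats import ks_2samp

# Compare the training-time and serving-time distributions of one feature
ks_stat, p_value = ks_2samp(train_feature_values, serving_feature_values)

# A small p-value suggests the two samples come from different distributions,
# i.e. P(X) has shifted for this feature
if p_value < 0.01:
    print(f"Possible covariate shift (K-S statistic = {ks_stat:.3f})")
```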
Label Shift or Prior Probability Shift (P(Y) Changes)

⭐️ The output distribution changes, but for a given output, the input distribution remains the same.

👉 P(Y) (output) changes, but P(X|Y) remains the same.

  • 😷 e.g. A flu-detection model is trained during summer, when only 1% of patients have flu.
    • The same model is used during winter, when 40% of patients have flu.
    • 🍎 The prior probability of having flu, P(Y), has changed from 1% to 40%, but the distribution of symptoms given flu, P(X|Y), remains the same (see the sketch below).
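A minimal sketch of how a shifted prior alone changes the model’s posterior; the likelihood values are assumptions for illustration:

```python
def posterior(prior, lik_pos, lik_neg):
    # Bayes' rule with the evidence expanded via the law of total probability:
    # P(Y|X) = P(X|Y) * P(Y) / (P(X|Y) * P(Y) + P(X|not Y) * P(not Y))
    return lik_pos * prior / (lik_pos * prior + lik_neg * (1 - prior))

P_SYMPTOMS_GIVEN_FLU = 0.90     # assumed P(X|Y); unchanged across seasons
P_SYMPTOMS_GIVEN_NO_FLU = 0.05  # assumed P(X|not Y); unchanged across seasons

print(posterior(0.01, P_SYMPTOMS_GIVEN_FLU, P_SYMPTOMS_GIVEN_NO_FLU))  # summer: ~0.15
print(posterior(0.40, P_SYMPTOMS_GIVEN_FLU, P_SYMPTOMS_GIVEN_NO_FLU))  # winter: ~0.92
```

Even though P(X|Y) never changed, the correct posterior moves from roughly 15% to 92%, which is why a model trained on summer data underestimates flu in winter.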
Concept Drift or Posterior Shift (P(Y|X) Changes)

⭐️ The relationship between inputs and outputs changes.
i.e., the very definition of what you are trying to predict changes.

👉 Concept drift is often cyclic or seasonal, but it can also be sudden, as in the example below.

  • e.g. ‘Normal’ spending behavior in 2019 became ‘Abnormal’ during 2020 lockdowns 🔐.




2 - Retraining Strategies

Why Retrain 🦮 a ML Model?

⭐️ In a production ML environment, retraining is the ‘maintenance engine’ ⚙️ that keeps our models from becoming obsolete.

❌ Don’t ask: When do we retrain?

✅ Ask: “How do we automate the decision to retrain while balancing compute cost 💰, model risk, and data freshness?”

Periodic Retraining (Fixed Interval) ⏳

👉 The model is retrained on a regular schedule (e.g., daily, weekly, or monthly).

  • Best for:
    • Stable environments where data changes slowly.
      (e.g. long-term demand forecast or a credit scoring model).
  • Pros:
    • Highly predictable; easy to schedule compute resources; simple to implement via a cron job or Airflow DAG (see the sketch after this list).
  • Cons:
    • Inefficient. You might retrain when not needed (wasting money 💵) or fail to retrain during a sudden market shift (losing accuracy).
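A minimal sketch of this as an Airflow DAG, where retrain_model() is a hypothetical function that loads fresh data, fits, validates, and registers the new model:

```python
from datetime import datetime

from airflow import DAG
from airflow.operators.python import PythonOperator

def retrain_model():
    # Hypothetical pipeline: load fresh data, fit, validate, register the model
    ...

with DAG(
    dag_id="weekly_model_retrain",
    schedule_interval="@weekly",  # the fixed interval: retrain every week
    start_date=datetime(2024, 1, 1),
    catchup=False,                # don't backfill runs for past weeks
) as dag:
    PythonOperator(task_id="retrain", python_callable=retrain_model)
```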
Trigger-Based Retraining (Reactive) 🔫

👉 Retraining is initiated only when a specific performance or data metric crosses a pre-defined threshold.

  • Metric Triggers:
    • Performance Decay: A drop in Precision, Recall, or RMSE (requires ground-truth labels).
    • Drift Detection: A high PSI (Population Stability Index) or Kolmogorov–Smirnov (K-S) test statistic indicating covariate shift (see the PSI sketch after this list).
  • Pros:
    • Cost-effective; reacts to the ‘reality’ of the data rather than the calendar.
  • Cons:
    • Requires a robust monitoring stack 📺.
      If the ‘trigger’ logic is buggy, the model may never update.
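A minimal sketch of a PSI-based trigger for one continuous feature, binning serving data against the training data’s quantiles; trigger_retraining() and the sample arrays are hypothetical, and the 0.25 threshold is a common rule of thumb rather than a universal constant:

```python
import numpy as np

def population_stability_index(expected, actual, bins=10):
    """PSI between a training-time sample (expected) and a serving-time sample (actual)."""
    edges = np.quantile(expected, np.linspace(0, 1, bins + 1))  # bins from training data
    actual = np.clip(actual, edges[0], edges[-1])  # force serving values into range
    e_frac = np.histogram(expected, bins=edges)[0] / len(expected)
    a_frac = np.histogram(actual, bins=edges)[0] / len(actual)
    e_frac = np.clip(e_frac, 1e-6, None)  # avoid log(0) for empty bins
    a_frac = np.clip(a_frac, 1e-6, None)
    return float(np.sum((a_frac - e_frac) * np.log(a_frac / e_frac)))

# Rule of thumb: < 0.1 stable, 0.1–0.25 moderate shift, > 0.25 significant shift
if population_stability_index(train_sample, serving_sample) > 0.25:
    trigger_retraining()
```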
Continual Learning (Online/Incremental) 🛜

👉 Instead of retraining from scratch on a massive batch, the model is updated incrementally as new data ‘streams’ into the system.

  • Mechanism: ‘Warm starts’, where the model weights from the previous version are used as the starting point for the next few gradient-descent steps (see the sketch after this list).
  • Best for:
    • Recommendation engines (Netflix/TikTok) or High-Frequency Trading 💰, where patterns change by the minute.
  • Pros:
    • Extreme ‘freshness’; low latency between data arrival and model update.
  • Cons:
    • High risk of ‘Catastrophic Forgetting’ (the model forgets old patterns) and high infrastructure complexity.
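A minimal sketch of incremental updates using scikit-learn’s partial_fit, which in effect warm-starts each update from the current weights; stream_of_batches() is a hypothetical source of new data:

```python
import numpy as np
from sklearn.linear_model import SGDClassifier

model = SGDClassifier(loss="log_loss")
all_classes = np.array([0, 1])  # partial_fit needs the full label set up front

for X_batch, y_batch in stream_of_batches():  # hypothetical stream of fresh data
    # Each call runs gradient updates starting from the current weights,
    # instead of refitting from scratch on the entire history
    model.partial_fit(X_batch, y_batch, classes=all_classes)
```

Guarding against catastrophic forgetting (e.g. by replaying a sample of older data into each batch) is left out of this sketch but matters in practice.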




3 - Deployment Patterns

Deploy 🖥️

⭐️ In a production ML environment, retraining is only half the battle; we must also safely deploy the new version.

Types of deployment (most common):

  • Shadow ❏ Deployment
  • A/B Testing 🧪
  • Canary 🦜 Deployment
Shadow ❏ Deployment

👉 The safest way to deploy a model or any software update.

  • Deploy the candidate model in parallel with the existing model.
  • For each incoming request, route it to both models to make predictions, but only serve the existing model’s prediction to the user.
  • Log the predictions from the new model for analysis purposes.

Note: When the new model’s predictions are satisfactory, we replace the existing model with the new model.

[Figure: Shadow deployment]
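A minimal sketch of the request path, with hypothetical primary_model and candidate_model objects; in practice the shadow call would usually run asynchronously so it cannot add user-facing latency:

```python
import logging

def handle_request(features, primary_model, candidate_model):
    # The user only ever receives the existing (primary) model's prediction
    prediction = primary_model.predict(features)

    # The candidate scores the same input; its output is logged, never served
    try:
        shadow_prediction = candidate_model.predict(features)
        logging.info("shadow=%s primary=%s", shadow_prediction, prediction)
    except Exception:
        logging.exception("shadow model failed")  # must never affect the user

    return prediction
```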
A/B Testing 🧪

👉 A/B testing is a way to compare two variants of a model.

  • Deploy the candidate model in parallel with the existing model.
  • A percentage of traffic 🚦 is routed to the candidate for predictions; the rest is routed to the existing model.
  • Monitor 📺 and analyze the predictions from both models to determine whether the difference in the two models’ performance is statistically significant.

Note: Say we run a two-sample test and find that model A is better than model B with a p-value of 0.05 (5%). This means that if the two models actually performed the same, a difference this large would be observed only 5% of the time, so we can reasonably conclude that A is better.

[Figure: A/B testing]
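A minimal sketch of deterministic traffic splitting plus a two-sample test; assign_variant and the outcome arrays are hypothetical:

```python
import hashlib

from scipy.stats import ttest_ind

def assign_variant(user_id: str, candidate_share: float = 0.10) -> str:
    # Hash-based bucketing: the same user is always routed to the same variant
    bucket = int(hashlib.md5(user_id.encode()).hexdigest(), 16) % 100
    return "candidate" if bucket < candidate_share * 100 else "existing"

# Once enough labeled outcomes have accumulated for each variant:
t_stat, p_value = ttest_ind(candidate_outcomes, existing_outcomes)
if p_value < 0.05:
    print("The difference between the two models is statistically significant")
```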
Canary 🦜 Deployment

👉 Mitigates deployment risk by incrementally shifting traffic 🚦from a model version to a new version, allowing for real-world validation on a subset of users before a full-scale rollout.

  • Deploy the candidate model in parallel with the existing model.
  • A percentage of traffic 🚦 is routed to the candidate for predictions.
  • If its performance is satisfactory, increase the traffic to the candidate model. If not, abort the canary and route all the traffic 🚦 back to the existing model.
  • Stop when either the canary serves all the traffic 🚦 (the candidate model has replaced the existing model) or when the canary is aborted.

Note: Canary releases can be used to implement A/B testing due to the similarities in their setups. However, we can do canary analysis without A/B testing.

[Figure: Canary deployment]
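A minimal sketch of the rollout loop, where set_traffic_split and canary_is_healthy are hypothetical hooks into the traffic router and the monitoring stack:

```python
import time

TRAFFIC_STEPS = [1, 5, 25, 50, 100]  # percent of traffic sent to the canary

def run_canary(set_traffic_split, canary_is_healthy):
    for pct in TRAFFIC_STEPS:
        set_traffic_split(candidate_pct=pct)
        time.sleep(15 * 60)  # let metrics accumulate at this traffic level
        if not canary_is_healthy():
            set_traffic_split(candidate_pct=0)  # abort: route everything back
            return "aborted"
    return "promoted"  # the candidate now serves all the traffic
```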


