ML Forecasting in Production: The Gap Between Notebook and Reality

10 July 2024 · 7 min read

The Notebook Isn't the System

There's a pattern we see repeatedly in organisations that have invested in data science capability: excellent models that never make it to production, or that make it to production and then quietly degrade until they're no longer used.

The root cause is consistent: the notebook is treated as the deliverable.

It isn't.

The notebook is evidence that a model can work. The production system is what makes it actually work — reliably, repeatably, and with the visibility needed to know when it's stopped working.

What Production Forecasting Actually Requires

A forecast that lives in production — one that operations teams rely on for headcount planning, inventory decisions, or capacity management — needs five things the notebook doesn't have:

1. Automated, scheduled execution.

Nobody should have to run the notebook by hand every Monday morning. The pipeline needs to run on a schedule, consume fresh data, generate fresh forecasts, and write them to wherever they are consumed, all without human involvement.

2. Data quality validation.

What happens when the input data is missing three days because of an upstream outage? What happens when a feature distribution shifts because a product changed how events are logged? The model needs to fail loudly when inputs are bad, not silently produce garbage forecasts.
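A minimal sketch of what "failing loudly" can look like, in plain Python. The names (`DataQualityError`, `validate_inputs`), the zero-tolerance default for missing days, and the 3-sigma limit are illustrative assumptions rather than part of any particular system, and a real pipeline would likely use a richer shift test than a comparison of means.

```python
from datetime import date, timedelta
from statistics import mean, stdev


class DataQualityError(Exception):
    """Raised when input data is not fit for forecasting (hypothetical name)."""


def find_missing_days(observed_days, start, end):
    """Return every calendar day in [start, end] that is absent from observed_days."""
    observed = set(observed_days)
    days = (start + timedelta(days=i) for i in range((end - start).days + 1))
    return [d for d in days if d not in observed]


def mean_shift_in_sigmas(reference, current):
    """Distance between the two sample means, in reference standard deviations.

    Assumes `reference` has at least two values and `current` at least one.
    """
    spread = stdev(reference)
    if spread == 0:
        return 0.0 if mean(current) == mean(reference) else float("inf")
    return abs(mean(current) - mean(reference)) / spread


def validate_inputs(observed_days, start, end, reference, current,
                    max_missing_days=0, max_shift_sigmas=3.0):
    """Raise DataQualityError rather than let bad inputs reach the model."""
    missing = find_missing_days(observed_days, start, end)
    if len(missing) > max_missing_days:
        raise DataQualityError(
            f"{len(missing)} missing day(s), first: {missing[0]}")
    shift = mean_shift_in_sigmas(reference, current)
    if shift > max_shift_sigmas:
        raise DataQualityError(f"feature mean shifted by {shift:.1f} sigma")
```

Raising an exception, rather than logging a warning and carrying on, means the scheduler marks the run as failed and no forecast is written from bad inputs.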

3. Model performance monitoring.

Forecast accuracy in the week of go-live says little about lasting model quality. What matters is accuracy six months later, when the data distribution has shifted, new seasonal patterns have emerged, and the model may not have been retrained since launch.
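As a rough illustration, accuracy can be tracked per week once actuals arrive. The sketch below uses MAPE (mean absolute percentage error) as the metric; the function names and the `(week, actual, forecast)` record shape are assumptions made for the example, and skipping zero actuals is one convention among several.

```python
def mape(actuals, forecasts):
    """Mean absolute percentage error, in percent. Zero actuals are skipped."""
    pairs = [(a, f) for a, f in zip(actuals, forecasts) if a != 0]
    if not pairs:
        raise ValueError("no non-zero actuals to score against")
    return 100.0 * sum(abs(a - f) / abs(a) for a, f in pairs) / len(pairs)


def weekly_mape(records):
    """Group (week_label, actual, forecast) records and score each week.

    Returns a dict mapping week_label to that week's MAPE in percent.
    """
    by_week = {}
    for week, actual, forecast in records:
        actuals, forecasts = by_week.setdefault(week, ([], []))
        actuals.append(actual)
        forecasts.append(forecast)
    return {week: mape(a, f) for week, (a, f) in by_week.items()}
```

Storing the per-week values, rather than a single running figure, is what makes slow degradation visible on a dashboard.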

4. Automated retraining.

This does not have to mean continuous learning. It means a defined trigger for when the model is retrained on new data. For most operational forecasting use cases, a rolling-window rule works well: when rolling 4-week MAPE (mean absolute percentage error) exceeds a threshold, retrain on the most recent N months of data.
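One way to express that trigger, as a sketch: here "rolling 4-week MAPE" is taken to be the simple average of the last four weekly MAPE values, which is an assumption (pooling all errors across the four weeks is an equally reasonable reading). The function names are hypothetical, and the training window is aligned to the first day of a month purely for simplicity.

```python
from datetime import date


def rolling_mape(weekly_mapes, window=4):
    """Average of the most recent `window` weekly MAPE values, in percent."""
    if len(weekly_mapes) < window:
        return None  # not enough history yet to judge the model
    return sum(weekly_mapes[-window:]) / window


def should_retrain(weekly_mapes, threshold_pct, window=4):
    """True when the rolling MAPE exists and exceeds the agreed threshold."""
    score = rolling_mape(weekly_mapes, window)
    return score is not None and score > threshold_pct


def training_window_start(today, months):
    """First day of the month that lies `months` months before today's month."""
    index = today.year * 12 + (today.month - 1) - months
    return date(index // 12, index % 12 + 1, 1)
```

A scheduled job could call `should_retrain` after each weekly scoring run and, when it returns True, retrain on data from `training_window_start(today, N)` onwards.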

5. A clear failure mode.

What does the system do when the model pipeline fails entirely? In most organisations, the answer should be: fall back to a simpler baseline (moving average, last year same period) and alert the team. A degraded forecast is usually better than no forecast — but only if the operations team knows it's degraded.
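The fallback behaviour can be sketched in a few lines. Everything named here (`ForecastResult`, `forecast_with_fallback`, the 7-observation moving average, the `alert` callback) is illustrative; the point is that the result carries an explicit `degraded` flag so downstream consumers can tell a baseline from a model forecast.

```python
import logging
from dataclasses import dataclass

logger = logging.getLogger("forecast_pipeline")  # hypothetical logger name


@dataclass
class ForecastResult:
    values: list
    source: str      # "model" or "baseline"
    degraded: bool   # True when the fallback produced the values


def moving_average_baseline(history, horizon, window=7):
    """Flat forecast: the mean of the last `window` observations, repeated."""
    if not history:
        raise ValueError("cannot build a baseline from empty history")
    recent = history[-window:]
    level = sum(recent) / len(recent)
    return [level] * horizon


def forecast_with_fallback(model_forecast, history, horizon, alert=logger.error):
    """Try the model; on any failure, alert the team and return a flagged baseline."""
    try:
        return ForecastResult(model_forecast(history, horizon), "model", False)
    except Exception as exc:  # deliberately broad: any model failure triggers the fallback
        alert(f"model pipeline failed, serving baseline instead: {exc!r}")
        values = moving_average_baseline(history, horizon)
        return ForecastResult(values, "baseline", True)
```

In practice `alert` would page or message the team, and the `degraded` flag would be written alongside the forecast so that a dashboard can show it.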

A Case From the Field

In our Volume Forecasting engagement with a global travel organisation, the model development phase — exploration, feature engineering, ensemble model selection — took about 6 weeks.

The production engineering took another 8 weeks.

That ratio surprises clients. It shouldn't.

The production system included:

  • Daily Airflow DAGs ingesting volume data from 4 source systems, with validation checks at each step
  • AWS SageMaker containerised model serving with a daily retraining pipeline
  • Drift detection monitoring feature distributions week-over-week
  • MLflow experiment tracking with full model versioning and audit trail
  • A Power BI operations dashboard showing rolling MAPE by product line, region, and forecast horizon
  • Automated retraining triggered when 4-week rolling MAPE exceeded 12%

Six months post-launch, forecast accuracy was still above 85%. The model had retrained 11 times. Zero manual interventions.

That's what production looks like.
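For readers wondering what week-over-week drift detection on a feature distribution might involve, here is one common approach, sketched under assumptions: the Population Stability Index (PSI) with equal-width bins. This is not necessarily the method used in the engagement described above; the bin count and the epsilon floor are arbitrary choices for the example.

```python
import math


def population_stability_index(reference, current, bins=10, eps=1e-6):
    """PSI between two non-empty samples, binned over the reference range."""
    lo, hi = min(reference), max(reference)
    width = (hi - lo) / bins or 1.0  # constant reference: avoid a zero bin width

    def proportions(sample):
        counts = [0] * bins
        for x in sample:
            # Values outside the reference range are clamped into the edge bins.
            idx = min(max(int((x - lo) / width), 0), bins - 1)
            counts[idx] += 1
        # Floor each share at eps so the logarithm below is always defined.
        return [max(c / len(sample), eps) for c in counts]

    ref_p, cur_p = proportions(reference), proportions(current)
    return sum((c - r) * math.log(c / r) for r, c in zip(ref_p, cur_p))
```

Comparing last week's values of a feature against this week's gives one number per feature per week. A commonly quoted rule of thumb treats a PSI above roughly 0.2 as a shift worth investigating, though the right threshold depends on the feature.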

The Tooling Choice Matters Less Than You Think

Every ML infrastructure conversation eventually becomes a debate about tools: SageMaker vs. Vertex AI, MLflow vs. Weights & Biases, Airflow vs. Prefect.

These choices matter — but less than the practices around them.

We've seen excellent production forecasting systems built on relatively simple infrastructure, and we've seen sophisticated MLOps platforms that nobody trusted because the monitoring wasn't set up correctly.

The practices that matter most:

  • Logging everything. Every prediction, every input feature distribution snapshot, every retraining event. You will need this data when something goes wrong.
  • Defining accuracy thresholds before go-live. What's acceptable MAPE for this use case? What triggers a retraining? Agree this with stakeholders before launch, not after the first bad week.
  • Building for explainability from day one. SHAP values or feature importance aren't a nice-to-have — they're how operations teams build trust in the model. If they can't see why the model predicted what it predicted, they won't use the output when it contradicts their instinct.
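The first of those practices, logging every prediction, needs very little machinery. A minimal sketch, assuming an append-only JSON Lines file as the destination; the field names and the file-based store are illustrative, and a production system might write to a database or object store instead.

```python
import json
from datetime import datetime, timezone


def prediction_record(model_version, horizon_days, features, prediction):
    """Build one JSON-serialisable log record for a single prediction."""
    return {
        "logged_at": datetime.now(timezone.utc).isoformat(),
        "event": "prediction",
        "model_version": model_version,
        "horizon_days": horizon_days,
        "features": features,      # the exact inputs the model saw
        "prediction": prediction,
    }


def append_jsonl(path, record):
    """Append the record to `path` as a single line of JSON."""
    with open(path, "a", encoding="utf-8") as handle:
        handle.write(json.dumps(record, sort_keys=True) + "\n")
```

Recording the model version and the input features next to each prediction is what makes it possible, months later, to reconstruct why a particular forecast looked the way it did.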

The Commercial Value of Reliability

There's a reason organisations invest in production ML infrastructure that seems disproportionate to the underlying model complexity.

A forecast that's 85% accurate and reliable is worth dramatically more than a forecast that's 90% accurate for the first month and then degrades to 70% because nobody is monitoring it.

For the travel organisation, the value wasn't the model — it was the 14-day planning horizon they could now operate with confidence. That confidence came from consistency. And consistency came from the engineering around the model, not the model itself.


DataGravity builds and deploys production ML forecasting systems across operational planning, workforce management, and risk contexts. Contact us if you're facing the gap between a working notebook and a reliable production system.
