
Methodology

ML Project Lifecycle

The full ML project workflow from problem definition through deployment and monitoring. Why most projects fail at data quality, not model architecture. Cross-functional requirements and MLOps basics.


Why This Matters

Most ML tutorials skip straight to model.fit(). In practice, model training is roughly 10-20% of the work. The rest is problem definition, data wrangling, evaluation design, deployment, and monitoring. Projects fail when practitioners treat ML as a modeling problem instead of an engineering problem with a modeling component.

Mental Model

An ML project is a pipeline with nine stages. Each stage can fail, and failures in early stages (especially data) propagate forward and corrupt everything downstream.

The Nine Stages

1. Problem Definition

Before writing any code: what decision does the model support? What is the cost of a wrong prediction? What is the baseline (human performance, simple heuristic, existing system)?

A classification model that achieves 95% accuracy is useless if the business requires 99.9% or if a simple rule already gets 94%.
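Computing the baseline before any modeling takes a few lines. A minimal sketch, using hypothetical validation labels, of the majority-class baseline every model must beat:

```python
# Hypothetical validation labels: 94 legitimate (0) and 6 positive (1) cases.
labels = [0] * 94 + [1] * 6

# Majority-class baseline: always predict the most common label.
majority = max(set(labels), key=labels.count)
baseline_acc = sum(1 for y in labels if y == majority) / len(labels)
print(f"majority-class baseline accuracy: {baseline_acc:.2f}")
```

A model reporting 95% accuracy on this data beats the trivial baseline by a single point, which reframes how impressive that number is.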

2. Data Collection

Where does the data come from? How is it labeled? What are the selection biases? Is the labeling process reliable?

Common failure modes: labels are noisy (human annotators disagree), the data distribution shifts between collection and deployment, or the dataset is too small for the chosen model class.
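Labeling reliability can be quantified directly. A small sketch, with two hypothetical annotators, of Cohen's kappa, which corrects raw agreement for agreement expected by chance:

```python
# Two hypothetical annotators labeling the same 10 items (binary labels).
a = [1, 1, 0, 1, 0, 0, 1, 0, 1, 1]
b = [1, 0, 0, 1, 0, 1, 1, 0, 1, 0]

# Observed agreement rate.
p_o = sum(x == y for x, y in zip(a, b)) / len(a)

# Chance agreement, from each annotator's marginal label rates.
pa, pb = sum(a) / len(a), sum(b) / len(b)
p_e = pa * pb + (1 - pa) * (1 - pb)

# Cohen's kappa: 1.0 is perfect agreement, 0.0 is chance-level.
kappa = (p_o - p_e) / (1 - p_e)
```

Here raw agreement is 70%, but kappa is only 0.4: much of that agreement is what two coin-flipping annotators would produce anyway.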

3. Exploratory Data Analysis

Look at the data before modeling. Summary statistics, distributions, correlations, class balance, missing value patterns. This step prevents you from training a model on garbage.
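A minimal EDA pass on hypothetical raw records, checking two of the items above (class balance and missing-value patterns) with the standard library:

```python
from collections import Counter

# Hypothetical raw records; None marks a missing value.
rows = [
    {"age": 34, "income": 52000, "label": 0},
    {"age": None, "income": 48000, "label": 0},
    {"age": 51, "income": None, "label": 1},
    {"age": 29, "income": 61000, "label": 0},
]

# Class balance: imbalance changes both metric and model choices.
balance = Counter(r["label"] for r in rows)

# Missing-value count per column: patterns here often reveal
# collection bugs before any model sees the data.
missing = {col: sum(1 for r in rows if r[col] is None) for col in rows[0]}
print(balance, missing)
```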

4. Feature Engineering

Transform raw data into features the model can use. Domain knowledge matters more than model sophistication here. A good feature set with logistic regression often beats a neural network on raw features.
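A sketch of what encoding domain knowledge looks like, using a hypothetical fraud-style transaction record and made-up field names. Each derived feature expresses a belief about the problem that no raw column states directly:

```python
def make_features(txn):
    """Derive features from a raw transaction; field names are illustrative."""
    return {
        # Ratio to the customer's own history separates classes better
        # than the raw amount, which varies wildly across customers.
        "amount_over_avg": txn["amount"] / txn["avg_amount_30d"],
        # Domain belief: late-night activity is a useful signal.
        "is_night": 1 if txn["hour"] < 6 else 0,
        # First interaction with a merchant is riskier than a repeat one.
        "new_merchant": 1 if txn["merchant_txn_count"] == 0 else 0,
    }

feats = make_features({"amount": 500.0, "avg_amount_30d": 100.0,
                       "hour": 3, "merchant_txn_count": 0})
```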

5. Model Selection

Choose a model class appropriate for the problem. Considerations: data size, feature type (tabular, image, text), latency requirements, interpretability needs. For tabular data with fewer than 10K rows, gradient boosting usually beats deep learning.

6. Training

Fit the model. This includes hyperparameter tuning, regularization choices, and convergence monitoring. Use a validation set to select hyperparameters. Never tune on the test set.
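The discipline above starts with a proper three-way split. A minimal sketch (the 70/15/15 ratios are a common convention, not a rule): hyperparameters are selected on the validation set, and the test set is touched exactly once, at the end:

```python
import random

random.seed(0)
data = list(range(1000))  # stand-in for indexed examples
random.shuffle(data)

n = len(data)
train = data[: int(0.70 * n)]             # fit model parameters here
val = data[int(0.70 * n): int(0.85 * n)]  # tune hyperparameters here
test = data[int(0.85 * n):]               # evaluate once, never tune
```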

7. Evaluation

Measure performance on a held-out test set. Use metrics aligned with the business objective. Accuracy is rarely the right metric; precision, recall, F1, or calibration error are usually more informative.
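A worked example of why accuracy misleads on imbalanced data, with precision, recall, and F1 computed from confusion counts (pure stdlib; the labels are hypothetical):

```python
def prf1(y_true, y_pred):
    """Precision, recall, and F1 for binary labels, from confusion counts."""
    tp = sum(t == 1 and p == 1 for t, p in zip(y_true, y_pred))
    fp = sum(t == 0 and p == 1 for t, p in zip(y_true, y_pred))
    fn = sum(t == 1 and p == 0 for t, p in zip(y_true, y_pred))
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return precision, recall, f1

# Imbalanced example: accuracy looks fine while recall is poor.
y_true = [1, 1, 1, 0, 0, 0, 0, 0, 0, 0]
y_pred = [1, 0, 0, 0, 0, 0, 0, 0, 0, 0]
accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
precision, recall, f1 = prf1(y_true, y_pred)
```

Accuracy is 80%, yet the model finds only one of three positives (recall 33%), which is what the business likely cares about.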

8. Deployment

Serve the model in production. This involves model serialization, API design, latency optimization, and infrastructure provisioning. The gap between a Jupyter notebook and a production system is large.
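The first step of that gap is serialization. A minimal sketch using the stdlib pickle module (a dict stands in for a fitted model; real deployments would also record metadata such as the training data version alongside the artifact):

```python
import os
import pickle
import tempfile

# Stand-in for a fitted model object; any picklable object works the same way.
model = {"weights": [0.2, -1.3], "bias": 0.5, "version": "2025-01-15"}

# Serialize to disk at training time...
path = os.path.join(tempfile.mkdtemp(), "model.pkl")
with open(path, "wb") as f:
    pickle.dump(model, f)

# ...and restore it in the serving process.
with open(path, "rb") as f:
    restored = pickle.load(f)
```

Note that pickle only round-trips safely between trusted processes with matching library versions; cross-language or untrusted serving usually calls for a dedicated format.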

9. Monitoring

Track model performance after deployment. Data distributions shift over time. A model trained on 2023 data may degrade on 2025 data. Detect drift, retrain on schedule, and maintain fallback systems.
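One common drift check is the Population Stability Index, which compares the production distribution of a feature against its training distribution. A self-contained sketch on simulated data (the 0.2 alert threshold is a widely used rule of thumb, not a theorem):

```python
import math
import random

random.seed(0)

def psi(expected, actual, bins=10):
    """Population Stability Index between a reference sample (training)
    and a fresh sample (production). Values above ~0.2 usually signal drift."""
    lo, hi = min(expected), max(expected)
    width = (hi - lo) / bins

    def fractions(sample):
        counts = [0] * bins
        for x in sample:
            i = int((x - lo) / width)
            counts[max(0, min(i, bins - 1))] += 1  # clamp out-of-range values
        # Light smoothing avoids log(0) on empty bins.
        return [(c + 0.5) / (len(sample) + 0.5 * bins) for c in counts]

    e, a = fractions(expected), fractions(actual)
    return sum((ai - ei) * math.log(ai / ei) for ei, ai in zip(e, a))

train = [random.gauss(0, 1) for _ in range(5000)]   # training-time feature
same = [random.gauss(0, 1) for _ in range(5000)]    # no drift
drift = [random.gauss(1, 1) for _ in range(5000)]   # mean shifted in production
```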

Why Most Projects Fail at Data Quality

Proposition

Data Quality Dominance

Statement

For a fixed model architecture and training procedure, the achievable test error is lower-bounded by the label noise rate η. If labels are wrong with probability η, no model can achieve population accuracy greater than 1 − η on the correctly-labeled distribution, regardless of sample size.

Intuition

If 10% of your labels are wrong, you cannot get below 10% error (on the correctly-labeled distribution) by changing the model. Improving label quality from 90% to 99% does more than any model architecture change.
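The cap is easy to see by simulation. A sketch that flips 10% of hypothetical labels and scores a "perfect" model, i.e. one that reproduces every noisy label exactly:

```python
import random

random.seed(0)

n, eta = 100_000, 0.10
true_labels = [random.randint(0, 1) for _ in range(n)]

# Flip each label with probability eta, simulating annotator error.
noisy_labels = [y ^ 1 if random.random() < eta else y
                for y in true_labels]

# Even a model that matches every noisy label exactly...
acc_on_truth = sum(t == p for t, p in zip(true_labels, noisy_labels)) / n
# ...scores only about 1 - eta against the correct labels.
```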

Proof Sketch

Let f* be the Bayes-optimal classifier for the true distribution. A model trained on noisy labels minimizes risk under the noisy distribution, not the true one. For any hypothesis h that fits the noisy labels, the population risk under the true distribution satisfies R_true(h) ≥ η, because h disagrees with the true labels on exactly the η fraction of points where the noise flipped them.

Why It Matters

This is why most ML projects should spend 60-80% of their effort on data: collection, cleaning, labeling, and validation. Switching from ResNet to Vision Transformer rarely matters if your labels are 85% accurate.

Failure Mode

This bound applies to models that fit the noisy labels. Symmetric label noise (uniform random flips) can sometimes be corrected with noise-robust loss functions, which avoid fitting the noise. The bound is tighter when noise is class-conditional and correlated with features.

Cross-Functional Requirements

An ML system must satisfy requirements beyond accuracy:

  • Latency: prediction time per request. Real-time applications need < 100 ms. Batch applications can tolerate minutes.
  • Throughput: predictions per second. Scales with hardware and batching.
  • Cost: compute cost per prediction. Larger models cost more to serve.
  • Fairness: performance across demographic groups. A model with 95% overall accuracy but 70% accuracy on a minority group may be unacceptable.
  • Privacy: does the model leak training data? Differential privacy and federated learning address this.
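The latency requirement above is checked empirically, not assumed. A minimal measurement sketch with the stdlib timer (the `predict` function is a stand-in for a real model call); the key habit is reporting percentiles rather than the mean, since tail latency is what violates a service-level objective:

```python
import time

def predict(features):
    # Stand-in for a real model's forward pass.
    return sum(features) > 0

latencies_ms = []
for _ in range(1000):
    t0 = time.perf_counter()
    predict([0.1] * 100)
    latencies_ms.append((time.perf_counter() - t0) * 1000)

# Report percentiles, not the mean: a fast median can hide a slow tail.
latencies_ms.sort()
p50 = latencies_ms[len(latencies_ms) // 2]
p95 = latencies_ms[int(len(latencies_ms) * 0.95)]
```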

MLOps Basics

Definition

MLOps

MLOps is the set of practices for deploying and maintaining ML models in production reliably and efficiently. It extends DevOps principles to ML systems, adding version control for data and models, experiment tracking, automated retraining, and model monitoring.

Key MLOps components:

  • CI/CD for models: automated testing of model quality on every code change. Tests include data validation, model performance regression tests, and integration tests.
  • Model registry: versioned storage of trained models with metadata (training data version, hyperparameters, metrics). Enables rollback.
  • A/B testing: serve the new model to a fraction of traffic and compare against the current model on production metrics.
  • Feature stores: centralized computation and serving of features, ensuring consistency between training and inference.
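A sketch of the kind of record a model registry stores; the field names here are illustrative, not any particular registry's schema. A content hash over the metadata makes the entry tamper-evident and lets rollback target an exact version:

```python
import hashlib
import json

# Hypothetical registry entry: enough metadata to reproduce and roll back.
entry = {
    "model_name": "fraud-detector",
    "version": 3,
    "data_version": "2025-06-01",
    "hyperparameters": {"max_depth": 6, "n_estimators": 200},
    "metrics": {"val_f1": 0.81},
}

# Deterministic fingerprint of the metadata (sort_keys makes it stable).
entry["fingerprint"] = hashlib.sha256(
    json.dumps(entry, sort_keys=True).encode()
).hexdigest()[:12]
```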

Common Confusions

Watch Out

ML projects are not software projects with a model inside

Standard software is deterministic: given the same input, it produces the same output. ML systems are stochastic, data-dependent, and degrade silently. Testing, deployment, and monitoring all require structurally different approaches.

Watch Out

More data is not always better

More noisy data can hurt. More data from a different distribution than your deployment target hurts. Data quality (correct labels, representative distribution) matters more than data quantity beyond a sufficient threshold.

Key Takeaways

  • Problem definition and data quality determine the ceiling; model choice determines how close you get to it
  • Evaluation must use metrics aligned with the actual business objective
  • Deployment and monitoring are where most engineering effort goes in production systems
  • The full lifecycle is iterative: monitoring reveals problems that send you back to data collection or feature engineering

Exercises

ExerciseCore

Problem

You are building a fraud detection system. The dataset has 1% fraud cases and 99% legitimate transactions. A model that always predicts "legitimate" achieves 99% accuracy. Why is this model useless, and what metric should you use instead?

ExerciseAdvanced

Problem

Your model achieves 92% accuracy on the test set, but after deployment, production accuracy drops to 84% within three months. List three possible causes and describe how you would diagnose each.

References

Canonical:

  • Sculley et al., "Hidden Technical Debt in Machine Learning Systems" (2015), NeurIPS
  • Polyzotis et al., "Data Management Challenges in Production Machine Learning" (2017), SIGMOD

Current:

  • Google, "Rules of ML" (2023), Section on ML system design
  • Huyen, Designing Machine Learning Systems (2022), Chapters 1-3
  • Hastie, Tibshirani, Friedman, The Elements of Statistical Learning (2009), Chapters 7-8
  • Shalev-Shwartz & Ben-David, Understanding Machine Learning (2014), Chapters 11-14

Next Topics

Last reviewed: April 2026
