Methodology
ML Project Lifecycle
This section covers the full ML project workflow from problem definition through deployment and monitoring, explains why most projects fail at data quality rather than model architecture, and introduces cross-functional requirements and MLOps basics.
Why This Matters
Most ML tutorials skip straight to model.fit(). In practice, model
training is roughly 10-20% of the work. The rest is problem definition,
data wrangling, evaluation design, deployment, and monitoring. Projects
fail when practitioners treat ML as a modeling problem instead of an
engineering problem with a modeling component.
Mental Model
An ML project is a pipeline with nine stages. Each stage can fail, and failures in early stages (especially data) propagate forward and corrupt everything downstream.
The Nine Stages
1. Problem Definition
Before writing any code: what decision does the model support? What is the cost of a wrong prediction? What is the baseline (human performance, simple heuristic, existing system)?
A classification model that achieves 95% accuracy is useless if the business requires 99.9% or if a simple rule already gets 94%.
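As a minimal sketch of this sanity check (the numbers and the `worth_deploying` helper are illustrative, not from the original text), compare any candidate model against the trivial majority-class baseline before investing in modeling:

```python
import numpy as np

rng = np.random.default_rng(0)
y_true = rng.integers(0, 2, size=1000)  # hypothetical binary labels

# Baseline: always predict the majority class.
majority = int(np.bincount(y_true).argmax())
baseline_acc = float(np.mean(y_true == majority))

def worth_deploying(model_acc, baseline_acc, margin=0.01):
    """A model is only interesting if it beats the baseline by a real margin."""
    return model_acc > baseline_acc + margin
```

The same comparison applies to human performance or an existing rule-based system: substitute that accuracy for `baseline_acc`.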
2. Data Collection
Where does the data come from? How is it labeled? What are the selection biases? Is the labeling process reliable?
Common failure modes: labels are noisy (human annotators disagree), the data distribution shifts between collection and deployment, or the dataset is too small for the chosen model class.
3. Exploratory Data Analysis
Look at the data before modeling. Summary statistics, distributions, correlations, class balance, missing value patterns. This step prevents you from training a model on garbage.
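A minimal EDA pass over a toy table might look like the following (column names and the injected missingness are invented for illustration):

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(42)
df = pd.DataFrame({
    "amount": rng.lognormal(3.0, 1.0, size=500),
    "age": rng.integers(18, 80, size=500).astype(float),
    "label": rng.integers(0, 2, size=500),
})
# Simulate a data-quality problem: 5% of ages are missing.
df.loc[df.sample(frac=0.05, random_state=0).index, "age"] = np.nan

summary = df.describe()                            # distributions, outliers
class_balance = df["label"].value_counts(normalize=True)
missing = df.isna().mean()                         # fraction missing per column
corr = df.corr(numeric_only=True)                  # pairwise correlations
```

Each of these four outputs can veto the modeling plan: skewed distributions suggest transforms, severe class imbalance changes the metric, and missingness patterns may reveal a broken collection pipeline.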
4. Feature Engineering
Transform raw data into features the model can use. Domain knowledge matters more than model sophistication here. A good feature set with logistic regression often beats a neural network on raw features.
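A toy illustration of that claim, under the assumption that the target depends on an interaction between two inputs (the data here is synthetic; the "domain knowledge" is knowing that the interaction matters):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(7)
n = 2000
x1 = rng.normal(size=n)
x2 = rng.normal(size=n)
y = (x1 * x2 > 0).astype(int)  # XOR-like target: not linearly separable in raw inputs

X_raw = np.column_stack([x1, x2])
X_eng = np.column_stack([x1, x2, x1 * x2])  # add the interaction as a feature

accs = {}
for name, X in [("raw", X_raw), ("engineered", X_eng)]:
    Xtr, Xte, ytr, yte = train_test_split(X, y, random_state=0)
    accs[name] = LogisticRegression(max_iter=2000).fit(Xtr, ytr).score(Xte, yte)
# Logistic regression is near chance on raw inputs and near perfect
# once the engineered feature encodes the right structure.
```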
5. Model Selection
Choose a model class appropriate for the problem. Considerations: data size, feature type (tabular, image, text), latency requirements, interpretability needs. For tabular data with fewer than 10K rows, gradient boosting usually beats deep learning.
6. Training
Fit the model. This includes hyperparameter tuning, regularization choices, and convergence monitoring. Use a validation set to select hyperparameters. Never tune on the test set.
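A sketch of the train/validation/test discipline described above, using a small regularization grid (the grid values and split sizes are illustrative):

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=1500, n_features=20, random_state=0)
# Hold out a test set first; it is touched exactly once, at the end.
X_trval, X_test, y_trval, y_test = train_test_split(X, y, test_size=0.2, random_state=0)
X_tr, X_val, y_tr, y_val = train_test_split(X_trval, y_trval, test_size=0.25, random_state=0)

best_C, best_val = None, -1.0
for C in [0.01, 0.1, 1.0, 10.0]:  # regularization strengths to try
    val_acc = LogisticRegression(C=C, max_iter=1000).fit(X_tr, y_tr).score(X_val, y_val)
    if val_acc > best_val:
        best_C, best_val = C, val_acc

# Refit on train+validation with the chosen hyperparameter, then report once.
final = LogisticRegression(C=best_C, max_iter=1000).fit(X_trval, y_trval)
test_acc = final.score(X_test, y_test)
```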
7. Evaluation
Measure performance on a held-out test set. Use metrics aligned with the business objective. Accuracy is rarely the right metric; precision, recall, F1, or calibration error are usually more informative.
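A quick demonstration of why accuracy misleads on imbalanced problems (the labels below are synthetic, with a 5% positive rate):

```python
import numpy as np
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score

rng = np.random.default_rng(1)
y_true = (rng.random(1000) < 0.05).astype(int)  # ~5% positives
y_pred = np.zeros_like(y_true)                  # degenerate "always negative" model

acc = accuracy_score(y_true, y_pred)                         # looks great, ~0.95
rec = recall_score(y_true, y_pred, zero_division=0)          # 0.0: misses every positive
prec = precision_score(y_true, y_pred, zero_division=0)      # undefined -> 0
f1 = f1_score(y_true, y_pred, zero_division=0)               # 0.0
```

The degenerate model scores ~95% accuracy while catching zero positive cases; recall and F1 expose the failure immediately.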
8. Deployment
Serve the model in production. This involves model serialization, API design, latency optimization, and infrastructure provisioning. The gap between a Jupyter notebook and a production system is large.
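The smallest piece of that gap is serialization; as a sketch (real systems add an API layer, input validation, and a model registry on top), a round-trip through `pickle` checks that the serving process sees the same model the training process produced:

```python
import pickle
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

X, y = make_classification(n_samples=200, random_state=0)
model = LogisticRegression(max_iter=1000).fit(X, y)

blob = pickle.dumps(model)    # in practice: written to versioned storage
served = pickle.loads(blob)   # what the serving process would do at startup

same_behavior = bool((served.predict(X) == model.predict(X)).all())
```

Note that pickle is Python- and version-specific; production systems often prefer formats like ONNX for cross-runtime portability.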
9. Monitoring
Track model performance after deployment. Data distributions shift over time. A model trained on 2023 data may degrade on 2025 data. Detect drift, retrain on schedule, and maintain fallback systems.
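One common drift check compares a feature's production distribution against its training distribution with a two-sample Kolmogorov-Smirnov test; the shift size and p-value threshold below are illustrative:

```python
import numpy as np
from scipy.stats import ks_2samp

rng = np.random.default_rng(3)
train_feature = rng.normal(0.0, 1.0, size=5000)  # distribution at training time
prod_feature = rng.normal(0.5, 1.0, size=5000)   # shifted distribution in production

stat, p_value = ks_2samp(train_feature, prod_feature)
drift_detected = p_value < 0.01  # tiny p-value: distributions differ, investigate
```

Distribution tests flag *input* drift even before labels arrive; once delayed labels come in, the same comparison applies to the model's error rate.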
Why Most Projects Fail at Data Quality
Data Quality Dominance
Statement
For a fixed model architecture and training procedure, the achievable test error is lower-bounded by the label noise rate $\eta$. If labels are wrong with probability $\eta$, no model can achieve population accuracy greater than $1 - \eta$ on the correctly-labeled distribution, regardless of sample size.
Intuition
If 10% of your labels are wrong, you cannot get below 10% error (on the correctly-labeled distribution) by changing the model. Improving label quality from 90% to 99% does more than any model architecture change.
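This ceiling is easy to see in simulation: even a perfect predictor of the true labels scores only $1 - \eta$ against labels flipped with probability $\eta$ (the noise rate below is illustrative):

```python
import numpy as np

rng = np.random.default_rng(5)
n, noise_rate = 100_000, 0.10
y_true = rng.integers(0, 2, size=n)
flip = rng.random(n) < noise_rate
y_noisy = np.where(flip, 1 - y_true, y_true)  # ~10% of labels flipped

# A perfect predictor of the true labels, scored against the noisy labels:
perfect_acc_on_noisy = float(np.mean(y_true == y_noisy))  # ~= 1 - noise_rate
```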
Proof Sketch
Let $f^*$ be the Bayes-optimal classifier for the true distribution $P$, and let $\tilde{P}$ be the noisy-label distribution with flip rate $\eta$. A model trained on noisy labels minimizes risk under $\tilde{P}$, not $P$. Any $f$ that fits the noisy labels exactly must disagree with the true labels on the flipped fraction, so its population risk under the true distribution satisfies $R_P(f) \geq \eta$, because the model is incentivized to match the noise.
Why It Matters
This is why most ML projects should spend 60-80% of their effort on data: collection, cleaning, labeling, and validation. Switching from ResNet to Vision Transformer rarely matters if your labels are 85% accurate.
Failure Mode
This bound assumes the noise is label-dependent and adversarial to the true task. Symmetric label noise (uniform random flips) can sometimes be corrected with noise-robust loss functions. The bound is tighter when noise is class-conditional and correlated with features.
Cross-Functional Requirements
An ML system must satisfy requirements beyond accuracy:
- Latency: prediction time per request. Real-time applications need millisecond-scale responses; batch applications can tolerate minutes.
- Throughput: predictions per second. Scales with hardware and batching.
- Cost: compute cost per prediction. Larger models cost more to serve.
- Fairness: performance across demographic groups. A model with 95% overall accuracy but 70% accuracy on a minority group may be unacceptable.
- Privacy: does the model leak training data? Differential privacy and federated learning address this.
MLOps Basics
MLOps
MLOps is the set of practices for deploying and maintaining ML models in production reliably and efficiently. It extends DevOps principles to ML systems, adding version control for data and models, experiment tracking, automated retraining, and model monitoring.
Key MLOps components:
- CI/CD for models: automated testing of model quality on every code change. Tests include data validation, model performance regression tests, and integration tests.
- Model registry: versioned storage of trained models with metadata (training data version, hyperparameters, metrics). Enables rollback.
- A/B testing: serve the new model to a fraction of traffic and compare against the current model on production metrics.
- Feature stores: centralized computation and serving of features, ensuring consistency between training and inference.
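The A/B-testing component above is often implemented with deterministic, hash-based traffic splitting, so each user consistently sees the same model variant; this is a sketch, with the split fraction and helper name chosen for illustration:

```python
import hashlib

def assign_variant(user_id: str, candidate_fraction: float = 0.1) -> str:
    """Stable assignment: hash the user id into [0, 1) and compare to the split."""
    h = hashlib.sha256(user_id.encode()).digest()
    bucket = int.from_bytes(h[:8], "big") / 2**64
    return "candidate" if bucket < candidate_fraction else "control"

# Roughly 10% of traffic should land in the candidate group.
counts = {"candidate": 0, "control": 0}
for i in range(10_000):
    counts[assign_variant(f"user-{i}")] += 1
```

Hashing rather than random sampling makes assignments reproducible across requests and services, which matters when comparing production metrics per user.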
Common Confusions
ML projects are not software projects with a model inside
Standard software is deterministic: given the same input, it produces the same output. ML systems are stochastic, data-dependent, and degrade silently. Testing, deployment, and monitoring all require structurally different approaches.
More data is not always better
More noisy data can hurt. More data from a different distribution than your deployment target hurts. Data quality (correct labels, representative distribution) matters more than data quantity beyond a sufficient threshold.
Key Takeaways
- Problem definition and data quality determine the ceiling; model choice determines how close you get to it
- Evaluation must use metrics aligned with the actual business objective
- Deployment and monitoring are where most engineering effort goes in production systems
- The full lifecycle is iterative: monitoring reveals problems that send you back to data collection or feature engineering
Exercises
Problem
You are building a fraud detection system. The dataset has 1% fraud cases and 99% legitimate transactions. A model that always predicts "legitimate" achieves 99% accuracy. Why is this model useless, and what metric should you use instead?
Problem
Your model achieves 92% accuracy on the test set, but after deployment, production accuracy drops to 84% within three months. List three possible causes and describe how you would diagnose each.
References
Canonical:
- Sculley et al., "Hidden Technical Debt in Machine Learning Systems" (2015), NeurIPS
- Polyzotis et al., "Data Management Challenges in Production Machine Learning" (2017), SIGMOD
Current:
- Google, "Rules of ML" (2023), Section on ML system design
- Huyen, Designing Machine Learning Systems (2022), Chapters 1-3
- Hastie, Tibshirani, Friedman, The Elements of Statistical Learning (2009), Chapters 7-8
- Shalev-Shwartz & Ben-David, Understanding Machine Learning (2014), Chapters 11-14
Next Topics
- Train-test split and data leakage: preventing information contamination
- Exploratory data analysis: understanding your data before modeling
- Experiment tracking and tooling: managing ML experiments systematically
Last reviewed: April 2026