Methodology
ML Project Lifecycle
This section covers the full ML project workflow from problem definition through deployment and monitoring, explains why most projects fail at data quality rather than model architecture, and introduces cross-functional requirements and MLOps basics.
Why This Matters
Most ML tutorials skip straight to model.fit(). In practice, model
training is roughly 10-20% of the work. The rest is problem definition,
data wrangling, evaluation design, deployment, and monitoring. Projects
fail when practitioners treat ML as a modeling problem instead of an
engineering problem with a modeling component.
Mental Model
An ML project is a pipeline with nine stages. Each stage can fail, and failures in early stages (especially data) propagate forward and corrupt everything downstream.
The Nine Stages
1. Problem Definition
Before writing any code: what decision does the model support? What is the cost of a wrong prediction? What is the baseline (human performance, simple heuristic, existing system)?
A classification model that achieves 95% accuracy is useless if the business requires 99.9% or if a simple rule already gets 94%.
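As a minimal sketch of this sanity check (the numbers and the `worth_deploying` helper are illustrative, not from the original text), compare any candidate model against the trivial majority-class baseline before investing in modeling:

```python
import numpy as np

rng = np.random.default_rng(0)
y_true = rng.integers(0, 2, size=1000)  # hypothetical binary labels

# Baseline: always predict the majority class.
majority = int(np.bincount(y_true).argmax())
baseline_acc = float(np.mean(y_true == majority))

def worth_deploying(model_acc, baseline_acc, margin=0.01):
    """A model is only interesting if it beats the baseline by a real margin."""
    return model_acc > baseline_acc + margin
```

The same comparison applies to human performance or an existing rule-based system: substitute that accuracy for `baseline_acc`.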
2. Data Collection
Where does the data come from? How is it labeled? What are the selection biases? Is the labeling process reliable?
Common failure modes: labels are noisy (human annotators disagree), the data distribution shifts between collection and deployment, or the dataset is too small for the chosen model class.
3. Exploratory Data Analysis
Look at the data before modeling. Summary statistics, distributions, correlations, class balance, missing value patterns. This step prevents you from training a model on garbage.
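A minimal EDA pass over a toy table might look like the following (column names and the injected missingness are invented for illustration):

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(42)
df = pd.DataFrame({
    "amount": rng.lognormal(3.0, 1.0, size=500),
    "age": rng.integers(18, 80, size=500).astype(float),
    "label": rng.integers(0, 2, size=500),
})
# Simulate a data-quality problem: 5% of ages are missing.
df.loc[df.sample(frac=0.05, random_state=0).index, "age"] = np.nan

summary = df.describe()                            # distributions, outliers
class_balance = df["label"].value_counts(normalize=True)
missing = df.isna().mean()                         # fraction missing per column
corr = df.corr(numeric_only=True)                  # pairwise correlations
```

Each of these four outputs can veto the modeling plan: skewed distributions suggest transforms, severe class imbalance changes the metric, and missingness patterns may reveal a broken collection pipeline.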
4. Feature Engineering
Transform raw data into features the model can use. Domain knowledge matters more than model sophistication here. A good feature set with logistic regression often beats a neural network on raw features.
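A toy illustration of that claim, under the assumption that the target depends on an interaction between two inputs (the data here is synthetic; the "domain knowledge" is knowing that the interaction matters):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(7)
n = 2000
x1 = rng.normal(size=n)
x2 = rng.normal(size=n)
y = (x1 * x2 > 0).astype(int)  # XOR-like target: not linearly separable in raw inputs

X_raw = np.column_stack([x1, x2])
X_eng = np.column_stack([x1, x2, x1 * x2])  # add the interaction as a feature

accs = {}
for name, X in [("raw", X_raw), ("engineered", X_eng)]:
    Xtr, Xte, ytr, yte = train_test_split(X, y, random_state=0)
    accs[name] = LogisticRegression(max_iter=2000).fit(Xtr, ytr).score(Xte, yte)
# Logistic regression is near chance on raw inputs and near perfect
# once the engineered feature encodes the right structure.
```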
5. Model Selection
Choose a model class appropriate for the problem. Considerations: data size, feature type (tabular, image, text), latency requirements, interpretability needs. For tabular data with fewer than 10K rows, gradient boosting usually beats deep learning.
6. Training
Fit the model. This includes hyperparameter tuning, regularization choices, and convergence monitoring. Use a validation set to select hyperparameters. Never tune on the test set.
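A sketch of the train/validation/test discipline described above, using a small regularization grid (the grid values and split sizes are illustrative):

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=1500, n_features=20, random_state=0)
# Hold out a test set first; it is touched exactly once, at the end.
X_trval, X_test, y_trval, y_test = train_test_split(X, y, test_size=0.2, random_state=0)
X_tr, X_val, y_tr, y_val = train_test_split(X_trval, y_trval, test_size=0.25, random_state=0)

best_C, best_val = None, -1.0
for C in [0.01, 0.1, 1.0, 10.0]:  # regularization strengths to try
    val_acc = LogisticRegression(C=C, max_iter=1000).fit(X_tr, y_tr).score(X_val, y_val)
    if val_acc > best_val:
        best_C, best_val = C, val_acc

# Refit on train+validation with the chosen hyperparameter, then report once.
final = LogisticRegression(C=best_C, max_iter=1000).fit(X_trval, y_trval)
test_acc = final.score(X_test, y_test)
```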
7. Evaluation
Measure performance on a held-out test set. Use metrics aligned with the business objective. Accuracy is rarely the right metric; precision, recall, F1, or calibration error are usually more informative.
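A quick demonstration of why accuracy misleads on imbalanced problems (the labels below are synthetic, with a 5% positive rate):

```python
import numpy as np
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score

rng = np.random.default_rng(1)
y_true = (rng.random(1000) < 0.05).astype(int)  # ~5% positives
y_pred = np.zeros_like(y_true)                  # degenerate "always negative" model

acc = accuracy_score(y_true, y_pred)                         # looks great, ~0.95
rec = recall_score(y_true, y_pred, zero_division=0)          # 0.0: misses every positive
prec = precision_score(y_true, y_pred, zero_division=0)      # undefined -> 0
f1 = f1_score(y_true, y_pred, zero_division=0)               # 0.0
```

The degenerate model scores ~95% accuracy while catching zero positive cases; recall and F1 expose the failure immediately.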
8. Deployment
Serve the model in production. This involves model serialization, API design, latency optimization, and infrastructure provisioning. The gap between a Jupyter notebook and a production system is large.
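The smallest piece of that gap is serialization; as a sketch (real systems add an API layer, input validation, and a model registry on top), a round-trip through `pickle` checks that the serving process sees the same model the training process produced:

```python
import pickle
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

X, y = make_classification(n_samples=200, random_state=0)
model = LogisticRegression(max_iter=1000).fit(X, y)

blob = pickle.dumps(model)    # in practice: written to versioned storage
served = pickle.loads(blob)   # what the serving process would do at startup

same_behavior = bool((served.predict(X) == model.predict(X)).all())
```

Note that pickle is Python- and version-specific; production systems often prefer formats like ONNX for cross-runtime portability.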
9. Monitoring
Track model performance after deployment. Data distributions shift over time. A model trained on 2023 data may degrade on 2025 data. Detect drift, retrain on schedule, and maintain fallback systems.
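One common drift check compares a feature's production distribution against its training distribution with a two-sample Kolmogorov-Smirnov test; the shift size and p-value threshold below are illustrative:

```python
import numpy as np
from scipy.stats import ks_2samp

rng = np.random.default_rng(3)
train_feature = rng.normal(0.0, 1.0, size=5000)  # distribution at training time
prod_feature = rng.normal(0.5, 1.0, size=5000)   # shifted distribution in production

stat, p_value = ks_2samp(train_feature, prod_feature)
drift_detected = p_value < 0.01  # tiny p-value: distributions differ, investigate
```

Distribution tests flag *input* drift even before labels arrive; once delayed labels come in, the same comparison applies to the model's error rate.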
Why Most Projects Fail at Data Quality
Data Quality Dominance
Statement
For a fixed model architecture and training procedure, the achievable test error is lower-bounded by the label noise rate $\eta$. If labels are wrong with probability $\eta$, no model can achieve population accuracy greater than $1 - \eta$ on the correctly-labeled distribution, regardless of sample size.
Intuition
If 10% of your labels are wrong, you cannot get below 10% error (on the correctly-labeled distribution) by changing the model. Improving label quality from 90% to 99% does more than any model architecture change.
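This ceiling is easy to see in simulation: even a perfect predictor of the true labels scores only $1 - \eta$ against labels flipped with probability $\eta$ (the noise rate below is illustrative):

```python
import numpy as np

rng = np.random.default_rng(5)
n, noise_rate = 100_000, 0.10
y_true = rng.integers(0, 2, size=n)
flip = rng.random(n) < noise_rate
y_noisy = np.where(flip, 1 - y_true, y_true)  # ~10% of labels flipped

# A perfect predictor of the true labels, scored against the noisy labels:
perfect_acc_on_noisy = float(np.mean(y_true == y_noisy))  # ~= 1 - noise_rate
```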
Proof Sketch
Let $f^*$ be the Bayes-optimal classifier for the true distribution $P$, and let $\tilde{P}$ be the noisy-label distribution with flip rate $\eta$. A model trained on noisy labels minimizes risk under $\tilde{P}$, not $P$. Any $f$ that fits the noisy labels exactly must disagree with the true labels on the flipped fraction, so its population risk under the true distribution satisfies $R_P(f) \geq \eta$, because the model is incentivized to match the noise.
Why It Matters
This is why most ML projects should spend 60-80% of their effort on data: collection, cleaning, labeling, and validation. Switching from ResNet to Vision Transformer rarely matters if your labels are 85% accurate.
Failure Mode
This bound assumes the noise is label-dependent and adversarial to the true task. Symmetric label noise (uniform random flips) can sometimes be corrected with noise-robust loss functions. The bound is tighter when noise is class-conditional and correlated with features.
Cross-Functional Requirements
An ML system must satisfy requirements beyond accuracy:
- Latency: prediction time per request. Real-time applications need millisecond-scale responses; batch applications can tolerate minutes.
- Throughput: predictions per second. Scales with hardware and batching.
- Cost: compute cost per prediction. Larger models cost more to serve.
- Fairness: performance across demographic groups. A model with 95% overall accuracy but 70% accuracy on a minority group may be unacceptable.
- Privacy: does the model leak training data? Differential privacy and federated learning address this.
MLOps Basics
MLOps
MLOps is the set of practices for deploying and maintaining ML models in production reliably and efficiently. It extends DevOps principles to ML systems, adding version control for data and models, experiment tracking, automated retraining, and model monitoring.
Key MLOps components:
- CI/CD for models: automated testing of model quality on every code change. Tests include data validation, model performance regression tests, and integration tests.
- Model registry: versioned storage of trained models with metadata (training data version, hyperparameters, metrics). Enables rollback.
- A/B testing: serve the new model to a fraction of traffic and compare against the current model on production metrics.
- Feature stores: centralized computation and serving of features, ensuring consistency between training and inference.
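The A/B-testing component above is often implemented with deterministic, hash-based traffic splitting, so each user consistently sees the same model variant; this is a sketch, with the split fraction and helper name chosen for illustration:

```python
import hashlib

def assign_variant(user_id: str, candidate_fraction: float = 0.1) -> str:
    """Stable assignment: hash the user id into [0, 1) and compare to the split."""
    h = hashlib.sha256(user_id.encode()).digest()
    bucket = int.from_bytes(h[:8], "big") / 2**64
    return "candidate" if bucket < candidate_fraction else "control"

# Roughly 10% of traffic should land in the candidate group.
counts = {"candidate": 0, "control": 0}
for i in range(10_000):
    counts[assign_variant(f"user-{i}")] += 1
```

Hashing rather than random sampling makes assignments reproducible across requests and services, which matters when comparing production metrics per user.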
Common Confusions
ML projects are not software projects with a model inside
Standard software is deterministic: given the same input, it produces the same output. ML systems are stochastic, data-dependent, and degrade silently. Testing, deployment, and monitoring all require structurally different approaches.
More data is not always better
More noisy data can hurt. More data from a different distribution than your deployment target hurts. Data quality (correct labels, representative distribution) matters more than data quantity beyond a sufficient threshold.
Key Takeaways
- Problem definition and data quality determine the ceiling; model choice determines how close you get to it
- Evaluation must use metrics aligned with the actual business objective
- Deployment and monitoring are where most engineering effort goes in production systems
- The full lifecycle is iterative: monitoring reveals problems that send you back to data collection or feature engineering
Exercises
Problem
You are building a fraud detection system. The dataset has 1% fraud cases and 99% legitimate transactions. A model that always predicts "legitimate" achieves 99% accuracy. Why is this model useless, and what metric should you use instead?
Problem
Your model achieves 92% accuracy on the test set, but after deployment, production accuracy drops to 84% within three months. List three possible causes and describe how you would diagnose each.
References
Canonical:
- Sculley et al., "Hidden Technical Debt in Machine Learning Systems" (2015), NeurIPS
- Polyzotis et al., "Data Management Challenges in Production Machine Learning" (2017), SIGMOD
Current:
- Google, "Rules of ML" (2023), Section on ML system design
- Huyen, Designing Machine Learning Systems (2022), Chapters 1-3
- Hastie, Tibshirani, Friedman, The Elements of Statistical Learning (2009), Chapters 7-8
- Shalev-Shwartz & Ben-David, Understanding Machine Learning (2014), Chapters 11-14
Next Topics
- Train-test split and data leakage: preventing information contamination
- Exploratory data analysis: understanding your data before modeling
- Experiment tracking and tooling: managing ML experiments systematically
Last reviewed: April 2026