Cubist and Model Trees
M5 model trees put a linear regression model at each leaf of a decision tree. Cubist extends this with rule simplification and smoothing, giving a useful hybrid of interpretability and predictive power.
Why This Matters
Standard regression trees predict a constant at each leaf. This is crude: if the true relationship within a leaf region is linear, a constant prediction wastes information. Model trees fix this by fitting a linear regression at each leaf. The result is a piecewise linear model with automatic partitioning of the feature space.
Cubist, developed by Quinlan (the creator of C4.5 and ID3), extends model trees with smoothing and rule simplification. It remains competitive on tabular regression problems, especially when the data has a mix of linear and nonlinear structure.
M5 Model Trees
Model Tree
A model tree is a decision tree for regression where each leaf node contains a linear regression model instead of a constant prediction. The tree partitions the feature space into regions, and within each region, a linear function of the features is used for prediction.
Building an M5 Model Tree
The construction follows Quinlan's M5 algorithm (1992):
- Grow: Build a regression tree using a splitting criterion that maximizes the reduction in prediction error.
- Fit leaf models: At each leaf, fit a linear regression using only the features that appear in the subtree above that leaf.
- Prune: Simplify leaf models by dropping features whose contribution is not justified by the data. Estimate error with a penalty for model complexity.
- Smooth: Optionally adjust leaf predictions using information from parent nodes.
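The grow and fit-leaf-models steps above can be sketched as follows. This is a minimal illustration assuming numpy, with pruning and smoothing omitted; the function names and nested-dict tree format are invented for this sketch, not Quinlan's implementation:

```python
import numpy as np

def sdr(y, mask):
    """Standard deviation reduction for a boolean split mask."""
    n = len(y)
    left, right = y[mask], y[~mask]
    if len(left) == 0 or len(right) == 0:
        return -np.inf
    return y.std() - len(left) / n * left.std() - len(right) / n * right.std()

def fit_model_tree(X, y, min_leaf=10):
    """Grow a tree greedily by SDR; fit a linear model at each leaf."""
    best = None
    for j in range(X.shape[1]):
        for t in np.unique(X[:, j])[:-1]:  # exclude max so both sides are nonempty
            gain = sdr(y, X[:, j] <= t)
            if best is None or gain > best[0]:
                best = (gain, j, t)
    if best is None or best[0] <= 0 or len(y) < 2 * min_leaf:
        # Leaf: ordinary least squares with an intercept column
        A = np.column_stack([np.ones(len(y)), X])
        coef, *_ = np.linalg.lstsq(A, y, rcond=None)
        return {"leaf": True, "coef": coef, "n": len(y)}
    _, j, t = best
    mask = X[:, j] <= t
    return {"leaf": False, "feature": j, "threshold": t,
            "left": fit_model_tree(X[mask], y[mask], min_leaf),
            "right": fit_model_tree(X[~mask], y[~mask], min_leaf)}

def predict_one(tree, x):
    """Route x to its leaf, then evaluate that leaf's linear model."""
    while not tree["leaf"]:
        tree = tree["left"] if x[tree["feature"]] <= tree["threshold"] else tree["right"]
    return tree["coef"][0] + tree["coef"][1:] @ x
```

Note that M5 proper restricts each leaf model to features used in the subtree above it; the sketch simply uses all features at every leaf.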
M5 Splitting Criterion
Statement
At each internal node, M5 selects the split that maximizes the standard deviation reduction (SDR):

$$\mathrm{SDR} = \mathrm{sd}(T) - \frac{|T_L|}{|T|}\,\mathrm{sd}(T_L) - \frac{|T_R|}{|T|}\,\mathrm{sd}(T_R)$$

where $T$ is the set of instances at the current node, $T_L$ and $T_R$ are the left and right subsets, and $\mathrm{sd}(\cdot)$ is the standard deviation of the target variable.
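As a quick numeric check of the criterion, on illustrative data where a split cleanly separates two clusters of target values:

```python
import numpy as np

# Target values at the current node; a candidate split puts the
# first five instances on the left and the rest on the right
y = np.array([1.0, 1.2, 0.9, 1.1, 1.0, 5.0, 5.2, 4.8, 5.1, 4.9])
y_left, y_right = y[:5], y[5:]

sdr = (y.std()
       - len(y_left) / len(y) * y_left.std()
       - len(y_right) / len(y) * y_right.std())
print(round(sdr, 3))  # large reduction: the split isolates the two clusters
```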
Intuition
This is the regression analog of information gain. We want the split that creates the most homogeneous subgroups. Maximizing SDR is equivalent to maximizing the between-group variance of the target, which is the same objective as minimizing within-group variance (the criterion used by CART for regression).
Proof Sketch
Total variance decomposes as within-group variance plus between-group variance. Maximizing SDR is equivalent to maximizing between-group variance. For a fixed partition into two groups, this is a standard ANOVA decomposition.
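In symbols, the decomposition behind this sketch is the law of total variance (a population-level statement, with $G \in \{L, R\}$ marking the side of the split):

```latex
\operatorname{Var}(Y)
  = \underbrace{\mathbb{E}\!\left[\operatorname{Var}(Y \mid G)\right]}_{\text{within-group}}
  + \underbrace{\operatorname{Var}\!\left(\mathbb{E}[Y \mid G]\right)}_{\text{between-group}}
```

Since $\operatorname{Var}(Y)$ at the node is fixed before splitting, minimizing within-group variance and maximizing between-group variance are the same objective. SDR combines standard deviations rather than variances, so it tracks this decomposition closely without being an exact instance of it.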
Why It Matters
The splitting criterion determines the partition of feature space. Good splits create regions where a linear model is a good local approximation. Poorly chosen splits create regions where the linear model must work harder.
Failure Mode
SDR, like all greedy splitting criteria, finds locally optimal splits. It may miss globally better partitions. Also, it does not account for the quality of the linear model that will be fit in each region. A split might produce homogeneous groups in terms of the target but create regions where the feature-target relationship is highly nonlinear.
Cubist
Cubist is Quinlan's commercial extension of M5, with several enhancements.
Rule Simplification
After building the tree, Cubist converts it to a set of rules. Each rule corresponds to a path from root to leaf. The rules are then simplified: conditions that do not contribute to prediction accuracy are dropped. This can produce rules with different conditions than the original tree paths.
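The path-to-rule conversion can be sketched as follows; the condition-dropping step itself is a heuristic search and is omitted, and the nested-dict tree format here is invented for illustration:

```python
def tree_to_rules(tree, conditions=()):
    """Enumerate one rule per root-to-leaf path.

    A rule is (list_of_conditions, leaf); each condition is a
    (feature_index, operator, threshold) triple. Cubist would then try
    to drop conditions that do not hurt predictive accuracy.
    """
    if tree.get("leaf"):
        return [(list(conditions), tree)]
    j, t = tree["feature"], tree["threshold"]
    left = tree_to_rules(tree["left"], conditions + ((j, "<=", t),))
    right = tree_to_rules(tree["right"], conditions + ((j, ">", t),))
    return left + right
```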
Smoothing
Cubist Smoothing Procedure
Statement
When predicting for a new instance, Cubist computes the leaf model prediction and then adjusts it using the predictions from models at ancestor nodes. Working from the leaf up to the root, at each internal node with model prediction $m$, where the child passing its prediction upward covers $n$ training instances:

$$\hat{y} \leftarrow \frac{n\,\hat{y}_{\text{child}} + k\,m}{n + k}$$

where $k$ is a smoothing constant (typically around 15) and $\hat{y}_{\text{child}}$ is the (already smoothed) prediction passed up from the child.
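A sketch of the bottom-up blending, assuming the tree walk has already produced a root-to-leaf path of model predictions and node instance counts; the function name and interface are invented here:

```python
def smooth(path_predictions, path_counts, k=15.0):
    """Blend predictions along a root-to-leaf path, Cubist-style.

    path_predictions: model predictions ordered root -> leaf
    path_counts: training-instance counts at each node, same order
    k: smoothing constant (around 15 is the commonly cited value)
    """
    # Start from the raw leaf prediction and fold in each ancestor,
    # weighting the smoothed child prediction by its instance count n
    # and the ancestor's own prediction by k
    pred = path_predictions[-1]
    for node_pred, child_n in zip(path_predictions[-2::-1], path_counts[:0:-1]):
        pred = (child_n * pred + k * node_pred) / (child_n + k)
    return pred
```

A leaf fit on few instances therefore gets pulled strongly toward its ancestors, while a well-populated leaf is left nearly unchanged.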
Intuition
Smoothing is a form of shrinkage. The leaf model is fit on a small subset of the data and may overfit. By blending the leaf prediction with ancestor predictions (which are fit on more data), we reduce variance at the cost of some bias. This is similar in spirit to the smoothing in hierarchical Bayesian models.
Proof Sketch
This is a heuristic, not derived from a formal optimization criterion. The weighted average gives more weight to the ancestor model when the leaf has few training instances ($n$ is small relative to $k$).
Why It Matters
Smoothing significantly improves prediction accuracy in practice, especially when some leaves have few training instances. It is one of the reasons Cubist often outperforms basic model trees.
Failure Mode
The smoothing constant $k$ is a hyperparameter. If set too high, predictions collapse toward the root model (underfitting). If set too low, smoothing has no effect. The value $k \approx 15$ works well empirically but has no theoretical justification.
Committees and Instance-Based Corrections
Cubist can also build a "committee" of model trees (similar to boosting: each subsequent tree corrects the residuals of the previous ones) and apply instance-based (nearest-neighbor) corrections to the final prediction. These extensions improve accuracy at the cost of interpretability.
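A toy sketch of the committee idea under a simple residual-based target adjustment (Cubist's exact adjustment rule is not published in detail; everything here is illustrative), using a plain linear fit as a stand-in for each member model:

```python
import numpy as np

def fit_linear(X, t):
    """Ordinary least squares with intercept (stand-in for a model tree)."""
    A = np.column_stack([np.ones(len(t)), X])
    coef, *_ = np.linalg.lstsq(A, t, rcond=None)
    return coef

def predict_linear(coef, x):
    return coef[0] + coef[1:] @ x

def fit_committee(X, y, n_members=5):
    """Each member trains on targets nudged opposite the previous
    member's errors; the final prediction AVERAGES members
    (gradient boosting would sum correction terms instead)."""
    members, target = [], y.astype(float)
    for _ in range(n_members):
        coef = fit_linear(X, target)
        pred = np.array([predict_linear(coef, x) for x in X])
        members.append(coef)
        target = target + (y - pred)  # illustrative adjustment rule
    return members

def committee_predict(members, x):
    return float(np.mean([predict_linear(c, x) for c in members]))
```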
Comparison with Other Methods
| Method | Leaf Model | Interpretability | Handles Nonlinearity |
|---|---|---|---|
| CART regression tree | Constant | High | Through partitioning |
| M5 model tree | Linear regression | Medium | Partitioning + local linear |
| Cubist | Linear + smoothing | Medium | Partitioning + local linear + corrections |
| Random forest | Constant (averaged) | Low | Through ensemble of partitions |
| Gradient boosting | Constant (summed) | Low | Through ensemble of partitions |
Model trees occupy a middle ground: more expressive than constant-leaf trees, more interpretable than ensemble methods.
When to Use Model Trees
Model trees work well when:
- The target variable has locally linear relationships with features
- You need some interpretability (you can inspect which features matter in which regions)
- The dataset is tabular and moderate-sized
- You want a single-model alternative to gradient boosting for regression
They work poorly when:
- The relationship is highly nonlinear everywhere (deep forests or neural networks may be better)
- The feature space is very high-dimensional with many irrelevant features
- You need state-of-the-art accuracy on a Kaggle-style competition (gradient boosting usually wins)
Common Confusions
Model trees are not the same as regression trees
A regression tree (CART) predicts a constant at each leaf. A model tree predicts a linear function at each leaf. The tree structure is the same; the leaf models differ. This distinction matters for capacity: a model tree with $L$ leaves and $p$ features has up to $L(p+1)$ parameters, while a regression tree with $L$ leaves has only $L$ parameters.
Cubist is not gradient boosting
Cubist's committee mode builds sequential trees that correct residuals, which sounds like gradient boosting. The difference is that each Cubist tree in the committee is a full model tree with linear leaves, and the final prediction is averaged across committee members (not summed as in gradient boosting). The mechanics are similar in spirit but different in detail.
Key Takeaways
- M5 model trees: decision trees with linear regression at each leaf
- Splitting by standard deviation reduction (equivalent to variance reduction)
- Cubist extends M5 with smoothing (shrinkage toward ancestor models), rule simplification, committees, and instance-based corrections
- Best suited for tabular regression with mixed linear/nonlinear structure
- More interpretable than ensembles, more expressive than constant-leaf trees
Exercises
Problem
A model tree has a leaf with 5 training instances. The leaf linear model predicts 12.0 for a new instance. The parent node model predicts 10.0 and has 50 training instances. Using smoothing constant $k = 15$, compute the smoothed prediction.
Problem
Explain why fitting a linear model only on the features that appear in the path from root to leaf (rather than all features) can improve generalization. Under what conditions would using all features be better?
References
Canonical:
- Quinlan, "Learning with Continuous Classes" (1992), Proceedings of AI'92
- Quinlan, "Combining Instance-Based and Model-Based Learning" (1993), ICML
Current:
- Kuhn & Johnson, Applied Predictive Modeling (2013), Chapter 8
- The Cubist R package documentation (Max Kuhn)
- Hastie, Tibshirani & Friedman, The Elements of Statistical Learning (2009), Chapters 3-15
- Bishop, Pattern Recognition and Machine Learning (2006), Chapters 1-14
Next Topics
- Gradient boosting: the dominant ensemble method for tabular regression
- Generalized additive models: another interpretable nonlinear regression approach
Last reviewed: April 2026
Prerequisites
Foundations this topic depends on.
- Decision Trees and Ensembles (Layer 2)
- Empirical Risk Minimization (Layer 2)
- Concentration Inequalities (Layer 1)
- Common Probability Distributions (Layer 0A)
- Sets, Functions, and Relations (Layer 0A)
- Basic Logic and Proof Techniques (Layer 0A)
- Expectation, Variance, Covariance, and Moments (Layer 0A)
- Bias-Variance Tradeoff (Layer 2)
- Linear Regression (Layer 1)
- Matrix Operations and Properties (Layer 0A)
- Maximum Likelihood Estimation (Layer 0B)
- Differentiation in Rⁿ (Layer 0A)