
ML Methods

Cubist and Model Trees

M5 model trees put a linear regression at each leaf of a decision tree. Cubist extends this with rule simplification and smoothing. A useful hybrid of interpretability and predictive power.


Why This Matters

Standard regression trees predict a constant at each leaf. This is crude: if the true relationship within a leaf region is linear, a constant prediction wastes information. Model trees fix this by fitting a linear regression at each leaf. The result is a piecewise linear model with automatic partitioning of the feature space.

Cubist, developed by Quinlan (the creator of C4.5 and ID3), extends model trees with smoothing and rule simplification. It remains competitive on tabular regression problems, especially when the data has a mix of linear and nonlinear structure.

M5 Model Trees

Definition

Model Tree

A model tree is a decision tree for regression where each leaf node contains a linear regression model instead of a constant prediction. The tree partitions the feature space into regions, and within each region, a linear function of the features is used for prediction.
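In symbols (notation mine, not from the original M5 paper): a model tree whose leaves correspond to regions $R_1, \dots, R_k$ partitioning the feature space predicts

$$\hat{f}(x) = \sum_{\ell=1}^{k} \mathbf{1}[x \in R_\ell] \left( \beta_\ell^\top x + b_\ell \right)$$

i.e., the prediction comes from the linear model $(\beta_\ell, b_\ell)$ of whichever leaf region contains $x$.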

Building an M5 Model Tree

The construction follows Quinlan's M5 algorithm (1992):

  1. Grow: Build a regression tree using a splitting criterion that maximizes the reduction in prediction error.
  2. Fit leaf models: At each leaf, fit a linear regression using only the features that appear in the subtree above that leaf.
  3. Prune: Simplify leaf models by dropping features whose contribution is not justified by the data. Estimate error with a penalty for model complexity.
  4. Smooth: Optionally adjust leaf predictions using information from parent nodes.
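The grow and fit-leaf-models steps can be sketched in code. This is an illustrative simplification (greedy standard-deviation-reduction splits, ordinary least squares at every leaf, all features available to each leaf model, no pruning or smoothing), not Quinlan's full M5; all function names are mine.

```python
import numpy as np

def sd_reduction(y, mask):
    """SDR for splitting targets y into y[mask] and y[~mask]."""
    n = len(y)
    return (np.std(y)
            - mask.sum() / n * np.std(y[mask])
            - (~mask).sum() / n * np.std(y[~mask]))

def best_split(X, y, min_leaf):
    """Greedy search over features and thresholds for the max-SDR split (step 1)."""
    best = None  # (sdr, feature, threshold)
    for j in range(X.shape[1]):
        for t in np.unique(X[:, j])[:-1]:
            mask = X[:, j] <= t
            if mask.sum() < min_leaf or (~mask).sum() < min_leaf:
                continue
            gain = sd_reduction(y, mask)
            if best is None or gain > best[0]:
                best = (gain, j, t)
    return best

def fit_model_tree(X, y, min_leaf=20):
    """Grow the tree; when no useful split remains, fit an OLS leaf (step 2)."""
    split = best_split(X, y, min_leaf)
    if split is None or split[0] <= 0:
        A = np.column_stack([X, np.ones(len(y))])   # features plus intercept
        coef, *_ = np.linalg.lstsq(A, y, rcond=None)
        return ("leaf", coef)
    _, j, t = split
    mask = X[:, j] <= t
    return ("node", j, t,
            fit_model_tree(X[mask], y[mask], min_leaf),
            fit_model_tree(X[~mask], y[~mask], min_leaf))

def predict_one(tree, x):
    """Route x to its leaf and evaluate that leaf's linear model."""
    if tree[0] == "leaf":
        return float(np.append(x, 1.0) @ tree[1])
    _, j, t, left, right = tree
    return predict_one(left if x[j] <= t else right, x)
```

On data that is genuinely piecewise linear, this sketch recovers the structure well: one split on the switching feature, then an exact linear fit within each region.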
Proposition

M5 Splitting Criterion

Statement

At each internal node, M5 selects the split that maximizes the standard deviation reduction (SDR):

$$\text{SDR} = \text{sd}(T) - \sum_{i \in \{L, R\}} \frac{|T_i|}{|T|} \, \text{sd}(T_i)$$

where $T$ is the set of instances at the current node, $T_L$ and $T_R$ are the left and right subsets, and $\text{sd}(\cdot)$ is the standard deviation of the target variable.
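A quick numeric check of the formula, with toy targets of my own choosing; `np.std` computes the population standard deviation, matching $\text{sd}(\cdot)$ above:

```python
import numpy as np

def sdr(parent, left, right):
    """SDR = sd(T) - sum over children of |T_i|/|T| * sd(T_i)."""
    n = len(parent)
    return np.std(parent) - sum(len(g) / n * np.std(g) for g in (left, right))

# Targets at a node: a low cluster and a high cluster.
y = np.array([1.0, 1.2, 0.9, 5.0, 5.3, 4.8])
good = sdr(y, y[:3], y[3:])      # this split separates the clusters
bad = sdr(y, y[::2], y[1::2])    # this split mixes them
print(good, bad)
```

The cluster-separating split scores far higher, which is exactly the behavior the criterion is designed to reward.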

Intuition

This is the regression analog of information gain. We want the split that creates the most homogeneous subgroups. Maximizing SDR means minimizing the weighted spread of the target within the children, the same goal as the within-group variance minimization used by CART for regression, though SDR weights standard deviations rather than variances, so the two criteria can rank candidate splits slightly differently.

Proof Sketch

For any fixed two-way partition, total variance decomposes as within-group variance plus between-group variance (the standard ANOVA decomposition), so reducing within-group spread necessarily increases between-group separation. SDR applies the same idea with standard deviations in place of variances, so its equivalence to the variance-reduction criterion is approximate rather than exact.

Why It Matters

The splitting criterion determines the partition of feature space. Good splits create regions where a linear model is a good local approximation. Poorly chosen splits create regions where the linear model must work harder.

Failure Mode

SDR, like all greedy splitting criteria, finds locally optimal splits. It may miss globally better partitions. Also, it does not account for the quality of the linear model that will be fit in each region. A split might produce homogeneous groups in terms of the target but create regions where the feature-target relationship is highly nonlinear.

Cubist

Cubist is Quinlan's commercial extension of M5, with several enhancements.

Rule Simplification

After building the tree, Cubist converts it to a set of rules. Each rule corresponds to a path from root to leaf. The rules are then simplified: conditions that do not contribute to prediction accuracy are dropped. This can produce rules with different conditions than the original tree paths.
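The path-to-rule conversion can be sketched as follows. Only the conversion itself is shown; Cubist's subsequent condition-dropping heuristics are part of the proprietary implementation, and the tuple encoding and names here are mine.

```python
# A toy tree: ("node", feature_index, threshold, left, right) or ("leaf", name).
tree = ("node", 0, 3.0,
        ("node", 1, 1.5, ("leaf", "model_1"), ("leaf", "model_2")),
        ("leaf", "model_3"))

def tree_to_rules(node, conds=()):
    """Each root-to-leaf path becomes one rule: IF conditions THEN leaf model."""
    if node[0] == "leaf":
        return [(list(conds), node[1])]
    _, j, t, left, right = node
    return (tree_to_rules(left, conds + (f"x{j} <= {t}",))
            + tree_to_rules(right, conds + (f"x{j} > {t}",)))

for conds, model in tree_to_rules(tree):
    print("IF", " AND ".join(conds), "THEN", model)
```

This prints one rule per leaf, e.g. `IF x0 > 3.0 THEN model_3`; Cubist would then try deleting individual conditions from each rule, keeping a deletion whenever it does not hurt estimated accuracy.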

Smoothing

Proposition

Cubist Smoothing Procedure

Statement

When predicting for a new instance, Cubist computes the leaf model prediction $p_L$ and then adjusts it using the models at ancestor nodes. Working from the leaf up to the root, at each internal node $i$ with model prediction $p_i$, the prediction passed up from the child is blended as:

$$p'_i = \frac{n \cdot p_{\text{child}} + c \cdot p_i}{n + c}$$

where $n$ is the number of training instances in the child's subset, $c$ is a smoothing constant (typically around 15), and $p_{\text{child}}$ is the (already smoothed) prediction passed up from the child.
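The leaf-to-root pass can be sketched as follows. This is a minimal illustration of Quinlan's weighting, in which the count $n$ that enters the blend is the number of training instances in the child's subset, so predictions from small leaves are pulled harder toward ancestor models; the function and parameter names are mine.

```python
def smooth(chain, c=15.0):
    """Blend predictions from leaf to root.

    chain: [(prediction, n_instances), ...] ordered leaf first, root last,
    where n_instances is the training count of that node's subset.
    """
    p, n = chain[0]                      # start from the raw leaf prediction
    for p_node, n_node in chain[1:]:
        # The prediction passed up is weighted by the child's count n;
        # the current node's own model is weighted by the constant c.
        p = (n * p + c * p_node) / (n + c)
        n = n_node
    return p

# Leaf predicts 8.0 from 10 instances; its parent's model predicts 6.0.
print(smooth([(8.0, 10), (6.0, 50)]))   # → 6.8
```

With only 10 instances at the leaf against $c = 15$, the parent model dominates: the blend $(10 \cdot 8 + 15 \cdot 6)/25 = 6.8$ lands much closer to the parent's 6.0 than to the leaf's 8.0.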

Intuition

Smoothing is a form of shrinkage. The leaf model is fit on a small subset of the data and may overfit. By blending the leaf prediction with ancestor predictions (which are fit on more data), we reduce variance at the cost of some bias. This is similar in spirit to the smoothing in hierarchical Bayesian models.

Proof Sketch

This is a heuristic, not derived from a formal optimization criterion. The weighted average gives more weight to the ancestor's model when the prediction passed up from below rests on few training instances ($n$ small relative to $c$).

Why It Matters

Smoothing significantly improves prediction accuracy in practice, especially when some leaves have few training instances. It is one of the reasons Cubist often outperforms basic model trees.

Failure Mode

The smoothing constant $c$ is a hyperparameter. If set too high, predictions collapse toward the root model (underfitting). If too low, smoothing has no effect. The value $c \approx 15$ works well empirically but has no theoretical justification.

Committees and Instance-Based Corrections

Cubist can also build a "committee" of model trees (similar to boosting: each subsequent tree corrects the residuals of the previous ones) and apply instance-based (nearest-neighbor) corrections to the final prediction. These extensions improve accuracy at the cost of interpretability.
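A simplified sketch of the committee idea, using OLS as a stand-in for the base learner (Cubist's actual members are full model trees). The target adjustment follows the description in Kuhn & Johnson (2013): each member trains on targets shifted opposite the previous member's errors, and predictions are averaged, not summed. All names here are mine.

```python
import numpy as np

def fit_ols(X, y):
    """Stand-in base learner; Cubist would fit a full model tree here."""
    A = np.column_stack([X, np.ones(len(y))])
    coef, *_ = np.linalg.lstsq(A, y, rcond=None)
    return coef

def predict_ols(coef, X):
    return np.column_stack([X, np.ones(len(X))]) @ coef

def fit_committee(X, y, n_members=5):
    """Train members sequentially on residual-adjusted targets."""
    members, target = [], y
    for _ in range(n_members):
        coef = fit_ols(X, target)
        members.append(coef)
        # Where the member overshot, push the next member's target down,
        # and vice versa: target = y - (prediction - y).
        target = y - (predict_ols(coef, X) - y)
    return members

def predict_committee(members, X):
    """Final prediction is the average over members (unlike boosting's sum)."""
    return np.mean([predict_ols(m, X) for m in members], axis=0)
```

With a linear base learner the members end up nearly identical, so the committee only pays off when the base learner (a model tree, in Cubist) makes structured errors that the adjusted targets can correct.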

Comparison with Other Methods

| Method | Leaf Model | Interpretability | Handles Nonlinearity |
|---|---|---|---|
| CART regression tree | Constant | High | Through partitioning |
| M5 model tree | Linear regression | Medium | Partitioning + local linear |
| Cubist | Linear + smoothing | Medium | Partitioning + local linear + corrections |
| Random forest | Constant (averaged) | Low | Through ensemble of partitions |
| Gradient boosting | Constant (summed) | Low | Through ensemble of partitions |

Model trees occupy a middle ground: more expressive than constant-leaf trees, more interpretable than ensemble methods.

When to Use Model Trees

Model trees work well when:

  • The target variable has locally linear relationships with features
  • You need some interpretability (you can inspect which features matter in which regions)
  • The dataset is tabular and moderate-sized
  • You want a single-model alternative to gradient boosting for regression

They work poorly when:

  • The relationship is highly nonlinear everywhere (deep forests or neural networks may be better)
  • The feature space is very high-dimensional with many irrelevant features
  • You need state-of-the-art accuracy on a Kaggle-style competition (gradient boosting usually wins)

Common Confusions

Watch Out

Model trees are not the same as regression trees

A regression tree (CART) predicts a constant at each leaf. A model tree predicts a linear function at each leaf. The tree structure is the same; the leaf models differ. This distinction matters because a model tree with $k$ leaves and $d$ features has up to $k(d+1)$ parameters, while a regression tree with $k$ leaves has only $k$ parameters.

Watch Out

Cubist is not gradient boosting

Cubist's committee mode builds sequential trees that correct residuals, which sounds like gradient boosting. The difference is that each Cubist tree in the committee is a full model tree with linear leaves, and the final prediction is averaged across committee members (not summed as in gradient boosting). The mechanics are similar in spirit but different in detail.

Key Takeaways

  • M5 model trees: decision trees with linear regression at each leaf
  • Splitting by standard deviation reduction (equivalent to variance reduction)
  • Cubist extends M5 with smoothing (shrinkage toward ancestor models), rule simplification, committees, and instance-based corrections
  • Best suited for tabular regression with mixed linear/nonlinear structure
  • More interpretable than ensembles, more expressive than constant-leaf trees

Exercises

ExerciseCore

Problem

A model tree has a leaf with 5 training instances. The leaf linear model predicts 12.0 for a new instance. The parent node model predicts 10.0 and has 50 training instances. Using smoothing constant $c = 15$, compute the smoothed prediction.

ExerciseAdvanced

Problem

Explain why fitting a linear model only on the features that appear in the path from root to leaf (rather than all features) can improve generalization. Under what conditions would using all features be better?

References

Canonical:

  • Quinlan, "Learning with Continuous Classes" (1992), Proceedings of AI'92
  • Quinlan, "Combining Instance-Based and Model-Based Learning" (1993), ICML

Current:

  • Kuhn & Johnson, Applied Predictive Modeling (2013), Chapter 8

  • The Cubist R package documentation (Max Kuhn)

  • Hastie, Tibshirani & Friedman, The Elements of Statistical Learning (2009), Chapter 9 (tree-based methods)

  • Bishop, Pattern Recognition and Machine Learning (2006), Chapter 14 (combining models)

Next Topics

  • Gradient boosting: the dominant ensemble method for tabular regression
  • Generalized additive models: another interpretable nonlinear regression approach

Last reviewed: April 2026
