Cubist and Model Trees
M5 model trees put a linear regression model at each leaf of a decision tree. Cubist extends this with rule simplification and smoothing, giving a useful hybrid of interpretability and predictive power.
Why This Matters
Standard regression trees predict a constant at each leaf. This is crude: if the true relationship within a leaf region is linear, a constant prediction wastes information. Model trees fix this by fitting a linear regression at each leaf. The result is a piecewise linear model with automatic partitioning of the feature space.
Cubist, developed by Quinlan (the creator of C4.5 and ID3), extends model trees with smoothing and rule simplification. It remains competitive on tabular regression problems, especially when the data has a mix of linear and nonlinear structure.
M5 Model Trees
Model Tree
A model tree is a decision tree for regression where each leaf node contains a linear regression model instead of a constant prediction. The tree partitions the feature space into regions, and within each region, a linear function of the features is used for prediction.
Building an M5 Model Tree
The construction follows Quinlan's M5 algorithm (1992):
- Grow: Build a regression tree using a splitting criterion that maximizes the reduction in prediction error.
- Fit leaf models: At each leaf, fit a linear regression using only the features that appear in the subtree above that leaf.
- Prune: Simplify leaf models by dropping features whose contribution is not justified by the data. Estimate error with a penalty for model complexity.
- Smooth: Optionally adjust leaf predictions using information from parent nodes.
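The grow and fit-leaf-models steps above can be sketched as follows. This is a minimal illustration assuming numpy, with pruning and smoothing omitted; the function names and nested-dict tree format are invented for this sketch, not Quinlan's implementation:

```python
import numpy as np

def sdr(y, mask):
    """Standard deviation reduction for a boolean split mask."""
    n = len(y)
    left, right = y[mask], y[~mask]
    if len(left) == 0 or len(right) == 0:
        return -np.inf
    return y.std() - len(left) / n * left.std() - len(right) / n * right.std()

def fit_model_tree(X, y, min_leaf=10):
    """Grow a tree greedily by SDR; fit a linear model at each leaf."""
    best = None
    for j in range(X.shape[1]):
        for t in np.unique(X[:, j])[:-1]:  # exclude max so both sides are nonempty
            gain = sdr(y, X[:, j] <= t)
            if best is None or gain > best[0]:
                best = (gain, j, t)
    if best is None or best[0] <= 0 or len(y) < 2 * min_leaf:
        # Leaf: ordinary least squares with an intercept column
        A = np.column_stack([np.ones(len(y)), X])
        coef, *_ = np.linalg.lstsq(A, y, rcond=None)
        return {"leaf": True, "coef": coef, "n": len(y)}
    _, j, t = best
    mask = X[:, j] <= t
    return {"leaf": False, "feature": j, "threshold": t,
            "left": fit_model_tree(X[mask], y[mask], min_leaf),
            "right": fit_model_tree(X[~mask], y[~mask], min_leaf)}

def predict_one(tree, x):
    """Route x to its leaf, then evaluate that leaf's linear model."""
    while not tree["leaf"]:
        tree = tree["left"] if x[tree["feature"]] <= tree["threshold"] else tree["right"]
    return tree["coef"][0] + tree["coef"][1:] @ x
```

Note that M5 proper restricts each leaf model to features used in the subtree above it; the sketch simply uses all features at every leaf.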
M5 Splitting Criterion
Statement
At each internal node, M5 selects the split that maximizes the standard deviation reduction (SDR):

$$\mathrm{SDR} = \mathrm{sd}(T) - \frac{|T_L|}{|T|}\,\mathrm{sd}(T_L) - \frac{|T_R|}{|T|}\,\mathrm{sd}(T_R)$$

where $T$ is the set of instances at the current node, $T_L$ and $T_R$ are the left and right subsets, and $\mathrm{sd}(\cdot)$ is the standard deviation of the target variable.
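As a quick numeric check of the criterion, on illustrative data where a split cleanly separates two clusters of target values:

```python
import numpy as np

# Target values at the current node; a candidate split puts the
# first five instances on the left and the rest on the right
y = np.array([1.0, 1.2, 0.9, 1.1, 1.0, 5.0, 5.2, 4.8, 5.1, 4.9])
y_left, y_right = y[:5], y[5:]

sdr = (y.std()
       - len(y_left) / len(y) * y_left.std()
       - len(y_right) / len(y) * y_right.std())
print(round(sdr, 3))  # large reduction: the split isolates the two clusters
```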
Intuition
This is the regression analog of information gain. We want the split that creates the most homogeneous subgroups. Maximizing SDR is equivalent to maximizing the between-group variance of the target, which is the same objective as minimizing within-group variance (the criterion used by CART for regression).
Proof Sketch
Total variance decomposes as within-group variance plus between-group variance. Maximizing SDR is equivalent to maximizing between-group variance. For a fixed partition into two groups, this is a standard ANOVA decomposition.
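In symbols, the decomposition behind this sketch is the law of total variance (a population-level statement, with $G \in \{L, R\}$ marking the side of the split):

```latex
\operatorname{Var}(Y)
  = \underbrace{\mathbb{E}\!\left[\operatorname{Var}(Y \mid G)\right]}_{\text{within-group}}
  + \underbrace{\operatorname{Var}\!\left(\mathbb{E}[Y \mid G]\right)}_{\text{between-group}}
```

Since $\operatorname{Var}(Y)$ at the node is fixed before splitting, minimizing within-group variance and maximizing between-group variance are the same objective. SDR combines standard deviations rather than variances, so it tracks this decomposition closely without being an exact instance of it.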
Why It Matters
The splitting criterion determines the partition of feature space. Good splits create regions where a linear model is a good local approximation. Poorly chosen splits create regions where the linear model must work harder.
Failure Mode
SDR, like all greedy splitting criteria, finds locally optimal splits. It may miss globally better partitions. Also, it does not account for the quality of the linear model that will be fit in each region. A split might produce homogeneous groups in terms of the target but create regions where the feature-target relationship is highly nonlinear.
Cubist
Cubist is Quinlan's commercial extension of M5, with several enhancements.
Rule Simplification
After building the tree, Cubist converts it to a set of rules. Each rule corresponds to a path from root to leaf. The rules are then simplified: conditions that do not contribute to prediction accuracy are dropped. This can produce rules with different conditions than the original tree paths.
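The path-to-rule conversion can be sketched as follows; the condition-dropping step itself is a heuristic search and is omitted, and the nested-dict tree format here is invented for illustration:

```python
def tree_to_rules(tree, conditions=()):
    """Enumerate one rule per root-to-leaf path.

    A rule is (list_of_conditions, leaf); each condition is a
    (feature_index, operator, threshold) triple. Cubist would then try
    to drop conditions that do not hurt predictive accuracy.
    """
    if tree.get("leaf"):
        return [(list(conditions), tree)]
    j, t = tree["feature"], tree["threshold"]
    left = tree_to_rules(tree["left"], conditions + ((j, "<=", t),))
    right = tree_to_rules(tree["right"], conditions + ((j, ">", t),))
    return left + right
```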
Smoothing
Cubist Smoothing Procedure
Statement
When predicting for a new instance, Cubist computes the leaf model prediction and then adjusts it using the predictions from models at ancestor nodes. Working from the leaf up to the root, at each internal node with model prediction $m$, where the child passing its prediction upward covers $n$ training instances:

$$\hat{y} \leftarrow \frac{n\,\hat{y}_{\text{child}} + k\,m}{n + k}$$

where $k$ is a smoothing constant (typically around 15) and $\hat{y}_{\text{child}}$ is the (already smoothed) prediction passed up from the child.
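A sketch of the bottom-up blending, assuming the tree walk has already produced a root-to-leaf path of model predictions and node instance counts; the function name and interface are invented here:

```python
def smooth(path_predictions, path_counts, k=15.0):
    """Blend predictions along a root-to-leaf path, Cubist-style.

    path_predictions: model predictions ordered root -> leaf
    path_counts: training-instance counts at each node, same order
    k: smoothing constant (around 15 is the commonly cited value)
    """
    # Start from the raw leaf prediction and fold in each ancestor,
    # weighting the smoothed child prediction by its instance count n
    # and the ancestor's own prediction by k
    pred = path_predictions[-1]
    for node_pred, child_n in zip(path_predictions[-2::-1], path_counts[:0:-1]):
        pred = (child_n * pred + k * node_pred) / (child_n + k)
    return pred
```

A leaf fit on few instances therefore gets pulled strongly toward its ancestors, while a well-populated leaf is left nearly unchanged.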
Intuition
Smoothing is a form of shrinkage. The leaf model is fit on a small subset of the data and may overfit. By blending the leaf prediction with ancestor predictions (which are fit on more data), we reduce variance at the cost of some bias. This is similar in spirit to the smoothing in hierarchical Bayesian models.
Proof Sketch
This is a heuristic, not derived from a formal optimization criterion. The weighted average gives more weight to the ancestor model when the leaf has few training instances ($n$ is small relative to $k$).
Why It Matters
Smoothing significantly improves prediction accuracy in practice, especially when some leaves have few training instances. It is one of the reasons Cubist often outperforms basic model trees.
Failure Mode
The smoothing constant $k$ is a hyperparameter. If set too high, predictions collapse toward the root model (underfitting). If set too low, smoothing has no effect. The value $k \approx 15$ works well empirically but has no theoretical justification.
Committees and Instance-Based Corrections
Cubist can also build a "committee" of model trees (similar to boosting: each subsequent tree corrects the residuals of the previous ones) and apply instance-based (nearest-neighbor) corrections to the final prediction. These extensions improve accuracy at the cost of interpretability.
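A toy sketch of the committee idea under a simple residual-based target adjustment (Cubist's exact adjustment rule is not published in detail; everything here is illustrative), using a plain linear fit as a stand-in for each member model:

```python
import numpy as np

def fit_linear(X, t):
    """Ordinary least squares with intercept (stand-in for a model tree)."""
    A = np.column_stack([np.ones(len(t)), X])
    coef, *_ = np.linalg.lstsq(A, t, rcond=None)
    return coef

def predict_linear(coef, x):
    return coef[0] + coef[1:] @ x

def fit_committee(X, y, n_members=5):
    """Each member trains on targets nudged opposite the previous
    member's errors; the final prediction AVERAGES members
    (gradient boosting would sum correction terms instead)."""
    members, target = [], y.astype(float)
    for _ in range(n_members):
        coef = fit_linear(X, target)
        pred = np.array([predict_linear(coef, x) for x in X])
        members.append(coef)
        target = target + (y - pred)  # illustrative adjustment rule
    return members

def committee_predict(members, x):
    return float(np.mean([predict_linear(c, x) for c in members]))
```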
Comparison with Other Methods
| Method | Leaf Model | Interpretability | Handles Nonlinearity |
|---|---|---|---|
| CART regression tree | Constant | High | Through partitioning |
| M5 model tree | Linear regression | Medium | Partitioning + local linear |
| Cubist | Linear + smoothing | Medium | Partitioning + local linear + corrections |
| Random forest | Constant (averaged) | Low | Through ensemble of partitions |
| Gradient boosting | Constant (summed) | Low | Through ensemble of partitions |
Model trees occupy a middle ground: more expressive than constant-leaf trees, more interpretable than ensemble methods.
When to Use Model Trees
Model trees work well when:
- The target variable has locally linear relationships with features
- You need some interpretability (you can inspect which features matter in which regions)
- The dataset is tabular and moderate-sized
- You want a single-model alternative to gradient boosting for regression
They work poorly when:
- The relationship is highly nonlinear everywhere (deep forests or neural networks may be better)
- The feature space is very high-dimensional with many irrelevant features
- You need state-of-the-art accuracy on a Kaggle-style competition (gradient boosting usually wins)
Common Confusions
Model trees are not the same as regression trees
A regression tree (CART) predicts a constant at each leaf. A model tree predicts a linear function at each leaf. The tree structure is the same; the leaf models differ. This distinction matters for capacity: a model tree with $L$ leaves and $p$ features has up to $L(p+1)$ parameters, while a regression tree with $L$ leaves has only $L$ parameters.
Cubist is not gradient boosting
Cubist's committee mode builds sequential trees that correct residuals, which sounds like gradient boosting. The difference is that each Cubist tree in the committee is a full model tree with linear leaves, and the final prediction is averaged across committee members (not summed as in gradient boosting). The mechanics are similar in spirit but different in detail.
Key Takeaways
- M5 model trees: decision trees with linear regression at each leaf
- Splitting by standard deviation reduction (equivalent to variance reduction)
- Cubist extends M5 with smoothing (shrinkage toward ancestor models), rule simplification, committees, and instance-based corrections
- Best suited for tabular regression with mixed linear/nonlinear structure
- More interpretable than ensembles, more expressive than constant-leaf trees
Exercises
Problem
A model tree has a leaf with 5 training instances. The leaf linear model predicts 12.0 for a new instance. The parent node model predicts 10.0 and has 50 training instances. Using smoothing constant $k = 15$, compute the smoothed prediction.
Problem
Explain why fitting a linear model only on the features that appear in the path from root to leaf (rather than all features) can improve generalization. Under what conditions would using all features be better?
References
Canonical:
- Quinlan, "Learning with Continuous Classes" (1992), Proceedings of AI'92
- Quinlan, "Combining Instance-Based and Model-Based Learning" (1993), ICML
Current:
- Kuhn & Johnson, Applied Predictive Modeling (2013), Chapter 8
- The Cubist R package documentation (Max Kuhn)
- Hastie, Tibshirani & Friedman, The Elements of Statistical Learning (2009), Chapters 3-15
- Bishop, Pattern Recognition and Machine Learning (2006), Chapters 1-14
Next Topics
- Gradient boosting: the dominant ensemble method for tabular regression
- Generalized additive models: another interpretable nonlinear regression approach
Last reviewed: April 2026
Prerequisites
Foundations this topic depends on.
- Decision Trees and Ensembles (Layer 2)
- Empirical Risk Minimization (Layer 2)
- Concentration Inequalities (Layer 1)
- Common Probability Distributions (Layer 0A)
- Sets, Functions, and Relations (Layer 0A)
- Basic Logic and Proof Techniques (Layer 0A)
- Expectation, Variance, Covariance, and Moments (Layer 0A)
- Bias-Variance Tradeoff (Layer 2)
- Linear Regression (Layer 1)
- Matrix Operations and Properties (Layer 0A)
- Maximum Likelihood Estimation (Layer 0B)
- Differentiation in Rⁿ (Layer 0A)