Statistical Foundations

GREG Estimator

The generalized regression estimator is the standard model-assisted survey estimator: Horvitz-Thompson plus a regression correction using known auxiliary totals.

Advanced · Tier 2 · Stable · ~45 min

Why This Matters

The Horvitz-Thompson estimator is design-unbiased, but it can be noisy. A pure model-based regression estimator can be efficient, but it inherits model risk. The generalized regression estimator, usually shortened to GREG, is the standard compromise in survey statistics: use a regression model to gain efficiency, but keep the design-based backbone visible.

This is the canonical model-assisted estimator in official statistics. It also matters conceptually because it is the cleanest bridge between three ideas that often get taught separately:

  1. Horvitz-Thompson design weighting
  2. regression adjustment
  3. calibration weighting

For anyone building a strong grounding in survey methodology, this page is one of the load-bearing links.

Mental Model

Start with the Horvitz-Thompson estimate of a population total. Then ask a simple question:

We know the population total of some auxiliary variable $x$. How much does the weighted sample miss that total, and how should that discrepancy change the estimate of $y$?

GREG answers by fitting a regression of $y$ on $x$ in the sample, then using the known population total of $x$ to correct the Horvitz-Thompson estimate of $y$.

If the sample underrepresents units with large $x$ and $x$ is positively related to $y$, GREG pushes the estimate upward. If the sample already matches the auxiliary totals well, the correction is small.

Formal Setup

Definition

Horvitz-Thompson Baseline

Let $U = \{1,\ldots,N\}$ be a finite population, $s$ a probability sample, and $d_i = 1 / \pi_i$ the design weight for sampled unit $i$, where $\pi_i > 0$ is the inclusion probability of unit $i$. The Horvitz-Thompson estimator of a population total $T_y = \sum_{i \in U} y_i$ is

$$\hat{T}_{y,\mathrm{HT}} = \sum_{i \in s} d_i y_i.$$

For an auxiliary vector $x_i \in \mathbb{R}^p$ whose population total $X = \sum_{i \in U} x_i$ is known, the Horvitz-Thompson estimate of that total is

$$\hat{T}_{x,\mathrm{HT}} = \sum_{i \in s} d_i x_i.$$
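A minimal Python sketch of these two Horvitz-Thompson totals, assuming NumPy is available. The helper name `ht_total` and the four-unit sample are illustrative assumptions, not part of the definitions above.

```python
import numpy as np

def ht_total(values, pi):
    """Horvitz-Thompson total: sum over the sample of values_i / pi_i."""
    values = np.asarray(values, dtype=float)
    d = 1.0 / np.asarray(pi, dtype=float)  # design weights d_i = 1 / pi_i
    # Works for a study variable of shape (n,) or auxiliaries of shape (n, p).
    return d @ values

# Hypothetical sample of four units (made-up numbers, for illustration only).
pi = np.array([0.5, 0.25, 0.25, 0.1])      # inclusion probabilities
y = np.array([10.0, 20.0, 30.0, 40.0])     # study variable
x = np.column_stack([np.ones(4), [1.0, 2.0, 3.0, 4.0]])  # intercept + one auxiliary

T_y_ht = ht_total(y, pi)  # estimate of T_y
T_x_ht = ht_total(x, pi)  # estimate of X; the intercept entry estimates N
print(T_y_ht, T_x_ht)
```

With an intercept column in `x`, the first entry of `T_x_ht` is the sum of the design weights, which is the Horvitz-Thompson estimate of the population size.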

Definition

GREG Estimator

Let $\hat{\beta}$ be a weighted sample regression coefficient estimating the relationship between $y$ and $x$. The generalized regression estimator of $T_y$ is

$$\hat{T}_{y,\mathrm{GREG}} = \hat{T}_{y,\mathrm{HT}} + \left(X - \hat{T}_{x,\mathrm{HT}}\right)^\top \hat{\beta}.$$

This is the Horvitz-Thompson estimator plus a regression correction driven by the gap between the known auxiliary totals and their Horvitz-Thompson estimates.

Definition

Weighted Sample Regression

A standard GREG choice is

$$\hat{\beta} = \left(\sum_{i \in s} d_i q_i x_i x_i^\top\right)^{-1} \left(\sum_{i \in s} d_i q_i x_i y_i\right),$$

where $q_i > 0$ are optional tuning constants (often $q_i = 1$). This is a weighted least-squares fit of $y$ on $x$ over the sample with weights $d_i q_i$; to fit an intercept, include a constant in $x_i$.
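The pieces so far can be combined in a few lines. The following is a sketch under stated assumptions, not a production implementation: the function name `greg_total`, its argument names, and the toy data are hypothetical, and `q` defaults to all ones.

```python
import numpy as np

def greg_total(y, x, d, X_pop, q=None):
    """GREG estimate of a total; returns (estimate, beta_hat)."""
    y = np.asarray(y, dtype=float)
    x = np.asarray(x, dtype=float)          # shape (n, p)
    d = np.asarray(d, dtype=float)          # design weights
    X_pop = np.asarray(X_pop, dtype=float)  # known auxiliary totals, shape (p,)
    q = np.ones_like(d) if q is None else np.asarray(q, dtype=float)

    w = d * q
    M = x.T @ (w[:, None] * x)              # sum_i d_i q_i x_i x_i^T
    b = x.T @ (w * y)                       # sum_i d_i q_i x_i y_i
    beta = np.linalg.solve(M, b)            # weighted least-squares coefficients

    T_y_ht = d @ y                          # Horvitz-Thompson total of y
    T_x_ht = d @ x                          # Horvitz-Thompson totals of x
    return T_y_ht + (X_pop - T_x_ht) @ beta, beta

# Toy data (made up): y is exactly 10 times the auxiliary variable.
d = np.array([2.0, 4.0, 4.0, 10.0])
aux = np.array([1.0, 2.0, 3.0, 4.0])
x = np.column_stack([np.ones(4), aux])
y = 10.0 * aux
X_pop = np.array([25.0, 70.0])              # assumed known N and total of aux

est, beta = greg_total(y, x, d, X_pop)
print(est, beta)
```

Because `y` is exactly linear in the auxiliary variable here, the fit has no residuals and the estimate reduces to the known totals times the fitted coefficients.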

Main Theorem

Theorem

GREG as Quadratic Calibration Estimator

Statement

Consider calibration weights $w_i$ obtained by minimizing

$$\sum_{i \in s} \frac{(w_i - d_i)^2}{2 d_i q_i}$$

subject to the calibration constraint

$$\sum_{i \in s} w_i x_i = X.$$

Then the solution is

$$w_i = d_i\left(1 + q_i x_i^\top \lambda\right),$$

where

$$\lambda = \left(\sum_{i \in s} d_i q_i x_i x_i^\top\right)^{-1} \left(X - \hat{T}_{x,\mathrm{HT}}\right).$$

The resulting calibrated estimator of the total,

$$\hat{T}_{y,\mathrm{cal}} = \sum_{i \in s} w_i y_i,$$

is exactly the GREG estimator:

$$\hat{T}_{y,\mathrm{cal}} = \hat{T}_{y,\mathrm{HT}} + \left(X - \hat{T}_{x,\mathrm{HT}}\right)^\top \hat{\beta}.$$

Intuition

Quadratic calibration says: change the design weights as little as possible, measured in squared distance, while forcing the weighted sample to match the known auxiliary totals. The induced estimator is not merely similar to GREG. It is GREG.
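The identity can be checked numerically. The sketch below builds the calibration weights from the closed form in the theorem and compares the calibrated total with the regression form of GREG. The simulated sample and the "known" totals in `X_pop` are made-up values chosen only for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 50
d = rng.uniform(1.0, 10.0, size=n)          # hypothetical design weights
q = np.ones(n)                              # tuning constants, all one here
x = np.column_stack([np.ones(n), rng.gamma(2.0, 3.0, size=n)])
y = 5.0 + 2.0 * x[:, 1] + rng.normal(0.0, 1.0, size=n)
X_pop = np.array([300.0, 1900.0])           # assumed known auxiliary totals

M = x.T @ ((d * q)[:, None] * x)            # weighted moment matrix
T_x_ht = d @ x

# Calibration route: lambda, then the adjusted weights w_i.
lam = np.linalg.solve(M, X_pop - T_x_ht)
w = d * (1.0 + q * (x @ lam))
cal = w @ y

# Regression route: beta_hat, then HT plus correction.
beta = np.linalg.solve(M, x.T @ (d * q * y))
greg = d @ y + (X_pop - T_x_ht) @ beta

print(np.allclose(w @ x, X_pop))            # weights reproduce the known totals
print(np.isclose(cal, greg))                # the two routes agree
```

Both checks hold up to floating-point error for any full-rank sample, since the equality is algebraic rather than statistical.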

Proof Sketch

Write the Lagrangian for the quadratic objective plus the calibration constraint. Differentiating with respect to $w_i$ gives the affine form $w_i = d_i(1 + q_i x_i^\top \lambda)$. Substituting into the constraint yields the closed-form expression for $\lambda$. Expanding $\sum_{i \in s} w_i y_i$ then gives the Horvitz-Thompson term plus the regression correction, with $\hat{\beta} = (\sum_{i \in s} d_i q_i x_i x_i^\top)^{-1}\sum_{i \in s} d_i q_i x_i y_i$.
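Written out, the steps of the sketch look as follows. The shorthand $M$ for the weighted moment matrix is introduced here only for brevity, and the last line uses the fact that $M$ is symmetric.

```latex
\begin{aligned}
\mathcal{L}(w,\lambda)
  &= \sum_{i \in s} \frac{(w_i - d_i)^2}{2 d_i q_i}
     - \lambda^\top \Big( \sum_{i \in s} w_i x_i - X \Big), \\
0 = \frac{\partial \mathcal{L}}{\partial w_i}
  &= \frac{w_i - d_i}{d_i q_i} - x_i^\top \lambda
  \quad\Longrightarrow\quad
  w_i = d_i \left( 1 + q_i x_i^\top \lambda \right), \\
X = \sum_{i \in s} w_i x_i
  &= \hat{T}_{x,\mathrm{HT}} + M \lambda,
  \qquad M = \sum_{i \in s} d_i q_i x_i x_i^\top,
  \quad\Longrightarrow\quad
  \lambda = M^{-1} \left( X - \hat{T}_{x,\mathrm{HT}} \right), \\
\sum_{i \in s} w_i y_i
  &= \hat{T}_{y,\mathrm{HT}} + \lambda^\top \sum_{i \in s} d_i q_i x_i y_i
   = \hat{T}_{y,\mathrm{HT}}
     + \left( X - \hat{T}_{x,\mathrm{HT}} \right)^\top
       M^{-1} \sum_{i \in s} d_i q_i x_i y_i .
\end{aligned}
```

The final factor $M^{-1} \sum_{i \in s} d_i q_i x_i y_i$ is the weighted regression coefficient $\hat{\beta}$, which gives the GREG form.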

Why It Matters

This theorem unifies the model-assisted and calibration views. GREG is not a random weighted regression trick; it is the quadratic-distance member of the calibration family. That is why it sits naturally between Horvitz-Thompson and general calibration or raking.

Failure Mode

If the weighted moment matrix is singular or ill-conditioned, the regression correction is unstable. If the auxiliary variables are weakly related to $y$, the correction adds little and can even increase variance. If the known total $X$ is itself wrong or poorly aligned with the target population, the calibration correction moves the estimate in the wrong direction.
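One simple diagnostic for the first failure mode is the condition number of the weighted moment matrix. The sketch below is illustrative only: the helper `moment_matrix` and the simulated data are assumptions, and no particular threshold is implied as a standard.

```python
import numpy as np

def moment_matrix(x, d, q=None):
    """Weighted moment matrix sum_i d_i q_i x_i x_i^T."""
    x = np.asarray(x, dtype=float)
    d = np.asarray(d, dtype=float)
    q = np.ones_like(d) if q is None else np.asarray(q, dtype=float)
    return x.T @ ((d * q)[:, None] * x)

rng = np.random.default_rng(1)
n = 40
d = np.full(n, 5.0)                         # equal design weights (made up)
x1 = rng.normal(10.0, 2.0, size=n)

# Intercept plus one auxiliary: a well-behaved case.
x_ok = np.column_stack([np.ones(n), x1])
# Add a near-copy of x1: the columns are almost collinear.
x_bad = np.column_stack([np.ones(n), x1, x1 + 1e-8 * rng.normal(size=n)])

cond_ok = np.linalg.cond(moment_matrix(x_ok, d))
cond_bad = np.linalg.cond(moment_matrix(x_bad, d))
print(cond_ok, cond_bad)
```

A very large condition number signals that small changes in the sample can swing $\hat{\beta}$, and with it the correction term. Dropping or combining redundant auxiliaries is one common response.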

Design Properties

The exact theorem above is algebraic. The inferential appeal of GREG comes from its design properties.

  • Under standard regularity conditions, GREG is design-consistent for the finite-population total.
  • If the working linear model for $y$ on $x$ is good, GREG is typically more efficient than Horvitz-Thompson because the correction absorbs predictable structure.
  • If the model is wrong, GREG still keeps the design-based Horvitz-Thompson core. That is the point of model-assisted inference: use the model for efficiency, not for full justification.

This is why GREG is often the first serious answer when someone asks, "Can I use a model without giving up design-based validity?"
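A small Monte Carlo experiment can illustrate these properties. The sketch below assumes simple random sampling without replacement from a synthetic population in which the true relationship has a quadratic term, so the linear working model is deliberately misspecified. All population parameters are made up.

```python
import numpy as np

rng = np.random.default_rng(42)
N, n, reps = 2000, 100, 1000

# Synthetic finite population; the linear working model omits the x^2 term.
x_pop = rng.gamma(shape=2.0, scale=5.0, size=N)
y_pop = 20.0 + 3.0 * x_pop + 0.05 * x_pop**2 + rng.normal(0.0, 5.0, size=N)
T_y = y_pop.sum()

X_mat = np.column_stack([np.ones(N), x_pop])  # intercept + auxiliary
X_tot = X_mat.sum(axis=0)                     # known population totals

d = N / n                                     # equal design weights under SRS
ht = np.empty(reps)
greg = np.empty(reps)
for r in range(reps):
    s = rng.choice(N, size=n, replace=False)
    xs, ys = X_mat[s], y_pop[s]
    # Equal weights and q_i = 1, so the weighted fit reduces to OLS.
    beta = np.linalg.lstsq(xs, ys, rcond=None)[0]
    ht[r] = d * ys.sum()
    greg[r] = ht[r] + (X_tot - d * xs.sum(axis=0)) @ beta

rel_bias_greg = (greg.mean() - T_y) / T_y
sd_ht, sd_greg = ht.std(ddof=1), greg.std(ddof=1)
print(rel_bias_greg, sd_ht, sd_greg)
```

Under these assumptions the GREG estimates should stay centered near the true total with a much smaller spread than Horvitz-Thompson, even though the fitted line is not the true model.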

GREG vs. HT vs. Pure Regression

  • Horvitz-Thompson uses only the design. It is robust, but often noisy.
  • Pure regression estimator predicts the population from a model. It can be efficient, but it leans heavily on model correctness.
  • GREG starts from Horvitz-Thompson, then applies a model-based correction only to the discrepancy in known auxiliary totals.

That structure matters. GREG is not "just run weighted least squares." The known population total $X$ is part of the estimator itself. Without that known benchmark, you do not have GREG.

Canonical Example

Example

Correcting a payroll total with known employee counts

Suppose a business survey estimates total payroll $T_y$. For each sampled firm, you also know employee count $x_i$, and the total number of employees in the population is known from a register: $X = 1000$.

The Horvitz-Thompson estimate of total payroll is 4.55 million USD. But the Horvitz-Thompson estimate of total employees from the same sample is only 920, so the weighted sample undercovers employee count by 80. A weighted sample regression gives an estimated slope of 4200 USD payroll per employee.

The GREG correction is

$$\left(1000 - 920\right) \times 4200 = 336{,}000.$$

So the GREG estimate becomes

$$4.55 \text{ million USD} + 0.336 \text{ million USD} = 4.886 \text{ million USD}.$$

The intuition is simple: the sample appears to miss employees, and payroll is strongly related to employee count, so the total payroll estimate should move upward.
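The same arithmetic as a short Python check, using only the numbers given in the example (the variable names are illustrative):

```python
T_y_ht = 4_550_000.0   # Horvitz-Thompson payroll estimate, USD
X_known = 1000.0       # register total of employees
T_x_ht = 920.0         # Horvitz-Thompson estimate of employees
beta_hat = 4200.0      # fitted slope, USD payroll per employee

correction = (X_known - T_x_ht) * beta_hat  # regression correction, USD
T_y_greg = T_y_ht + correction              # GREG payroll estimate, USD
print(correction, T_y_greg)
```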

When GREG Helps

GREG is most useful when three conditions line up:

  1. the auxiliary totals are known and trusted
  2. the auxiliary variables are strongly related to the study variable
  3. the weighted sample misses those auxiliary totals by enough to matter

If the sample already matches the auxiliary totals well, the correction is small. If the auxiliary variables are weak predictors, the correction adds noise with little payoff.

Common Confusions

Watch Out

GREG is not just weighted least squares

Weighted regression produces $\hat{\beta}$, but GREG is the full estimator $\hat{T}_{y,\mathrm{HT}} + (X - \hat{T}_{x,\mathrm{HT}})^\top \hat{\beta}$. The known population total of $x$ is essential.

Watch Out

A good predictive model does not automatically imply a good GREG estimator

If the auxiliary total $X$ is wrong, the sample design is highly irregular, or the weighted regression is unstable, the correction can hurt. Prediction and finite-population estimation are related, but they are not identical problems.

Watch Out

GREG is not fully model-based

The working regression model improves efficiency, but the estimator is still judged under the sampling design. That is why GREG belongs to the model-assisted, not purely model-based, camp.

Summary

  • GREG is the standard model-assisted survey estimator
  • It equals Horvitz-Thompson plus a regression correction using known auxiliary totals
  • Under quadratic calibration, the calibrated estimator is exactly GREG
  • GREG is attractive because it keeps the design-based backbone while gaining efficiency from auxiliary information
  • It helps most when the auxiliary totals are trusted and the auxiliary variables predict the study variable well

Exercises

Exercise (Core)

Problem

The Horvitz-Thompson estimate of a study total is 1200. The known population total of auxiliary variable $x$ is $X = 500$, while the Horvitz-Thompson estimate of that total is 470. The fitted regression slope is $\hat{\beta} = 3$. Compute the GREG estimate.

Exercise (Advanced)

Problem

Why can GREG reduce variance even when the weighted sample regression model is not exactly correct?

References

Canonical:

  • Särndal, Swensson, and Wretman, Model Assisted Survey Sampling (1992), Chapters 5-7.
  • Deville and Särndal, "Calibration Estimators in Survey Sampling," Journal of the American Statistical Association 87(418), 376-382 (1992).
  • Cochran, Sampling Techniques, 3rd ed. (1977), Chapter 7 on ratio and regression estimators.

Interpretive and operational:

  • Statistics Canada, "Optimal calibration weights under unit nonresponse in survey sampling," Section 2.1 on calibration estimation (2019).
  • Lumley, Complex Surveys: A Guide to Analysis Using R (2010), Chapter 7 on calibration and regression estimators.
  • Valliant, Dever, and Kreuter, Practical Tools for Designing and Weighting Survey Samples (2018), Chapters 9-10.

Last reviewed: April 18, 2026
