
Model Timeline

GPT Series Evolution

The progression from GPT-1 (117M) to GPT-4: how each generation revealed new capabilities through scale, training methodology, and alignment techniques.


Why This Matters

The GPT series is the clearest empirical record of what happens when you scale autoregressive language models. Each generation did not just improve benchmarks; it revealed qualitatively new capabilities. GPT-1 showed pretraining helps. GPT-2 showed zero-shot generation is possible. GPT-3 showed in-context learning emerges at scale. InstructGPT/ChatGPT showed RLHF makes models useful. GPT-4 showed multimodal reasoning at unprecedented quality. Understanding this progression grounds abstract discussions about scaling laws in concrete observations.

GPT-1 (2018): Pretraining Works for Generation

Architecture. Decoder-only transformer, 12 layers, 768 hidden dimension, 12 attention heads. 117M parameters.

Training. Autoregressive language modeling on BooksCorpus (~7,000 unpublished books, ~800M words). Standard next-token prediction objective.
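The next-token prediction objective is just an average cross-entropy over positions. A minimal sketch in plain Python with toy logits (illustrative values, not a real model):

```python
import math

def next_token_loss(logits, targets):
    """Average cross-entropy for next-token prediction.

    logits: per-position lists of raw scores over the vocabulary.
    targets: the correct next-token id at each position.
    """
    total = 0.0
    for scores, target in zip(logits, targets):
        # log-sum-exp (numerically stable softmax normalizer),
        # then negative log-likelihood of the target token
        m = max(scores)
        log_z = m + math.log(sum(math.exp(s - m) for s in scores))
        total += log_z - scores[target]
    return total / len(targets)

# Toy vocabulary of 4 tokens, two positions.
logits = [[2.0, 0.5, 0.1, -1.0],   # model favors token 0
          [0.0, 0.0, 3.0, 0.0]]    # model favors token 2
targets = [0, 2]
loss = next_token_loss(logits, targets)
```

Pretraining drives this quantity down over hundreds of millions of positions; everything else in the GPT series builds on this one objective.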

Key result. After pretraining, fine-tuning on downstream tasks (textual entailment, question answering, sentiment) improved over training from scratch. This was the decoder-only counterpart of the result BERT would establish for encoders a few months later.

What it taught us. Unsupervised pretraining with next-token prediction learns representations that transfer to supervised tasks. The decoder-only architecture is viable for understanding tasks, not just generation.

Limitations. 117M parameters is small. The model required task-specific fine-tuning and could not perform tasks zero-shot. Performance was good but not state-of-the-art on most benchmarks (BERT, released months later, overtook GPT-1 on understanding tasks).

GPT-2 (2019): Zero-Shot Generation

Architecture. Same decoder-only transformer design (with layer normalization moved to the input of each sub-block), scaled to 1.5B parameters (GPT-2 XL). 48 layers, 1600 hidden dimension, 25 attention heads.

Training. WebText dataset: 8 million web pages (40GB of text) curated by following outbound links from Reddit posts with at least 3 karma. No fine-tuning on downstream tasks.

Key result. GPT-2 performed tasks zero-shot by framing them as text completion. Translation: provide an English sentence followed by "French:" and the model completes with a French translation. Summarization: provide an article followed by "TL;DR:" and the model generates a summary.
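The framing above is purely textual: the "task specification" is just a cue appended to the input. A sketch of what such prompts look like (the formats are illustrative; GPT-2's actual evaluations used similar cues such as "TL;DR:"):

```python
# Zero-shot tasks framed as plain text completion. The model is never
# told it is translating or summarizing; the cue at the end of the
# prompt steers the continuation.

def translation_prompt(english: str) -> str:
    """Cue the model to continue with a French translation."""
    return f"{english}\nFrench:"

def summarization_prompt(article: str) -> str:
    """Cue the model to continue with a summary."""
    return f"{article}\nTL;DR:"

p = summarization_prompt("A long news article about transformers...")
```

Sampling a continuation of `p` from the language model is the entire "inference procedure"; there is no task-specific head or fine-tuning.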

What it taught us. A language model trained on diverse text implicitly learns to perform many tasks. Scaling from 117M to 1.5B parameters improved zero-shot performance substantially. The model memorized some training data, raising concerns about privacy and misinformation.

The "too dangerous to release" claim. OpenAI initially released only the 124M version, citing concerns about misuse. The full 1.5B model was released 9 months later. In retrospect, the capabilities were modest by current standards, and the staged release was more about establishing norms than preventing actual harm.

GPT-3 (2020): In-Context Learning

Architecture. 175B parameters. 96 layers, 12288 hidden dimension, 96 attention heads. Context length: 2048 tokens.
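These architecture numbers roughly determine the parameter count. A back-of-the-envelope check using the standard ~12·L·d² approximation for a decoder-only transformer (4d² for the attention projections plus 8d² for the MLP per layer; embeddings, layer norms, and biases are ignored):

```python
def approx_params(n_layers: int, d_model: int) -> int:
    """Rough decoder-only transformer parameter count: ~12 * L * d^2.
    Per layer: 4*d^2 attention projection weights (Q, K, V, output)
    plus 8*d^2 MLP weights (d -> 4d -> d)."""
    return 12 * n_layers * d_model ** 2

gpt2_xl = approx_params(48, 1600)    # ~1.5e9, matching GPT-2 XL
gpt3 = approx_params(96, 12288)      # ~1.74e11, matching GPT-3's 175B
```

The approximation lands within a few percent of the reported sizes for both models, which is a useful sanity check when reading architecture tables.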

Training. Mix of Common Crawl (filtered), WebText2, Books1, Books2, and Wikipedia. ~300B tokens total, with data weighted by quality (Wikipedia upweighted, Common Crawl downweighted).

Key result. In-context learning: GPT-3 can perform new tasks by conditioning on a few examples in the prompt, without any gradient updates. Provide 3 examples of English-to-French translation in the prompt, then a new English sentence, and GPT-3 translates it. This was not trained for; it emerged from scale.

Definition

In-Context Learning (ICL)

In-context learning is the ability of a language model to learn a task from examples provided in the prompt at inference time, with no parameter updates. Given prompt examples $(x_1, y_1), \ldots, (x_k, y_k)$ and a new input $x_{k+1}$, the model generates $y_{k+1}$ by conditioning on the full prompt. The "learning" happens through the forward-pass attention mechanism, not through gradient descent.

Few-shot vs. zero-shot vs. one-shot. GPT-3 was evaluated in three settings: zero-shot (task description only), one-shot (one example), and few-shot (multiple examples). Few-shot performance on many tasks approached or exceeded fine-tuned BERT models, without any fine-tuning.
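The three settings differ only in how the prompt is assembled. A sketch of GPT-3-style prompt construction (the exact formats in the paper vary by task; this translation format and these example pairs are illustrative):

```python
# Hypothetical English-French demonstration pairs.
EXAMPLES = [("cheese", "fromage"), ("cat", "chat"), ("sea", "mer")]

def build_prompt(query: str, k: int) -> str:
    """k = 0, 1, or more gives zero-shot, one-shot, or few-shot.
    Demonstrations and the new query share one format, so the model
    can infer the task from the pattern alone."""
    blocks = [f"English: {en}\nFrench: {fr}" for en, fr in EXAMPLES[:k]]
    blocks.append(f"English: {query}\nFrench:")
    header = "Translate English to French.\n\n"
    return header + "\n\n".join(blocks)

few_shot = build_prompt("dog", 2)
```

The model's completion of the trailing "French:" is the prediction; no weights change between k = 0 and k = 3.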

What it taught us. Scale enables qualitative capability jumps. In-context learning is barely present at 1.5B parameters (GPT-2 shows only weak ICL) but is strong at 175B. This is an empirical observation; there is no clean theory explaining why ICL emerges.

Main Theorems

Proposition

In-Context Learning as Implicit Bayesian Inference

Statement

A sufficiently large transformer pretrained on a mixture of tasks can perform approximate Bayesian inference over task identity given in-context examples. Concretely, if the pretraining distribution is a mixture $p(x) = \sum_t p(t)\, p_t(x)$ over tasks $t$, then conditioning on examples $(x_1, y_1), \ldots, (x_k, y_k)$ in the prompt approximates:

$$p(y_{k+1} \mid x_{k+1}, x_{1:k}, y_{1:k}) \approx \sum_t p(t \mid x_{1:k}, y_{1:k}) \cdot p_t(y_{k+1} \mid x_{k+1})$$

The model implicitly infers which task the examples come from and applies that task's input-output mapping.
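This inference can be made concrete in a toy mixture. The sketch below uses two hypothetical binary "tasks" (identity and negation) standing in for the pretraining mixture; it is a stylized illustration of the proposition, not a claim about transformer internals:

```python
# Two latent tasks over binary inputs: identity (y = x) and negation
# (y = 1 - x). Labels are correct with probability 0.9 (slight noise).
P_CORRECT = 0.9

def task_likelihood(task: str, x: int, y: int) -> float:
    """p(y | x, task) for task in {"identity", "negation"}."""
    target = x if task == "identity" else 1 - x
    return P_CORRECT if y == target else 1 - P_CORRECT

def posterior(examples, prior: float = 0.5) -> float:
    """p(task = identity | in-context examples) via Bayes' rule."""
    w_id, w_neg = prior, 1 - prior
    for x, y in examples:
        w_id *= task_likelihood("identity", x, y)
        w_neg *= task_likelihood("negation", x, y)
    return w_id / (w_id + w_neg)

def predict(examples, x_new: int) -> float:
    """Mixture prediction p(y = 1 | x_new, examples): each task's
    prediction weighted by its posterior probability."""
    p_id = posterior(examples)
    return (p_id * task_likelihood("identity", x_new, 1)
            + (1 - p_id) * task_likelihood("negation", x_new, 1))

one_shot = posterior([(0, 0)])           # 0.9
two_shot = posterior([(0, 0), (1, 1)])   # ~0.988: the posterior sharpens
```

More identity-consistent examples push the posterior toward 1, and the mixture prediction correspondingly converges to the identity task's mapping, mirroring why few-shot ICL beats one-shot.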

Intuition

The pretraining corpus contains many implicit "tasks" (translation passages, Q&A pairs, code with comments). When in-context examples look like a particular task, the transformer's attention mechanism routes computation through pathways learned for that task. More examples sharpen the posterior over task identity.

Proof Sketch

Xie et al. (2022) showed that transformers trained on sequences generated by hidden Markov models can perform exact Bayesian prediction in-context. The key condition is that pretraining data contains latent structure (different "concepts" or "tasks") and the transformer has enough capacity to represent the posterior. This is a stylized result; real-world ICL likely involves additional mechanisms.

Why It Matters

This framework explains why ICL improves with more in-context examples (more evidence for task identification), why it fails with out-of-distribution tasks (no matching task in the pretraining mixture), and why larger models are better at ICL (more capacity to represent the task mixture).

Failure Mode

ICL fails when (1) the task was not represented in pretraining data, (2) the prompt format is unfamiliar, (3) the task requires reasoning beyond what the model learned, or (4) the examples are misleading or inconsistent. ICL is also sensitive to example ordering and formatting, which true Bayesian inference would not be.

InstructGPT and ChatGPT (2022): Alignment via RLHF

The problem. GPT-3 is powerful but difficult to use. It completes text, not instructions. Ask it a question and it may generate more questions instead of answering. It can produce toxic, biased, or factually wrong text because it was trained to predict internet text, which contains all of these.

The solution. Reinforcement Learning from Human Feedback (RLHF), applied in three stages:

  1. Supervised fine-tuning (SFT). Collect human demonstrations of desired behavior (following instructions, answering questions helpfully). Fine-tune GPT-3 on these demonstrations.

  2. Reward model training. Collect human preferences (which of two responses is better). Train a reward model to predict human preferences.

  3. RL optimization. Use PPO (Proximal Policy Optimization) to fine-tune the SFT model to maximize the reward model's score while staying close to the SFT model (KL penalty).
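The objectives in stages 2 and 3 can be sketched as scalar functions. This is a minimal illustration only: the actual InstructGPT training operates over batches of token sequences, and the KL coefficient here is an arbitrary placeholder value:

```python
import math

def preference_loss(r_chosen: float, r_rejected: float) -> float:
    """Stage 2: Bradley-Terry reward-model loss,
    -log sigmoid(r_chosen - r_rejected). Minimized when the reward
    model scores the human-preferred response higher."""
    return -math.log(1.0 / (1.0 + math.exp(-(r_chosen - r_rejected))))

def rlhf_objective(reward: float, logp_policy: float,
                   logp_sft: float, beta: float = 0.02) -> float:
    """Stage 3: reward-model score minus a KL penalty that keeps the
    policy close to the SFT model (beta is a placeholder)."""
    return reward - beta * (logp_policy - logp_sft)
```

The KL term is what prevents the policy from drifting into degenerate text that happens to score well under the reward model (reward hacking).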

Key result. InstructGPT (1.3B parameters with RLHF) was preferred by humans over GPT-3 (175B) in 85% of comparisons. Alignment techniques matter more than raw scale for producing useful outputs.

What it taught us. The pretraining objective (next-token prediction) optimizes for predicting internet text, not for being helpful or truthful. RLHF bridges this gap. ChatGPT applied this recipe to GPT-3.5 and demonstrated that aligned language models have broad commercial viability.

GPT-4 (2023): Multimodal Reasoning

Known details. Accepts both text and image inputs. Significantly better reasoning (bar exam: ~90th percentile, up from ~10th for GPT-3.5). Context length: 8K and 32K token variants.

Unknown details. OpenAI did not disclose architecture size, training data, or training compute. Rumored to be a mixture-of-experts architecture, but not confirmed.

What it taught us. Multimodal input (processing images alongside text) extends the capabilities significantly. The gains from GPT-3.5 to GPT-4 on reasoning benchmarks suggest either substantial scaling, architectural improvements, better training data, or some combination. Without disclosure, the ML community cannot determine which factors contributed most.

Common Confusions

Watch Out

GPT-3 was not fine-tuned for in-context learning

In-context learning was not an explicit training objective. GPT-3 was trained purely on next-token prediction. ICL is an emergent behavior that becomes stronger with scale. This is one of the most surprising findings in modern ML: capabilities appear without being explicitly trained for.

Watch Out

RLHF does not teach new knowledge

RLHF adjusts which knowledge the model expresses and how it expresses it. It does not add new factual knowledge. A model that does not know a fact after pretraining still does not know it after RLHF. RLHF makes the model more likely to admit ignorance instead of confabulating, but the underlying knowledge base is set during pretraining.

Watch Out

Scale alone does not explain GPT-4

While GPT-4 is likely larger than GPT-3, OpenAI also improved training data quality, training methodology, and post-training alignment. Attributing GPT-4's improvements solely to parameter count is speculation. The lack of technical disclosure makes it impossible to isolate the contribution of each factor.

Canonical Examples

Example

In-context learning for sentiment classification

Prompt to GPT-3: "Review: This movie was terrible. Sentiment: Negative. Review: I loved every minute of it. Sentiment: Positive. Review: The acting was mediocre but the plot was engaging. Sentiment:" GPT-3 completes with "Positive" or "Mixed" depending on the model size and sampling. No gradient updates are performed. The model infers the task from the examples and applies it to the new input.

Exercises

ExerciseCore

Problem

GPT-1 had 117M parameters and was trained on ~800M words. GPT-3 had 175B parameters and was trained on ~300B tokens. Compute the ratio of parameters and training tokens between the two models. What does this suggest about whether GPT-3 was "compute-optimal" according to the Chinchilla scaling law (tokens should scale linearly with parameters)?

ExerciseAdvanced

Problem

Explain why in-context learning is sensitive to the order and format of examples in the prompt, while a fine-tuned model is not sensitive to the order of training examples. What does this imply about the mechanism underlying ICL?

References

Canonical:

  • Radford et al., "Improving Language Understanding by Generative Pre-Training" (2018), GPT-1
  • Radford et al., "Language Models are Unsupervised Multitask Learners" (2019), GPT-2
  • Brown et al., "Language Models are Few-Shot Learners" (2020), GPT-3, NeurIPS

Current:

  • Ouyang et al., "Training Language Models to Follow Instructions with Human Feedback" (2022), InstructGPT
  • OpenAI, "GPT-4 Technical Report" (2023)
  • Xie et al., "An Explanation of In-context Learning as Implicit Bayesian Inference" (2022), ICLR

Next Topics

  • Open weight models (LLaMA): the open-source response to closed GPT models
  • Post-training overview: RLHF, DPO, and instruction tuning in detail

Last reviewed: April 2026
