History of Artificial Intelligence
Five recurring questions drive AI history: representation, learning, search, uncertainty, and tractability. From Turing 1936 through the transformer era, every advance answered at least one of these questions differently.
Why This Matters
Most AI history is told as a sequence of breakthroughs and winters. That framing obscures the actual structure. The same five questions recur in every era: How should knowledge be represented? How should a system learn from data? How should it search through possibilities? How should it handle uncertainty? How do you make any of this tractable?
Every major advance in AI is a new answer to at least one of these questions. Understanding the history this way prevents you from thinking that the current approach (large neural networks trained on internet data) is the final answer. It also prevents the opposite error: dismissing scale because previous hype cycles ended badly.
The Five Recurring Questions
- Representation: What data structures encode knowledge? Logic formulas, feature vectors, distributed representations, attention patterns.
- Learning: How does the system improve from experience? Hand-coded rules, perceptron updates, backpropagation, self-supervised pretraining.
- Search and optimization: How does the system find good solutions in large spaces? Tree search, gradient descent, evolutionary methods, beam search.
- Uncertainty: How does the system reason about what it does not know? Boolean logic, fuzzy sets, probability, Bayesian networks, calibration.
- Tractability: How do you make the above work at useful scale? Approximation, parallelism, sparsity, hardware co-design.
Every era below can be understood as a shift in the dominant answer to one or more of these questions.
Era 1: Computability and the Idea of Machine Thought (1936-1950)
Turing's 1936 paper on computable numbers established that a single machine can compute anything that is computable, given enough time and tape. This is the mathematical precondition for AI: if intelligence involves computation, then a universal machine can in principle replicate it.
McCulloch and Pitts (1943) showed that networks of idealized binary neurons can compute any Boolean function. This was a representation result: neural activity could be described in the language of propositional logic.
Hebb (1949) proposed that synaptic connections strengthen when pre-synaptic and post-synaptic neurons fire together. This was a learning result: "neurons that fire together wire together." Hebb's rule is unsupervised and local. It requires no teacher signal and no global error computation.
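Hebb's rule can be stated in a few lines. The sketch below (not from Hebb; a modern vector formulation) updates each weight by the product of pre-synaptic input and post-synaptic output, using only locally available quantities. Note that the raw rule has no teacher signal and also no damping, so weights grow without bound unless a normalization term is added.

```python
import numpy as np

def hebb_update(w, x, lr=0.1):
    # Post-synaptic activity is the neuron's own output: y = w . x.
    y = w @ x
    # Hebb's rule: strengthen each weight in proportion to
    # (pre-synaptic activity) * (post-synaptic activity).
    # The update is local and needs no error signal or teacher.
    return w + lr * y * x

w = np.array([0.1, 0.1])          # small nonzero start (all-zero weights never move)
x = np.array([1.0, 0.5])          # a repeatedly presented input pattern
for _ in range(5):
    w = hebb_update(w, x)
```

After repeated presentations, the weight on the more active input grows fastest, which is the "fire together, wire together" behavior in miniature.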
Era 2: The Founding Coalition (1950-1960)
The Dartmouth workshop (1956) is conventionally called the founding of AI. The key participants (McCarthy, Minsky, Rochester, Shannon) had diverse views. McCarthy favored logic and search. Minsky was interested in neural networks at this point (his Princeton thesis was on neural net learning). Shannon contributed information theory. Rochester had built a hardware neural network simulator at IBM.
The founding coalition was not purely symbolic. The symbolic-vs-connectionist divide came later.
Rosenblatt's Perceptron (1957) was a hardware learning machine that could classify patterns by adjusting weights. The perceptron convergence theorem states: if the training data is linearly separable, the algorithm finds a separating hyperplane in finite steps.
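The learning rule behind the convergence theorem is simple enough to state in full. This is a standard modern rendering of Rosenblatt's algorithm (the variable names and the OR example are illustrative, not from the 1957 paper): on each mistake, move the weight vector toward the misclassified example.

```python
import numpy as np

def perceptron_train(X, y, max_epochs=100):
    # Rosenblatt's rule: on a mistake, add (label * input) to the weights.
    # For linearly separable data the convergence theorem guarantees
    # this loop finds a separating hyperplane in finitely many updates.
    w = np.zeros(X.shape[1] + 1)                 # last entry acts as the bias
    Xb = np.hstack([X, np.ones((len(X), 1))])    # append a constant-1 feature
    for _ in range(max_epochs):
        errors = 0
        for xi, yi in zip(Xb, y):                # labels in {-1, +1}
            if yi * (w @ xi) <= 0:               # misclassified (or on boundary)
                w += yi * xi
                errors += 1
        if errors == 0:
            return w                             # perfect separation reached
    return w

# OR is linearly separable, so training converges.
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
y = np.array([-1, 1, 1, 1])
w = perceptron_train(X, y)
```

Run on XOR labels instead, the inner loop never reaches zero errors, which is exactly the limitation Minsky and Papert formalized (Era 4 below).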
Era 3: Symbolic Problem Solving (1956-1974)
Newell and Simon's Logic Theorist (1956) and General Problem Solver (1957) solved problems by searching through a space of symbolic states. Their Physical Symbol System Hypothesis claimed that symbol manipulation is both necessary and sufficient for general intelligence. This answered the representation question with: symbols and rules.
McCarthy created Lisp (1958), a programming language where code and data share the same structure. Lisp became the lingua franca of AI research for three decades.
Samuel's checkers program (1959) used self-play and temporal difference methods. This was an early instance of reinforcement learning, though the term came later.
Era 4: The Perceptron Controversy (1969)
Perceptron Limitation (Minsky-Papert 1969)
Statement
A single-layer perceptron cannot compute the XOR function, or more generally any function whose value depends on conjunctions of input features that are not supplied as explicit inputs. More precisely, certain predicates (such as connectedness of a figure on a retina) require perceptrons of unbounded order.
Intuition
A single-layer perceptron computes a linear threshold function. XOR is not linearly separable in 2D. Minsky and Papert proved a more general result: certain geometric predicates require looking at all inputs simultaneously, which a bounded-order perceptron cannot do.
Proof Sketch
The proof uses a group-invariance argument. If a predicate is invariant under a group of permutations of the input, then any perceptron computing it must have order at least as large as the minimal support of the invariance group. For connectedness, the required order grows without bound as the input retina grows, so no bounded-order perceptron suffices.
Why It Matters
This result is commonly mischaracterized. Minsky and Papert did not prove that neural networks are useless. They proved limitations of single-layer perceptrons. They noted explicitly that multi-layer networks might overcome these limits, but nobody knew how to train them at the time.
Failure Mode
The theorem says nothing about multi-layer networks with hidden units. A network with a single hidden layer of two units can compute XOR. The missing piece was not theory but a training algorithm for hidden layers.
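The construction is concrete: one hidden unit acts as OR, the other as AND, and the output fires when OR is on but AND is not. The specific weights below are one choice among many (any weights realizing the same thresholds work).

```python
def step(z):
    # Hard threshold activation, as in the classical perceptron.
    return 1 if z > 0 else 0

def xor_net(x1, x2):
    # Hidden layer: an OR-like unit and an AND-like unit.
    h_or = step(x1 + x2 - 0.5)     # fires unless both inputs are 0
    h_and = step(x1 + x2 - 1.5)    # fires only when both inputs are 1
    # Output: OR, vetoed strongly by AND, gives "exactly one input on".
    return step(h_or - 2 * h_and - 0.5)
```

The hidden layer re-represents the inputs so that the final decision becomes linearly separable, which is precisely what a single layer cannot do on its own.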
The effect of Minsky and Papert's book was a reduction in neural network funding, but the technical contribution was honest: single-layer networks have real limitations, and multi-layer training was an open problem.
Era 5: Expert Systems (1974-1990)
DENDRAL (Feigenbaum, Lederberg, 1965) inferred molecular structure from mass spectrometry data using hand-coded chemical rules. MYCIN (Shortliffe, 1976) diagnosed bacterial infections using production rules with certainty factors.
Expert systems answered the representation question with: if-then rules elicited from domain experts. They answered uncertainty with: ad hoc certainty factors (not probability).
The CYC project (Lenat, 1984) attempted to encode all common-sense knowledge as logical assertions. After decades of work, CYC demonstrated that hand-coding knowledge does not scale. The number of edge cases in common sense appears to grow faster than any team can write rules.
AI winters were not purely technical
The first AI winter (1974) and the second (1987-1993) were driven by funding disappointment as much as by technical failure. Expert systems worked well in narrow domains. The problem was that advocates promised generality and delivered brittleness. The gap between promise and delivery caused funders to withdraw.
Era 6: The Statistical and Probabilistic Reform (1985-2005)
Hidden Markov Models brought probability to speech recognition (Jelinek at IBM, 1970s-1980s). Bayesian networks (Pearl, 1988) provided a representation for probabilistic reasoning that could be computed tractably for sparse dependency structures.
This era answered the uncertainty question with: probability theory, not certainty factors. It answered representation with: directed graphical models. It answered tractability with: conditional independence and message passing.
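The tractability gain from conditional independence shows up even in a toy example. For a chain-structured network A → B → C, the joint factorizes as P(a, b, c) = P(a) P(b | a) P(c | b), so marginals can be computed by passing small local messages instead of summing over the full joint table. The CPT values below are hypothetical, chosen only for illustration.

```python
# Chain-structured Bayesian network A -> B -> C over binary variables.
# Hypothetical conditional probability tables (CPTs):
p_a = {0: 0.7, 1: 0.3}                      # P(A)
p_b_given_a = {0: {0: 0.9, 1: 0.1},         # P(B | A=0)
               1: {0: 0.2, 1: 0.8}}         # P(B | A=1)
p_c_given_b = {0: {0: 0.8, 1: 0.2},         # P(C | B=0)
               1: {0: 0.3, 1: 0.7}}         # P(C | B=1)

# Message passing along the chain: absorb A into a message over B,
# then absorb B into the marginal over C. Each step sums a local
# factor, never the exponentially large joint.
msg_b = {b: sum(p_a[a] * p_b_given_a[a][b] for a in (0, 1)) for b in (0, 1)}
p_c = {c: sum(msg_b[b] * p_c_given_b[b][c] for b in (0, 1)) for c in (0, 1)}
```

On a chain of n binary variables this costs O(n) small sums, where naive enumeration of the joint costs O(2^n); that gap is the sense in which sparse dependency structure makes probabilistic reasoning tractable.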
Pearl's contribution was not just Bayesian networks but a principled framework for causal reasoning that went beyond statistical association.
Statistics did not replace AI
A common misconception is that the statistical turn was a rejection of AI. It was the opposite: AI researchers adopted statistical methods to handle uncertainty rigorously instead of with ad hoc certainty factors. The researchers were often the same people.
Era 7: The Connectionist Return (1986-2012)
Rumelhart, Hinton, and Williams (1986) popularized backpropagation for training multi-layer networks. Backpropagation had been discovered multiple times before (Werbos 1974, Linnainmaa 1970), but the 1986 paper demonstrated it on problems that mattered.
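What the 1986 paper supplied was exactly the missing training algorithm from Era 4. A minimal sketch, training a one-hidden-layer sigmoid network on XOR with full-batch gradient descent (hyperparameters and hidden size are illustrative choices, not from the paper):

```python
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# XOR: the task single-layer perceptrons cannot learn.
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
y = np.array([[0.0], [1.0], [1.0], [0.0]])

# Two hidden units suffice in principle; a few extra units make
# plain gradient descent converge reliably from random init.
W1 = rng.normal(0, 1, (2, 8)); b1 = np.zeros(8)
W2 = rng.normal(0, 1, (8, 1)); b2 = np.zeros(1)

lr = 1.0
for _ in range(10000):
    # Forward pass.
    h = sigmoid(X @ W1 + b1)
    out = sigmoid(h @ W2 + b2)
    # Backward pass: the chain rule applied layer by layer
    # (squared-error loss combined with the sigmoid derivative).
    d_out = (out - y) * out * (1 - out)
    d_h = (d_out @ W2.T) * h * (1 - h)
    # Gradient descent step on both layers.
    W2 -= lr * h.T @ d_out; b2 -= lr * d_out.sum(axis=0)
    W1 -= lr * X.T @ d_h;   b1 -= lr * d_h.sum(axis=0)
```

The backward pass is the whole contribution: errors at the output are propagated through the weights to assign credit to hidden units, which Rosenblatt's rule had no way to do.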
LeCun (1989) applied convolutional networks and backpropagation to handwritten digit recognition. Hochreiter and Schmidhuber (1997) introduced LSTMs to address vanishing gradients in recurrent networks.
Bengio, LeCun, and Hinton persisted through a period (roughly 1995-2010) when neural network research was unfashionable. SVMs and kernel methods dominated machine learning conferences. The vindication came when increased compute and data made deep networks competitive.
Era 8: The Scale Era (2012-Present)
AlexNet (Krizhevsky, Sutskever, Hinton, 2012) won ImageNet by a large margin using a GPU-trained convolutional network. This was not a fundamentally new architecture: it was essentially LeNet with more layers, more data (ImageNet), GPU training, and refinements such as ReLU activations and dropout.
The Transformer (Vaswani et al., 2017) replaced recurrence with self-attention, enabling parallel training over sequences. This architecture change made it tractable to train on billions of tokens.
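The core operation is scaled dot-product attention, Attention(Q, K, V) = softmax(QKᵀ/√d_k)V from the paper. A NumPy sketch (shapes and example values are illustrative):

```python
import numpy as np

def scaled_dot_product_attention(Q, K, V):
    # Attention(Q, K, V) = softmax(Q K^T / sqrt(d_k)) V  (Vaswani et al., 2017).
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)                 # (seq_q, seq_k) similarities
    scores -= scores.max(axis=-1, keepdims=True)    # subtract max for stability
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)  # softmax over key positions
    return weights @ V, weights                     # weighted sum of values

rng = np.random.default_rng(0)
Q = rng.normal(size=(3, 8))     # 3 query positions, dimension 8
K = rng.normal(size=(5, 8))     # 5 key positions
V = rng.normal(size=(5, 8))
out, attn = scaled_dot_product_attention(Q, K, V)
```

Every output position attends to every input position in one matrix multiply, which is why the computation parallelizes over the sequence where a recurrent network must proceed step by step.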
GPT-2 (2019) and GPT-3 (2020) demonstrated that language models scaled to billions of parameters exhibit emergent capabilities not present at smaller scales. Scaling laws (Kaplan et al., 2020) quantified the relationship between compute, data, parameters, and loss.
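The scaling laws have a simple functional form. With data and compute unconstrained, Kaplan et al. fit loss as a power law in parameter count N, roughly L(N) = (N_c / N)^α_N. The constants below are quoted from memory of the paper and should be treated as approximate:

```python
# Approximate fitted constants from Kaplan et al. (2020):
# alpha_N ~ 0.076, N_c ~ 8.8e13 parameters. Illustrative, not exact.
def loss_from_params(n_params, alpha_n=0.076, n_c=8.8e13):
    # Power-law loss as a function of model size alone.
    return (n_c / n_params) ** alpha_n
```

The power law implies smooth, predictable returns: multiplying parameters by 10 multiplies loss by 10^(-α_N), roughly a 16% reduction, which is what made planning large training runs feasible.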
Deep learning did not make structure irrelevant
The Transformer is highly structured: it uses positional encodings, multi-head attention with learned projections, layer normalization, and residual connections. The claim that deep learning replaces feature engineering with end-to-end learning is half true. It replaces input feature engineering but introduces architectural structure that encodes inductive biases about sequence processing.
What History Gets Wrong
Myth 1: AI began as purely symbolic. The founding coalition included neural network researchers. Minsky's thesis was about neural nets. The symbolic-vs-connectionist framing became dominant later.
Myth 2: Minsky proved neural networks impossible. Minsky and Papert proved limitations of single-layer perceptrons. They acknowledged that multi-layer networks might work, but training them was an open problem.
Myth 3: Deep learning made structure irrelevant. Modern architectures are heavily structured. The inductive biases in Transformers (attention over all positions, residual connections) are design choices, not emergent properties.
Myth 4: Current methods are the endpoint. Grokking (Power et al., 2022) showed that neural networks can suddenly generalize long after memorizing training data. Training dynamics still produce surprises. The history of AI is a history of confident predictions about what works, followed by corrections.
Grokking as a Modern Reminder
Grokking is the phenomenon where a network trained on a small dataset first memorizes the training data (zero training loss, random test performance), then after much further training, suddenly achieves perfect generalization. This was observed on modular arithmetic tasks.
Grokking is a reminder that our understanding of training dynamics is incomplete. Classical learning theory predicts that a model that memorizes should not later generalize. The explanation appears to involve a phase transition where the network finds a simpler representation after initially relying on memorization. This connects back to the recurring question of representation: the network discovers a better internal representation given enough training time.
Summary
- Five questions recur across all eras: representation, learning, search, uncertainty, tractability
- The symbolic-connectionist divide was not present at the founding
- AI winters were driven by expectation gaps, not purely by technical failure
- The statistical reform brought principled uncertainty handling
- Scale (data + compute + architecture) changed what is tractable, but did not eliminate the need for structure
- Training dynamics still surprise us (grokking, double descent)
Exercises
Problem
For each of the five recurring questions (representation, learning, search, uncertainty, tractability), name the dominant answer in the expert systems era (1974-1990) and the scale era (2012-present).
Problem
Minsky and Papert showed single-layer perceptrons cannot compute XOR. Construct a two-layer network (with exact weights) that computes XOR on binary inputs x1, x2 ∈ {0, 1}.
References
Canonical:
- Turing, "On Computable Numbers" (1936)
- McCulloch & Pitts, "A Logical Calculus of the Ideas Immanent in Nervous Activity" (1943)
- Minsky & Papert, Perceptrons (1969), Chapters 1-5, 11-13
- Rumelhart, Hinton & Williams, "Learning Representations by Back-propagating Errors" (1986)
Historical surveys:
- Nilsson, The Quest for Artificial Intelligence (2010), Chapters 1-20
- Crevier, AI: The Tumultuous History of the Search for Artificial Intelligence (1993)
- Russell & Norvig, Artificial Intelligence: A Modern Approach (4th ed., 2021), Chapter 1
Current:
- Vaswani et al., "Attention Is All You Need" (2017)
- Kaplan et al., "Scaling Laws for Neural Language Models" (2020)
- Power et al., "Grokking: Generalization Beyond Overfitting on Small Algorithmic Datasets" (2022)
Next Topics
- Model timeline: chronological reference of specific architectures
- Scaling laws: quantitative relationships between compute, data, and loss
Last reviewed: April 2026