Model Timeline
Gemini and Google Models
Google's model lineage from PaLM through Gemini 2.0: native multimodality, extreme long context, TPU infrastructure, and the Gemma open-weight series.
Why This Matters
Google invented the transformer (Vaswani et al., 2017) and has trained some of the largest models in the field. The Gemini family is distinguished by two capabilities: native multimodality (trained jointly on text, images, audio, and video from the start, not retrofitted) and extreme long context (1M+ tokens). Google also controls the TPU hardware stack used for training, giving it vertical integration that other labs lack.
Pre-Gemini: PaLM and PaLM 2
PaLM (April 2022)
Parameters. 540 billion (dense). Training. 780 billion tokens on 6144 TPU v4 chips.
Key contributions. Demonstrated breakthrough performance on reasoning tasks, especially with chain-of-thought prompting. Showed "emergent abilities" at scale: tasks where performance was near-random for smaller models but jumped sharply at 540B parameters. These included multi-step arithmetic, logical deduction, and some forms of common-sense reasoning.
Caveat on emergence. Later work (Schaeffer et al., 2023) argued that many "emergent" abilities are artifacts of non-linear evaluation metrics rather than genuine phase transitions. The claim of sharp emergence remains debated.
PaLM 2 (May 2023)
Improved multilingual performance, reasoning, and coding over PaLM. Used a "compute-optimal" training approach (more tokens relative to parameters, following the Chinchilla scaling insight). Exact parameter count not disclosed but believed to be smaller than PaLM-540B while performing better. Powered the initial versions of Google Bard (later renamed Gemini).
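The Chinchilla insight referenced above can be made concrete with the rough rule of thumb of about 20 training tokens per parameter (the exact fitted scaling law differs slightly); a sketch applying it to PaLM:

```python
# Chinchilla rule of thumb: compute-optimal training uses roughly
# 20 tokens per parameter (an approximation, not the exact fitted law).
TOKENS_PER_PARAM = 20

def chinchilla_optimal_tokens(params: float) -> float:
    """Approximate compute-optimal token count for a dense model."""
    return TOKENS_PER_PARAM * params

palm_params = 540e9   # PaLM: 540B parameters
palm_tokens = 780e9   # PaLM: 780B training tokens

optimal = chinchilla_optimal_tokens(palm_params)
print(f"Chinchilla-optimal tokens for PaLM: ~{optimal / 1e12:.1f}T")
print(f"Actual tokens: {palm_tokens / 1e9:.0f}B, "
      f"i.e. under-trained by ~{optimal / palm_tokens:.0f}x")
```

By this estimate PaLM was heavily under-trained on data relative to its parameter count, which is exactly the imbalance PaLM 2's compute-optimal recipe corrected.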
Gemini 1.0 (December 2023)
Three model sizes:
- Nano: small models (1.8B and 3.25B) designed for on-device inference on mobile phones (Pixel 8 Pro). Distilled from larger Gemini models.
- Pro: mid-tier, comparable to GPT-3.5/GPT-4 Turbo. The default for Google's AI products.
- Ultra: largest and most capable. Claimed to match or exceed GPT-4 on most benchmarks at launch.
Native multimodality. Unlike GPT-4 (which bolted a vision encoder onto a text model) or CLIP-based approaches, Gemini 1.0 was trained from the start on interleaved text, images, audio, and video data. The model processes all modalities through a unified architecture rather than separate encoders connected by adapters.
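The difference can be sketched schematically (this is an illustration of the idea, not Google's actual implementation): in a natively multimodal model, every modality is mapped into a shared token space and interleaved in one sequence, rather than routed through separate encoders joined by adapters.

```python
from dataclasses import dataclass

@dataclass
class Token:
    modality: str   # "text", "image", "audio", or "video"
    token_id: int   # index into a shared embedding table (illustrative)

def interleave(segments):
    """Flatten (modality, token_ids) segments into one unified sequence."""
    return [Token(m, t) for m, ids in segments for t in ids]

# Hypothetical token ids, chosen only for illustration.
seq = interleave([
    ("text",  [101, 2023]),     # e.g. "The chart"
    ("image", [50001, 50002]),  # image patch tokens
    ("text",  [3084, 102]),     # e.g. "shows ..."
])
print([(t.modality, t.token_id) for t in seq])
```

A single transformer then attends across the whole sequence, so cross-modal relationships (an image and its caption, say) are learned directly during pretraining.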
Training infrastructure. Trained on TPU v4 and v5e pods. Google's control over both hardware and software (JAX, XLA compiler) enables training optimizations not available to labs using Nvidia GPUs with third-party frameworks.
Gemini 1.5 (February 2024)
Gemini 1.5 Pro
Context length. Up to 1 million tokens, later extended to 2 million in experimental settings. At launch, this was several times larger than any competitor's production context window (Claude 2.1: 200K; GPT-4 Turbo: 128K).
Architecture. Confirmed to use a mixture-of-experts architecture (details sparse). The MoE design enables processing long sequences more efficiently than a dense model of equivalent quality.
What 1M tokens means in practice. A 1M token context can hold: approximately 700K words of text (several books), 1 hour of video, 11 hours of audio, or an entire medium-sized codebase. This is not just a quantitative improvement; it enables qualitatively different use cases. You can process an entire repository, analyze a full legal contract, or answer questions about a movie without chunking.
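The capacities quoted above imply rough per-modality token rates; a back-of-envelope sketch (these rates are inferred from the figures in this section, not published specifications):

```python
# Implied rates if 1M tokens ~= 700K words of text, 1 hour of video,
# or 11 hours of audio (assumptions derived from the text above).
CONTEXT = 1_000_000

words_per_token = 700_000 / CONTEXT           # ~0.7 words per token
video_tokens_per_sec = CONTEXT / 3600         # ~278 tokens/s of video
audio_tokens_per_sec = CONTEXT / (11 * 3600)  # ~25 tokens/s of audio

print(f"{words_per_token:.2f} words/token")
print(f"{video_tokens_per_sec:.0f} video tokens/s")
print(f"{audio_tokens_per_sec:.1f} audio tokens/s")
```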
Quadratic Attention Cost for Long Sequences
Statement
Standard self-attention computes a full attention matrix, requiring $O(n^2 d)$ FLOPs and $O(n^2)$ memory per layer. For $n$ tokens, the attention matrix has $n^2$ entries per layer. Without architectural modifications (sparse attention, linear attention, or MoE routing), processing a 1M token sequence with standard dense attention is computationally prohibitive.
Intuition
Every token attends to every other token. Doubling the sequence length quadruples the attention cost. At 1M tokens, the naive attention matrix is a trillion-entry matrix per layer. Practical long-context models must use some form of approximation, sparsity, or compression to make this tractable.
Proof Sketch
The attention computation is $\mathrm{softmax}(QK^\top / \sqrt{d_k})\,V$. The matrix $QK^\top$ has dimensions $n \times n$. Computing it requires $O(n^2 d)$ multiplications. Storing it requires $O(n^2)$ memory. For $n = 10^6$, $n^2 = 10^{12}$.
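The numbers in the proof sketch can be computed directly:

```python
# Concrete cost of standard dense self-attention at long sequence lengths.
def attn_matrix_entries(n: int) -> int:
    return n * n                  # QK^T is an n x n matrix

def attn_matmul_flops(n: int, d: int) -> int:
    return 2 * n * n * d          # multiply-adds to form QK^T

for n in (100_000, 1_000_000):
    entries = attn_matrix_entries(n)
    gb_fp32 = entries * 4 / 1e9   # float32: 4 bytes per entry
    print(f"n={n:>9,}: {entries:.1e} entries, "
          f"{gb_fp32:,.0f} GB in fp32 per layer per head")
```

At $n = 10^6$ the matrix alone is about 4 TB per layer per head in float32, which is why naive materialization is out of the question.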
Why It Matters
Long-context capability requires either accepting the quadratic cost (feasible only with massive parallelism and hardware budget) or using approximations. Gemini 1.5's ability to handle 1M+ tokens implies significant engineering around attention efficiency, possibly including sparse attention patterns, KV cache compression, or sliding-window approaches combined with global attention at intervals.
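One common pattern from the long-context literature (a sketch; Gemini's actual attention design is not public) combines a sliding window with a few global tokens that attend everywhere, cutting cost from $O(n^2)$ toward $O(nw)$:

```python
import numpy as np

def sparse_attention_mask(n: int, window: int, global_every: int) -> np.ndarray:
    """Boolean mask: True where query i may attend to key j."""
    i = np.arange(n)[:, None]
    j = np.arange(n)[None, :]
    local = np.abs(i - j) <= window                            # sliding window
    glob = (i % global_every == 0) | (j % global_every == 0)   # global tokens
    return local | glob

mask = sparse_attention_mask(n=16, window=2, global_every=8)
# Fraction of the full n x n matrix that is actually computed:
print(mask.mean())
```

The window and global-token spacing are tunable; as `n` grows with `window` fixed, the computed fraction shrinks roughly like `window / n`.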
Failure Mode
Quadratic attention cost means that even if a model supports 1M tokens in theory, the latency and cost scale quadratically. A 1M token prompt costs roughly 100x more than a 100K token prompt (for the attention component). Users must weigh whether the benefit of full-context processing justifies the cost versus chunking strategies that process segments independently.
Gemini 1.5 Flash
A smaller, faster variant optimized for latency and cost. Designed for high-volume applications where speed matters more than maximum quality. Comparable to Claude 3 Haiku and GPT-4o-mini in the cost/performance trade-off.
Gemini 2.0 (December 2024)
Gemini 2.0 Flash. The initial Gemini 2.0 release focused on the Flash tier, with improved reasoning, coding, and multimodal capabilities over 1.5 Flash. Introduced native tool use and agentic capabilities.
Multimodal output. Gemini 2.0 can generate images and audio natively (not just accept them as input). This moves toward a unified model that handles all modalities in both directions.
Gemma (Open Weights)
Gemma 1 (February 2024). Open-weight models at 2B and 7B parameters. Trained on up to 6T tokens of web, code, and math data. Based on the Gemini architecture but scaled down. Released under Google's Gemma license, which allows commercial use.
Gemma 2 (June 2024). Sizes: 2B, 9B, 27B. Improved quality through knowledge distillation from larger models and better training recipes. The 27B model is competitive with much larger open models on many benchmarks.
Gemma 3 (2025). Extended the family with additional sizes and multimodal variants. Open-weight models that can process images alongside text.
Significance. Gemma gives Google a presence in the open-weight ecosystem alongside Meta's Llama and Alibaba's Qwen. The smaller sizes (2B, 7B) are particularly useful for on-device deployment and resource-constrained settings.
Google's Infrastructure Advantage
TPUs. Google designs its own training hardware (Tensor Processing Units). TPU v4 and v5 pods provide large-scale interconnected compute clusters optimized for transformer training. This vertical integration (hardware + compiler + framework + model) gives Google a unique position: they do not depend on Nvidia's GPU supply chain.
Data. Google has proprietary access to web-scale data (the Search index, YouTube transcripts, Google Books, Google Scholar). The quality and diversity of this training data are a significant but largely unquantifiable advantage.
Integration. Gemini is integrated into Google Search, Gmail, Docs, and other products with billions of users. This provides a distribution advantage and a feedback loop (user interactions can inform future training).
Comparison with GPT and Claude
- Multimodality. Gemini's native multimodal training is architecturally different from GPT-4's approach of connecting separate vision and text modules. Whether this produces meaningfully better multimodal performance depends on the specific task.
- Long context. Gemini 1.5 Pro's 1M+ token context exceeds competitors by a large margin (Claude: 200K, GPT-4: 128K). For tasks that require processing very large documents, this is a clear advantage.
- Open weights. Gemma provides open-weight models, but the frontier Gemini models are closed. This parallels Claude (no open weights) and differs from DeepSeek and Meta (open weights for frontier models).
- Reasoning. On standard benchmarks, Gemini Ultra/Pro is competitive with GPT-4 and Claude Opus. Relative rankings vary by benchmark and change with each release.
Common Confusions
Native multimodality does not automatically mean better vision
Being trained jointly on text and images does not guarantee superior visual understanding. A model trained separately on vision and text with a good adapter can perform comparably on many tasks. The advantage of native multimodality is more subtle: the model can learn cross-modal relationships during pretraining (e.g., a graph and its caption jointly), which may help on tasks requiring tight integration between modalities.
1M context does not mean the model uses all 1M tokens equally well
Research on long-context models shows that performance can degrade for information placed in the middle of very long contexts (the "lost in the middle" phenomenon). Gemini 1.5 Pro handles this better than most models but is not immune. Having 1M tokens of context does not guarantee that the model attends equally to all positions.
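The "lost in the middle" effect is typically measured with needle-in-a-haystack probes: plant a fact at varying depths in a long context and check recall. A minimal harness sketch (the `query_model` call is hypothetical and stands in for any LLM API):

```python
def build_prompt(filler: str, needle: str, depth: float, n_chars: int) -> str:
    """Insert the needle at a relative depth (0.0 = start, 1.0 = end)."""
    haystack = (filler * (n_chars // len(filler) + 1))[:n_chars]
    pos = int(depth * n_chars)
    return haystack[:pos] + "\n" + needle + "\n" + haystack[pos:]

needle = "The secret code is 7214."
for depth in (0.0, 0.25, 0.5, 0.75, 1.0):
    prompt = build_prompt("Lorem ipsum dolor sit amet. ", needle, depth, 2000)
    # answer = query_model(prompt + "\nWhat is the secret code?")  # hypothetical
    assert needle in prompt
```

Plotting recall against depth reveals whether a model's accuracy dips for facts placed mid-context.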
Exercises
Problem
Compute the memory required to store the attention matrix for a single layer of standard self-attention with sequence length $n = 10^6$ tokens, using float32 (4 bytes per entry). How does this compare to the memory of a 70B parameter model in float16?
Problem
Gemma 2B was trained on 6T tokens. Using the Chinchilla scaling law (optimal compute allocation: $D \approx 20N$, where $N$ is parameters and $D$ is tokens), is 2B parameters with 6T tokens compute-optimal? If not, which direction is the imbalance, and what does this imply about training strategy?
References
Canonical:
- Gemini Team, "Gemini: A Family of Highly Capable Multimodal Models" (Google DeepMind, 2023)
- Gemini Team, "Gemini 1.5: Unlocking Multimodal Understanding Across Millions of Tokens of Context" (2024)
Current:
- Chowdhery et al., "PaLM: Scaling Language Modeling with Pathways" (2022)
- Gemma Team, "Gemma: Open Models Based on Gemini Research and Technology" (2024)
- Schaeffer et al., "Are Emergent Abilities of Large Language Models a Mirage?" (2023), NeurIPS
Next Topics
- Model comparison table: structured comparison across frontier model families
Last reviewed: April 2026
Prerequisites
Foundations this topic depends on.
- Transformer Architecture (Layer 4)
- Attention Mechanism Theory (Layer 4)
- Matrix Operations and Properties (Layer 0A)
- Sets, Functions, and Relations (Layer 0A)
- Basic Logic and Proof Techniques (Layer 0A)
- Softmax and Numerical Stability (Layer 1)
- Feedforward Networks and Backpropagation (Layer 2)
- Differentiation in $\mathbb{R}^n$ (Layer 0A)
- Matrix Calculus (Layer 1)
- The Jacobian Matrix (Layer 0A)
- The Hessian Matrix (Layer 0A)
- Activation Functions (Layer 1)
- Convex Optimization Basics (Layer 1)