Paper breakdown
Kimi K2: Open Agentic Intelligence
Kimi Team · 2025 · arXiv preprint (technical report)
1.04 trillion total / 32 billion activated parameter Mixture-of-Experts LLM trained on 15.5T tokens with the MuonClip optimizer (Muon plus per-head QK-Clip post-update weight rescaling). Pairs ultra-sparse MoE with multi-head latent attention from DeepSeek-V3 and a synthetic agentic data pipeline. Top open-weight model on Tau2-bench, ACEBench, and SWE-bench Verified at release.
Overview
Moonshot AI's Kimi K2 (2025) is a 1.04 trillion-parameter Mixture-of-Experts (MoE) language model with 32 billion activated parameters, pre-trained on 15.5 trillion tokens with no observed loss spike across the entire run. The technical report's three engineering contributions are:
- MuonClip, an optimizer that combines the Muon update (Newton-Schulz-orthogonalized gradient with RMS rescaling) with a post-update weight clip on attention query and key projections (QK-Clip) to prevent the attention-logit growth that destabilizes Muon at trillion-parameter scale.
- A synthetic agentic data pipeline that produces tools, thousands of agents, multi-turn rubric-evaluated trajectories, and a hybrid simulated-plus-real-execution sandbox, used both for SFT and as the source of verifiable rewards.
- A self-critique rubric reward that extends RL-from-verifiable-reward (RLVR) to subjective tasks by having the model rank its own outputs against curated rubrics in a closed loop with the policy.
Architecturally, K2 follows the DeepSeek-V3 template but pushes sparsity further: 384 experts (versus 256 in DeepSeek-V3) with the same 8 active per token. It also halves the attention-head count from 128 to 64 and reduces the number of dense layers to 1. The model uses Multi-head Latent Attention (MLA) for KV-cache compression. The result is the top open-source non-thinking model on Tau2-bench (66.1), ACEBench-En (76.5), and SWE-bench Verified (65.8) at release, and the fifth-overall entry on the LMSYS Arena leaderboard by user votes.
The paper is a technical report, not a methodologically novel contribution. Its weight is in the demonstrated stability of the training run at trillion-parameter MoE scale and in the public release of base and post-trained checkpoints. Most prior open-weight models at this scale (DeepSeek-V3, Llama 4-class) have been AdamW-trained; Kimi K2 is the first open release showing that Muon-family optimization can be made stable at this scale with a small post-hoc fix.
Mathematical Contributions
Architecture: ultra-sparse MoE with MLA
K2 has 61 transformer layers, 1 dense and 60 MoE. Each MoE layer has 384 routed experts plus 1 shared expert; the gating function selects 8 experts per token, giving a sparsity ratio of $384 / 8 = 48$. For a fixed activated-parameter budget, the paper's sparsity scaling law (their Section 2.3 and Figure 5) shows validation loss continues to drop monotonically as sparsity grows: to reach the same compute-optimal validation loss, the sparsity-48 configuration needs substantially fewer FLOPs than a sparsity-8 configuration. The cost is a more complex routing infrastructure and a bigger memory footprint, both worth it at this scale.
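As a rough illustration of the routing arithmetic (not the report's implementation), here is a minimal top-8-of-384 router sketch in PyTorch; the hidden width and the softmax-over-selected-experts normalization are illustrative assumptions.

```python
import torch
import torch.nn.functional as F

# K2 routes each token to 8 of 384 experts; the hidden width here is illustrative.
N_EXPERTS, TOP_K, D_MODEL = 384, 8, 1024
SPARSITY = N_EXPERTS // TOP_K   # 48

def route_tokens(x: torch.Tensor, router_w: torch.Tensor):
    """Return (expert indices, gate weights) for each token.

    x:        [num_tokens, D_MODEL]
    router_w: [D_MODEL, N_EXPERTS]
    """
    logits = x @ router_w                            # [num_tokens, N_EXPERTS]
    topk_logits, topk_idx = logits.topk(TOP_K, dim=-1)
    gates = F.softmax(topk_logits, dim=-1)           # normalize over the 8 selected experts only
    return topk_idx, gates

x = torch.randn(16, D_MODEL)
router_w = torch.randn(D_MODEL, N_EXPERTS) * 0.02
idx, gates = route_tokens(x, router_w)
print(idx.shape, gates.shape, SPARSITY)              # torch.Size([16, 8]) torch.Size([16, 8]) 48
```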
The attention is Multi-head Latent Attention (MLA, from DeepSeek-V2/V3), which factorizes the key and value projections through a low-rank latent: $c_t^{KV} = W^{DKV} h_t$, $k_t = W^{UK} c_t^{KV}$, $v_t = W^{UV} c_t^{KV}$, where $c_t^{KV}$ is a $d_c$-dimensional latent vector cached per token (together with a small head-shared rotary key) instead of the full per-head keys and values. This gives an order-of-magnitude KV-cache compression versus full multi-head attention with the same head dimension. K2 uses 64 attention heads (DeepSeek-V3 used 128); the paper's Figure 6 ablation shows doubling heads from 64 to 128 buys only a marginal validation-loss improvement at the cost of a substantial increase in inference FLOPs at 128K-token context. The head count is reduced for inference efficiency, especially in long-context agentic workloads.
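A back-of-envelope comparison of per-token cache sizes, assuming DeepSeek-V3-style MLA dimensions (512-dim latent plus 64 rotary dims); these are illustrative values, not taken from the K2 report.

```python
# Per-token, per-layer KV-cache size: full multi-head attention vs MLA.
# Dimensions are illustrative (DeepSeek-V3-style), not the K2 report's exact values.
n_heads, d_head = 64, 128
d_latent, d_rope = 512, 64

mha_cache = 2 * n_heads * d_head   # full keys + values: 16384 numbers per token
mla_cache = d_latent + d_rope      # compressed latent + shared rotary key: 576 numbers per token
print(f"MLA cache is roughly {mha_cache / mla_cache:.1f}x smaller per token")   # ~28x
```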
The Muon update
Muon (Jordan et al., 2024) replaces AdamW's diagonal preconditioner with a Newton-Schulz orthogonalization of the momentum-smoothed gradient. For each weight matrix $W \in \mathbb{R}^{n \times m}$ with gradient $G_t$: $M_t = \mu M_{t-1} + G_t$, $O_t = \mathrm{NS}(M_t) \approx U V^\top$ (the orthogonal factor of the SVD $M_t = U \Sigma V^\top$), and $W_t = W_{t-1} - \eta_t \left(0.2\sqrt{\max(n, m)}\, O_t + \lambda W_{t-1}\right)$. The orthogonalization projects the update onto the set of semi-orthogonal matrices (the Stiefel manifold), and the factor $0.2\sqrt{\max(n, m)}$ is chosen so that the update's RMS matches what AdamW would produce at the same point. The Newton-Schulz iteration applies a fixed odd matrix polynomial to $M_t$ repeatedly and converges in a handful of steps (five in practice) to the orthogonalized matrix without an SVD. Empirically, at constant compute and data, Muon substantially outperforms AdamW on token efficiency for standard transformer training.
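A minimal single-matrix sketch of the update, assuming the quintic Newton-Schulz coefficients used in the public Muon reference implementation and the $0.2\sqrt{\max(n, m)}$ RMS-matching factor described above; the learning rate, momentum, and weight decay are illustrative.

```python
import torch

def newton_schulz(G: torch.Tensor, steps: int = 5, eps: float = 1e-7) -> torch.Tensor:
    """Approximate the orthogonal factor U V^T of G with a quintic Newton-Schulz iteration.

    Coefficients follow the public Muon reference implementation.
    """
    a, b, c = 3.4445, -4.7750, 2.0315
    X = G / (G.norm() + eps)                 # normalize so the iteration converges
    transposed = X.shape[0] > X.shape[1]
    if transposed:
        X = X.T
    for _ in range(steps):
        A = X @ X.T
        B = b * A + c * (A @ A)
        X = a * X + B @ X
    return X.T if transposed else X

def muon_step(W, G, M, lr=0.02, momentum=0.95, weight_decay=0.1):
    """One Muon update for a single 2-D weight; hyperparameters are illustrative."""
    M = momentum * M + G                      # momentum-smoothed gradient
    O = newton_schulz(M)                      # orthogonalized update direction
    rms_match = 0.2 * max(W.shape) ** 0.5     # rescale so update RMS roughly matches AdamW's
    W = W - lr * (rms_match * O + weight_decay * W)
    return W, M

W = torch.randn(256, 512) * 0.02
M = torch.zeros_like(W)
G = torch.randn_like(W)
W, M = muon_step(W, G, M)
```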
Why scaling Muon is hard: attention-logit explosion
The paper documents an instability that appears more frequently with Muon than with AdamW: per-head max attention logits grow without bound during training. In a mid-scale 9B-activated, 53B-total experiment with vanilla Muon, the per-head max logit blew up early in training and triggered loss spikes (paper Figure 2 left). Logit soft-cap (Gemma's fix) directly clips the logit but allows the underlying $q^\top k$ products to keep growing; Query-Key Normalization (QK-Norm) normalizes queries and keys before the dot product but is incompatible with MLA because the full key matrix is never materialized.
QK-Clip: post-update weight rescaling
The contribution is QK-Clip: after each optimizer step, for any head whose forward-pass max logit exceeded a threshold $\tau$ (the paper uses $\tau = 100$), rescale the query and key projection weights so that the logit comes back below $\tau$ on the next forward pass.
Per-head scale factor: $\gamma_h = \min\!\bigl(1,\ \tau / S^h_{\max}\bigr)$, where $S^h_{\max}$ is the max attention logit observed for head $h$. For a regular multi-head attention layer: $W_q^h \leftarrow \gamma_h^{\alpha}\, W_q^h$ and $W_k^h \leftarrow \gamma_h^{1-\alpha}\, W_k^h$, with $\alpha = 1/2$ by default (split the rescaling symmetrically between $W_q$ and $W_k$). For MLA, the head-shared rotary key is left untouched to avoid coupling heads; the head-specific query and key components are each scaled by $\sqrt{\gamma_h}$, while the head-specific rotary query is scaled by $\gamma_h$.
Crucially, QK-Clip does not alter the forward pass that already happened: the gradient on this step is computed against the unclipped weights. The clip applies post-update, treating the observed $S^h_{\max}$ as a guiding signal for next-step weight magnitude. Because $\gamma_h = \min(1, \tau / S^h_{\max}) \le 1$, the rescaling is monotone-decreasing, so it acts as a one-sided projection onto the constraint set $\{W : S^h_{\max} \le \tau\}$.
The full optimizer, MuonClip, is the Muon update followed by the QK-Clip rescaling. Over the entire 15.5T-token K2 training run, the max logit hits the threshold $\tau$ in roughly 30% of steps during the first phase, then decays naturally below $\tau$ as the optimization stabilizes; QK-Clip stops being active for the remainder of training (paper Figure 2 right).
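A minimal sketch of the plain multi-head-attention case, assuming per-head query/key weights stored as explicit slices; the MLA variant (scaling the head-specific components by $\sqrt{\gamma_h}$ and the head-specific rotary query by $\gamma_h$) is analogous and omitted here.

```python
import torch

TAU = 100.0   # per-head max-logit threshold (the value assumed in the text above)

@torch.no_grad()
def qk_clip_mha(w_q: torch.Tensor, w_k: torch.Tensor,
                max_logit_per_head: torch.Tensor, tau: float = TAU, alpha: float = 0.5) -> None:
    """Post-update QK-Clip for a plain multi-head attention layer (not the MLA variant).

    w_q, w_k:            [n_heads, d_head, d_model] per-head projection weights (rescaled in place)
    max_logit_per_head:  [n_heads] max pre-softmax logit observed on the last forward pass
    """
    for h, s_max in enumerate(max_logit_per_head.tolist()):
        if s_max > tau:
            gamma = tau / s_max                 # gamma < 1, so the rescaling only ever shrinks weights
            w_q[h] *= gamma ** alpha            # split the shrink between the query projection ...
            w_k[h] *= gamma ** (1.0 - alpha)    # ... and the key projection (alpha = 0.5 by default)

# Example: clip one head whose observed max logit exceeded the threshold.
w_q = torch.randn(64, 128, 1024)
w_k = torch.randn(64, 128, 1024)
logits = torch.full((64,), 50.0)
logits[3] = 250.0
qk_clip_mha(w_q, w_k, logits)
```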
Pre-training data: rephrasing for token efficiency
K2's 15.5T tokens are enriched by a synthetic-rephrasing pipeline. The premise is that a single pass over knowledge-rich text provides incomplete absorption while many passes overfit; rephrasing the same text into many stylistic variants amplifies effective tokens without overfitting. Knowledge data is processed by:
- Style- and perspective-diverse prompting (extending WRAP, Maini et al., 2024): an LLM is prompted to rephrase a passage in different writing styles and from different perspectives.
- Chunk-wise autoregressive generation: long passages are split into fixed-size chunks; each chunk is rephrased autoregressively, conditioned on the previous chunk's rephrased output, preserving global coherence.
- Fidelity verification: a separate model checks semantic equivalence of the rephrased chunk to the original.
Empirically (paper Table 1), at fixed compute, rephrasing the data once and training for 10 epochs scores higher on SimpleQA than training on the raw data for 10 epochs; rephrasing 10 times and training for a single epoch improves accuracy further. The mathematics rephrasing follows a "learning-note" style adapted from SwallowMath (Fujii et al., 2024).
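A control-flow sketch of the chunk-wise rephrasing loop, with hypothetical stand-ins (`call_rephraser`, `check_fidelity`) for the rephrasing and fidelity-verification models; the prompt wording, chunking, and drop-on-failure policy are illustrative choices, not the report's exact recipe.

```python
def call_rephraser(prompt: str) -> str:
    """Hypothetical LLM call; stubbed here to echo the passage so the sketch runs."""
    return prompt.rsplit("Passage:\n", 1)[-1]

def check_fidelity(original: str, rewrite: str) -> bool:
    """Hypothetical semantic-equivalence check; a real check would use a verifier model."""
    return True

def rephrase_passage(chunks: list[str], style: str) -> list[str] | None:
    rephrased, prev = [], ""
    for chunk in chunks:
        prompt = (
            f"Rewrite the following passage in a {style} style, staying consistent "
            f"with the rewritten text so far.\n\nRewritten so far:\n{prev}\n\nPassage:\n{chunk}"
        )
        out = call_rephraser(prompt)
        if not check_fidelity(chunk, out):
            return None                     # illustrative policy: discard the passage if any chunk drifts
        rephrased.append(out)
        prev = out                          # condition the next chunk on this chunk's rewrite
    return rephrased

print(rephrase_passage(["The MoE layer has 384 experts.", "Eight are active per token."], "lecture-note"))
```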
RL algorithm: KL-regularized off-policy importance-style objective
K2 inherits the policy-optimization objective from K1.5. For each prompt $x$ from the dataset $\mathcal{D}$, $K$ responses $y_1, \dots, y_K$ are sampled from the previous policy $\pi_{\text{old}}$. The objective is $$\mathcal{L}(\theta) = \mathbb{E}_{x \sim \mathcal{D}}\left[\frac{1}{K} \sum_{i=1}^{K} \left(r(x, y_i) - \bar{r}(x) - \tau \log \frac{\pi_\theta(y_i \mid x)}{\pi_{\text{old}}(y_i \mid x)}\right)^{2}\right],$$ where $\bar{r}(x)$ is the per-prompt mean reward used as a control variate and $\tau > 0$ is a KL-regularization coefficient. This is a squared-error fit of the temperature-scaled policy log-ratio to an advantage-style centered reward, which supplies the learning signal and a KL-like pull toward the old policy at the same time. Unlike PPO, it has no clipping; unlike DPO, it admits arbitrary scalar rewards rather than only pairwise preferences.
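A sketch of the surrogate loss as reconstructed above, assuming per-sequence log-probabilities and leaving out engineering details (per-token treatment, gradient stops) that the report does not spell out.

```python
import torch

def k2_rl_loss(logp_new: torch.Tensor, logp_old: torch.Tensor,
               rewards: torch.Tensor, tau: float = 0.5) -> torch.Tensor:
    """Squared-error surrogate with a per-prompt mean-reward baseline.

    logp_new, logp_old: [num_prompts, K] sequence log-probs of the K sampled responses per prompt
    rewards:            [num_prompts, K] scalar rewards
    tau:                regularization strength (illustrative value)
    """
    baseline = rewards.mean(dim=-1, keepdim=True)    # \bar{r}(x), the control variate
    advantage = rewards - baseline                   # centered reward
    log_ratio = logp_new - logp_old                  # log pi_theta / pi_old per response
    return ((advantage - tau * log_ratio) ** 2).mean()

loss = k2_rl_loss(torch.randn(4, 8), torch.randn(4, 8), torch.randn(4, 8))
print(loss)
```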
The optimizer for the RL stage is also Muon (with QK-Clip carried over), which the paper claims is the first published RL run at this scale on a Muon-family optimizer. Three additional regularizers stabilize the run:
- Token-budget penalty. Each task is assigned a maximum response-token budget; responses exceeding the budget are truncated and assigned a penalty. This counteracts the well-known "RL makes responses longer" effect on tasks where verbosity is not actually rewarded.
- PTX auxiliary loss. A small fraction of pre-training-quality tokens are mixed into the RL gradient batch, weighted by an auxiliary PTX loss term, to prevent forgetting of the SFT distribution.
- Temperature decay. Sampling temperature starts high (encouraging exploration on creative-writing and reasoning tasks) and decays linearly toward a lower final value over training, shifting from exploration to exploitation (a minimal sketch of these regularizers follows this list).
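A minimal sketch of the last two regularizers; the endpoint temperatures, the penalty size, and the linear schedule are illustrative assumptions rather than the report's values.

```python
def temperature(step: int, total_steps: int, t_start: float = 1.0, t_end: float = 0.6) -> float:
    """Linear exploration-to-exploitation decay; the endpoint values are illustrative."""
    frac = min(step / max(total_steps, 1), 1.0)
    return t_start + (t_end - t_start) * frac

def apply_token_budget(reward: float, n_tokens: int, budget: int, penalty: float = 0.5) -> float:
    """Penalize responses that ran past their per-task token budget; the penalty size is illustrative."""
    return reward - penalty if n_tokens > budget else reward

print(temperature(0, 1000), temperature(1000, 1000))        # 1.0 -> 0.6
print(apply_token_budget(1.0, n_tokens=9000, budget=8192))  # 0.5
```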
Self-critique rubric reward
For non-verifiable tasks (creative writing, open-ended dialogue, faithfulness), K2 introduces a Self-Critique Rubric Reward. The model itself acts as a critic: given a prompt and candidate responses, it ranks them by performing pairwise comparisons against three rubric sets: core rubrics (the model's invariant identity values), prescriptive rubrics (anti-reward-hacking constraints), and human-annotated rubrics (instruction-specific). The pairwise comparisons are aggregated into a scalar reward via Bradley-Terry-style ranking.
The critic is bootstrapped from a curated open-source plus in-house preference dataset during SFT. During RL, the critic itself is closed-loop refined: on-policy rollouts from verifiable-reward prompts continually update the critic via supervised loss against the verifiable reward, transferring objective performance signal into the critic's evaluation. Over training, the critic and the policy improve together; the critic ratchets its evaluation bar in lockstep with policy quality. This is the K2-specific implementation of constitutional-AI-style self-supervised alignment, with the difference that the critic gets signal from verifiable tasks in real time rather than from a fixed constitution.
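The report only says the pairwise critic judgments are aggregated Bradley-Terry-style; one standard way to do that is the minorization-maximization fit below, shown as an illustrative sketch rather than the paper's exact procedure.

```python
import numpy as np

def bradley_terry_scores(wins: np.ndarray, iters: int = 100, eps: float = 1e-8) -> np.ndarray:
    """Fit Bradley-Terry strengths from a pairwise win-count matrix via the standard MM updates.

    wins[i, j] = number of times response i was preferred over response j.
    Returns one positive score per response, normalized to sum to 1.
    """
    n = wins.shape[0]
    p = np.ones(n) / n
    for _ in range(iters):
        total_wins = wins.sum(axis=1)
        denom = np.zeros(n)
        for i in range(n):
            for j in range(n):
                if i != j:
                    denom[i] += (wins[i, j] + wins[j, i]) / (p[i] + p[j] + eps)
        p = total_wins / (denom + eps)
        p = p / p.sum()
    return p

# Toy example: three candidate responses, response 0 wins most pairwise comparisons.
wins = np.array([[0., 3., 4.],
                 [1., 0., 2.],
                 [0., 2., 0.]])
print(bradley_terry_scores(wins))   # highest score for response 0
```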
Agentic data synthesis
For tool-use post-training, K2's pipeline (paper Section 3.1, Figure 8) constructs tools by combining real Model Context Protocol (MCP) tools from GitHub with synthetic tools generated through a hierarchical domain ontology. From this tool repository it synthesizes thousands of distinct agent system prompts, each equipped with a different tool subset, then generates rubric-paired tasks per agent, then runs a multi-agent simulation (user agent, target agent, tool simulator, judge agent) to produce trajectories. A judge agent filters trajectories against the task rubric; only successful ones enter SFT data. For coding and software engineering, simulated tool execution is replaced with real Kubernetes-backed sandboxes that run actual code and unit tests. The combined output both feeds the SFT distribution and provides verifiable-reward signal for the RL stage.
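A control-flow sketch of the simulate-and-filter loop; every component (task generation, the multi-agent rollout, the judge) is a hypothetical stub standing in for an LLM-driven module, and only the loop structure mirrors the pipeline description.

```python
from dataclasses import dataclass

@dataclass
class Task:
    prompt: str
    rubric: str

def generate_tasks(agent_prompt: str, n: int) -> list[Task]:
    """Hypothetical task generator; each task carries the rubric it will be judged against."""
    return [Task(f"task {i} for this agent", "rubric: completes the tool call correctly") for i in range(n)]

def simulate_episode(agent_prompt: str, task: Task) -> list[dict]:
    """Stand-in for the user-agent / target-agent / tool-simulator rollout."""
    return [{"role": "user", "content": task.prompt}, {"role": "assistant", "content": "<tool call>"}]

def judge_passes(trajectory: list[dict], rubric: str) -> bool:
    """Stand-in for the rubric-based judge agent."""
    return True

def synthesize_sft_data(agent_prompts: list[str], tasks_per_agent: int) -> list[list[dict]]:
    accepted = []
    for agent in agent_prompts:                          # thousands of tool-equipped system prompts
        for task in generate_tasks(agent, tasks_per_agent):
            traj = simulate_episode(agent, task)
            if judge_passes(traj, task.rubric):          # only rubric-passing trajectories enter SFT
                accepted.append(traj)
    return accepted

print(len(synthesize_sft_data(["You are a travel-booking agent."], 3)))
```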
What Kimi K2 does not claim
It is not a chain-of-thought ("thinking") model in the o1/DeepSeek-R1 sense. The reported numbers are non-thinking baselines. The paper notes that thinking-mode extensions are an active area but not the focus.
It does not provide a sharp ablation isolating MuonClip from the rest of the architectural and data choices. The reported gains on standard benchmarks are joint effects of optimizer, sparsity, MLA, data rephrasing, and post-training. The optimizer stability claim ("zero loss spikes across 15.5T tokens") is the cleanest single-variable claim; the benchmark numbers are not.
It does not provide a learned-reward-model RL recipe (RLHF in the Christiano-2017 sense). The reward signal is verifiable for math/code/instruction-following, and self-critique-rubric for everything else. There is no separately trained reward model.
Connections to TheoremPath Topics
- Mixture of experts — the architectural backbone; sparsity-48 is at the high end of currently published configurations.
- Optimizer theory: SGD, Adam, Muon — Muon is the immediate predecessor; MuonClip is what makes it work at trillion-parameter scale.
- Preconditioned optimizers — the Newton-Schulz orthogonalization is a structured preconditioner that approximates the spectrum-flattening effect of full second-order methods.
- DeepSeek models — the architectural template K2 inherits (MLA, MoE design).
- Llama and open-weight models — peer family at the open-weight frontier.
- Attention variants and efficiency — Multi-head Latent Attention is one of the leading efficient-attention variants for inference.
- Reinforcement Learning from Human Feedback (deep dive) — the broader algorithmic family the K2 RL stage belongs to.
- Agentic RL and tool use — the tool-use post-training story; K2 is the most-detailed open-weight account of agentic SFT and RL.
- Post-training overview — the broader pipeline (SFT, rejection sampling, RL with verifiable and critic rewards).
- RLHF and alignment — the self-critique-rubric reward is in the constitutional-AI lineage.
Why It Matters Now
K2 is the first open-weight release at trillion-parameter MoE scale to use a non-AdamW optimizer end-to-end and provide a complete recipe for stabilizing it. AdamW dominates the open frontier through inertia rather than evidence; smaller-scale Muon studies have shown 20–40% token-efficiency gains, but transferring those gains to trillion-parameter MoE training has been blocked on training-instability questions. MuonClip is the first published answer that holds across the full scale. The next round of open-weight pretraining runs will plausibly default to MuonClip-style optimizers; the AdamW monopoly is over.
The agentic post-training recipe is also worth flagging. The dominant open-weight pattern through 2024–2025 was: pretrain on web text, SFT on cleaned instruction data, RLHF or DPO on preference data. Tool-use was added downstream as fine-tuning. K2 inverts this — synthetic agent trajectories with real-execution sandbox grounding are a first-class part of the post-training distribution, not a downstream adapter — and the SWE-bench Verified score (65.8) suggests this pays off in production-grade software-engineering tasks where tool use is the bottleneck. Expect every serious open-weight LLM after K2 to have a similar pipeline.
The architectural lesson on sparsity is more specific. K2's FLOP reduction at sparsity-48 is consistent with smaller-scale MoE scaling laws, but pushing to sparsity-48 at trillion-parameter scale has serious infrastructure cost: the all-to-all communication for routing 8 active experts out of 384 across nodes is intricate to make efficient. Moonshot has demonstrated this is doable; the open question is whether still higher sparsity can be made to work, and whether the FLOP-loss curve stays monotonic at those points or breaks.
The reduction in attention heads is a second architectural data point against the conventional wisdom inherited from DeepSeek-V3's 128-head configuration. In the long-context regime where agentic workloads live, attention FLOPs scale linearly in head count; the larger head count had been chosen for short-context bandwidth utilization, which is the wrong objective once contexts reach the 128K-token range. Halving heads is a small architectural change with a large inference-cost dividend.
References
Canonical:
- Kimi Team. (2025). "Kimi K2: Open Agentic Intelligence." arXiv preprint (technical report). arXiv:2507.20534. Model checkpoints: huggingface.co/moonshotai/Kimi-K2-Instruct.
Direct precursors (Muon and MoE):
- Jordan, K. et al. (2024). "Muon: An optimizer for hidden layers in neural networks." Blog post / arXiv. The base optimizer K2 builds on.
- Liu, A. et al. (2024). "DeepSeek-V3 Technical Report." arXiv:2412.19437. Source of MLA, the architecture template, and many of the MoE design choices.
- DeepSeek-AI. (2024). "DeepSeek-V2: A Strong, Economical, and Efficient Mixture-of-Experts Language Model." arXiv:2405.04434. Original MLA paper.
- Shazeer, N. et al. (2017). "Outrageously Large Neural Networks: The Sparsely-Gated Mixture-of-Experts Layer." ICLR 2017. arXiv:1701.06538. The MoE template.
Pre-training data and rephrasing:
- Maini, P. et al. (2024). "Rephrasing the Web: A Recipe for Compute and Data-Efficient Language Modeling (WRAP)." ICLR 2024. arXiv:2401.16380. The rephrasing-for-token-efficiency idea K2 extends.
- Kimi Team. (2025). "Kimi k1.5: Scaling Reinforcement Learning with LLMs." arXiv:2501.12599. The K1.5 RL algorithm K2 inherits.
RL and alignment lineage:
- Christiano, P. F. et al. (2017). "Deep Reinforcement Learning from Human Preferences." NeurIPS 2017. arXiv:1706.03741. Original RLHF.
- Bai, Y. et al. (2022). "Constitutional AI: Harmlessness from AI Feedback." arXiv:2212.08073. The self-critique-rubric lineage.
- Schulman, J. et al. (2017). "Proximal Policy Optimization Algorithms." arXiv:1707.06347. PPO; K2's KL-regularized squared-error objective is in the same family.
Attention-logit instability fixes (alternatives to QK-Clip):
- Gemma Team et al. (2024). "Gemma 2: Improving Open Language Models at a Practical Size." arXiv:2408.00118. Logit soft-cap.
- Henry, A. et al. (2020). "Query-Key Normalization for Transformers." Findings of EMNLP 2020. arXiv:2010.04245. QK-Norm; not applicable to MLA.
Standard textbook:
- Prince, S. J. D. (2023). Understanding Deep Learning. MIT Press. Chapter 12 — transformers and attention; Chapter 16 — pretraining and post-training.
Last reviewed: May 6, 2026