Model Timeline
Qwen and Chinese Models
The Chinese open-weight model ecosystem: Qwen (Alibaba), Yi (01.AI), Baichuan, GLM (Zhipu AI), and Kimi (Moonshot AI), with a focus on multilingual capability and independent scaling.
Why This Matters
China has developed an independent LLM ecosystem that is competitive with Western models on many benchmarks. Qwen (Alibaba) is the most prominent family, with open weights, strong multilingual support, and sizes from sub-billion to 235B MoE. Understanding this ecosystem matters because it represents a parallel development path with different data, hardware constraints (US export controls on top-tier GPUs), and design priorities.
Qwen Family (Alibaba Cloud)
Qwen 1 (August 2023)
Sizes. 1.8B, 7B, 14B, 72B. Training data. Over 2.4T tokens, with strong coverage of Chinese and English. Architecture. Dense decoder-only transformer with RoPE, SwiGLU activations, and RMSNorm.
Key feature. Multilingual tokenizer with a large vocabulary (approximately 150K tokens) covering Chinese, English, and code. Chinese-optimized tokenizers are significantly more efficient for Chinese text than tokenizers designed primarily for English (like LLaMA's).
Qwen 1.5 (February 2024)
Incremental improvement over Qwen 1. Better alignment, improved long-context handling (up to 32K tokens for larger sizes). Released under a more permissive license (Apache 2.0 for smaller sizes).
Qwen 2 (June 2024)
Sizes. 0.5B, 1.5B, 7B, 57B-A14B (MoE, 57B total with 14B active), 72B. Training data. Over 7T tokens. Context length. Up to 128K tokens for the 7B and 72B models.
Improvements. Significant quality gains across math, coding, and reasoning. The 72B model was competitive with Llama 3 70B on English benchmarks and stronger on Chinese. The MoE variant (57B-A14B) demonstrated that Alibaba was also pursuing sparse architectures alongside dense models.
Qwen 2.5 (September 2024)
Sizes. 0.5B, 1.5B, 3B, 7B, 14B, 32B, 72B. Training data. Over 18T tokens (a major increase from Qwen 2's 7T). Specialized variants: Qwen2.5-Coder for code generation and Qwen2.5-Math for mathematical reasoning.
Result. Qwen 2.5 72B was competitive with Llama 3.1 70B and Claude 3.5 Sonnet on several benchmarks. The breadth of model sizes (from 0.5B to 72B) makes the family useful across deployment scenarios from edge devices to servers.
Qwen 3 (2025)
Extended the family with Qwen3-235B-A22B, a 235B-parameter MoE model with 22B parameters active per token. Continued improvements in reasoning, code, and multilingual performance. The MoE variant follows the pattern established by DeepSeek-V2/V3: a large total parameter count with a small fraction active per token.
Tokenizer Efficiency
Tokenizer Compression Ratio for Multilingual Text
Statement
The compression ratio of a BPE tokenizer is the ratio of characters (or bytes) to tokens for a given text. A tokenizer trained primarily on English text achieves a compression ratio of approximately 4:1 for English (4 characters per token) but only 1.5-2:1 for Chinese (1.5-2 characters per token), because Chinese characters are underrepresented in the vocabulary. A tokenizer with adequate Chinese coverage achieves 2.5-3.5:1 compression for Chinese. Since API pricing and context limits are denominated in tokens, a 2x difference in compression ratio means 2x higher effective cost and 2x shorter effective context for the same text.
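The cost arithmetic can be sketched directly. The ratios are illustrative values from the ranges above, and the per-million-token price is hypothetical:

```python
# Back-of-envelope token-count and cost arithmetic for the same Chinese text
# under two compression ratios. Ratios are illustrative values from the
# ranges above; the per-million-token price is hypothetical.
def tokens_and_cost(n_chars: int, chars_per_token: float,
                    usd_per_mtok: float = 2.0):
    tokens = n_chars / chars_per_token
    return tokens, tokens / 1e6 * usd_per_mtok

doc = 1_000_000  # one million Chinese characters
for name, ratio in [("english-centric, 1.5 chars/tok", 1.5),
                    ("chinese-aware,   3.0 chars/tok", 3.0)]:
    toks, usd = tokens_and_cost(doc, ratio)
    print(f"{name}: {toks:,.0f} tokens, ${usd:.2f}")
```

The 2x ratio in characters per token translates one-for-one into a 2x ratio in token count, and therefore in cost and in effective context consumed.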
Intuition
BPE builds vocabulary by merging frequent character pairs. If the training corpus is mostly English, most vocabulary entries represent English subwords. Chinese characters that appear less frequently in training get fewer vocabulary entries, so each Chinese character often maps to multiple tokens. Qwen's tokenizer allocates more vocabulary to Chinese, giving better compression for Chinese text.
Proof Sketch
Empirical measurement. Tokenize the same Chinese text with LLaMA's tokenizer (32K vocab, English-heavy) and Qwen's tokenizer (150K vocab, Chinese-balanced). LLaMA produces roughly 2x more tokens for the same Chinese text. The theoretical basis is that BPE compression approaches the empirical entropy of the training corpus distribution; if Chinese is underrepresented, the tokenizer's "language model" of Chinese is poor, yielding longer encodings.
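The mechanism can be reproduced with a toy character-level BPE: train greedy merges on an English-heavy corpus versus a balanced one, then count tokens for the same Chinese sentence. All corpora and texts below are made-up toy data, not real tokenizer training sets:

```python
# Toy character-level BPE: the same Chinese sentence tokenizes very
# differently depending on the training corpus mix. Toy data only; real
# tokenizers train byte-level BPE on billions of words.
from collections import Counter

def train_bpe(corpus: str, num_merges: int):
    """Greedily merge the most frequent adjacent symbol pair, BPE-style."""
    words = [list(w) for w in corpus.split()]
    merges = []
    for _ in range(num_merges):
        pairs = Counter()
        for w in words:
            for a, b in zip(w, w[1:]):
                pairs[(a, b)] += 1
        if not pairs:
            break  # every word is already a single symbol
        (a, b), _ = pairs.most_common(1)[0]
        merges.append((a, b))
        for w in words:  # apply the new merge in place
            i = 0
            while i < len(w) - 1:
                if w[i] == a and w[i + 1] == b:
                    w[i:i + 2] = [a + b]
                else:
                    i += 1
    return merges

def tokenize(text: str, merges) -> list:
    toks = list(text)
    for a, b in merges:  # replay merges in training order
        i = 0
        while i < len(toks) - 1:
            if toks[i] == a and toks[i + 1] == b:
                toks[i:i + 2] = [a + b]
            else:
                i += 1
    return toks

english_heavy = "the cat sat on the mat " * 50 + "猫 坐 垫 " * 2
balanced = "the cat sat on the mat " * 10 + "猫坐在垫子上 " * 10

zh = "猫坐在垫子上"  # "the cat sits on the mat"
n_eng = len(tokenize(zh, train_bpe(english_heavy, 30)))
n_bal = len(tokenize(zh, train_bpe(balanced, 30)))
print(f"english-heavy tokenizer: {n_eng} tokens; balanced: {n_bal} tokens")
```

The gap is exaggerated here because the toy corpus repeats a single sentence, but the mechanism is exactly the one described above: character pairs that are rare in training never earn vocabulary entries, so the underrepresented language gets longer encodings.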
Why It Matters
For a multilingual application, tokenizer efficiency directly affects cost and capability. If your tokenizer uses 2x as many tokens for Chinese as for English, then Chinese users get half the effective context window and pay twice as much per character. Qwen's large multilingual vocabulary addresses this directly.
Failure Mode
A larger vocabulary increases the embedding matrix size (V x d parameters for vocabulary size V and embedding dimension d) and the output projection size by the same amount. For a 150K vocabulary with d = 8192, the embedding and output matrices together use roughly 2.5B parameters; even at d = 1024 they use about 0.3B. For smaller models (0.5B-1.5B), this is a significant fraction of total parameters. There is a trade-off between tokenizer efficiency and model parameter allocation.
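To make the trade-off concrete, a quick sketch. The width d = 1024 is an assumed value for a model of roughly 0.5B parameters, not an official Qwen config:

```python
# Parameter cost of a large vocabulary: V * d for the input embedding plus
# V * d for the LM head when untied. d = 1024 is an assumed width for a
# ~0.5B-parameter model, not an official config.
def embed_params(vocab: int, d: int, tied: bool = False) -> int:
    return vocab * d * (1 if tied else 2)

small_total = 0.5e9
p = embed_params(150_000, 1024)
print(f"untied: {p/1e6:.0f}M params = {100*p/small_total:.0f}% of a 0.5B model")
p_tied = embed_params(150_000, 1024, tied=True)
print(f"tied:   {p_tied/1e6:.0f}M params = {100*p_tied/small_total:.0f}% of a 0.5B model")
```

Weight tying (sharing one matrix between the input embedding and the LM head) halves this cost, which is a common mitigation for small models with large vocabularies.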
Other Chinese Model Families
Yi (01.AI)
Organization. 01.AI, founded by Kai-Fu Lee. Models. Yi-6B, Yi-34B (November 2023), Yi-1.5 (2024) with sizes up to 34B. Yi-Lightning and Yi-Large (2024) as proprietary API models.
Approach. High-quality English and Chinese performance from relatively small models. Yi-34B at launch was competitive with much larger models on both English and Chinese benchmarks. The focus is on data quality over raw scale: careful curation and filtering of training data.
Baichuan (Baichuan Intelligence)
Models. Baichuan-7B, Baichuan-13B (2023), Baichuan 2 (2023). Open-weight models with strong Chinese performance. Less prominent internationally than Qwen or Yi but widely used in Chinese industry applications.
GLM (Zhipu AI)
Models. ChatGLM, ChatGLM2, ChatGLM3, GLM-4 (2024), GLM-5 (February 2026). Based on the General Language Model architecture, which originally used a bidirectional attention prefix followed by autoregressive generation. GLM-4 was competitive with GPT-4 on Chinese benchmarks.
GLM-5 represents a significant jump. It is a native multimodal model (text, image, video, audio, code in a single architecture), supports long context (128K tokens), and includes built-in tool use and agentic capabilities. The technical report (Zhipu AI, 2026) describes a three-stage training pipeline: multimodal pretraining, supervised fine-tuning with diverse instruction data, and RLHF with both human and AI feedback. GLM-5 is competitive with GPT-4o and Claude 3.5 on multimodal benchmarks and demonstrates strong reasoning on math and coding tasks. Zhipu AI released it under a permissive license, continuing the Chinese open-weight trend.
Kimi (Moonshot AI)
Key innovation. Long context and reasoning. Kimi launched with 200K token context support in early 2024, making it one of the first consumer-facing products with context windows beyond 100K tokens. In 2025, Kimi k1.5 introduced multi-step reasoning with reinforcement learning, following the approach pioneered by DeepSeek-R1 and OpenAI o1. Kimi demonstrated that long context combined with chain-of-thought reasoning creates qualitatively different user experiences: the model can read an entire codebase and reason about architectural changes across files.
Other Notable Models (2025-2026)
MiniMax-01 (MiniMax, 2025): 456B parameter MoE model with "lightning attention" (a hybrid of linear attention and softmax attention). Strong on long-context tasks.
Step-2 (StepFun, 2025): Multimodal model with native video understanding. One of the first Chinese models to reach strong video comprehension scores on public benchmarks.
Seed (ByteDance, 2025): TikTok's parent company entered the frontier model race. Limited public details but deployed in ByteDance products.
The Chinese Open-Weight Ecosystem
Several factors shape the Chinese model landscape:
Hardware constraints. US export controls restrict Chinese labs' access to Nvidia H100 GPUs. Labs use H800 (a lower-bandwidth variant) or domestic alternatives. This may incentivize efficiency-focused architectures (MoE, quantization-friendly designs) and is one hypothesis for why Chinese labs like DeepSeek have led on MoE innovation.
Data landscape. Chinese internet data is distinct from English internet data in scale, quality distribution, and content. Models trained primarily on Chinese data perform better on Chinese tasks but may lag on English, and vice versa. Qwen's strategy of training on balanced multilingual data addresses this directly.
Regulatory environment. China's AI regulations require government review of public-facing AI services. This affects how models are deployed commercially but has not slowed research output.
Open-weight norms. Qwen, Yi, Baichuan, and DeepSeek all release model weights, often under permissive licenses. The Chinese open-weight ecosystem is growing independently of and in parallel with the Western ecosystem (Llama, Mistral, Gemma).
Common Confusions
Chinese models are not just for Chinese language
Qwen 2.5 72B performs competitively with Llama 3.1 70B on English benchmarks. Modern Chinese models are multilingual and competitive across languages. The "Chinese model" label refers to the organization's origin, not a language limitation.
Open weights does not mean identical training
Qwen, Yi, and DeepSeek all release weights, but their training data, procedures, and filtering criteria differ. The models are not interchangeable; they have different strengths even at similar parameter counts. Evaluating on your specific task is necessary.
Exercises
Problem
A tokenizer with 32K vocabulary tokenizes Chinese text at a compression ratio of 1.5 characters per token. A tokenizer with 150K vocabulary achieves 3.0 characters per token. If a model has a 128K token context window, how many Chinese characters can each tokenizer fit? What is the equivalent in approximate page count (assuming 800 Chinese characters per page)?
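One way to sanity-check the arithmetic, using only the numbers given in the problem statement:

```python
# Worked check for the problem's numbers (800 characters per page, as given).
context_tokens = 128_000
for vocab, chars_per_tok in [("32K", 1.5), ("150K", 3.0)]:
    chars = context_tokens * chars_per_tok
    pages = chars / 800
    print(f"{vocab} vocab: {chars:,.0f} characters ≈ {pages:.0f} pages")
```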
Problem
Qwen 2.5 has a vocabulary of approximately 150K tokens with embedding dimension d = 8192 (for the 72B model). Compute the number of parameters in the embedding layer and the output projection (LM head). What fraction of the total 72B parameters do these represent? Compare to Llama 3 70B with a 128K vocabulary and d = 8192.
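A sketch of the computation, assuming d = 8192 for both models (the widely reported width for Qwen 2.5 72B and Llama 3 70B) and untied embeddings:

```python
# Embedding + LM-head parameter counts, assuming d = 8192 for both models
# and untied embeddings.
def embed_plus_head(vocab: int, d: int) -> int:
    return 2 * vocab * d  # input embedding V*d plus output projection V*d

qwen = embed_plus_head(150_000, 8192)
llama = embed_plus_head(128_000, 8192)
print(f"Qwen 2.5 72B: {qwen/1e9:.2f}B = {100*qwen/72e9:.1f}% of 72B")
print(f"Llama 3 70B:  {llama/1e9:.2f}B = {100*llama/70e9:.1f}% of 70B")
```

At 70B+ scale these matrices are a small single-digit percentage of the total, which is why the large-vocabulary trade-off bites mainly at the small end of the family.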
References
Canonical:
- Bai et al., "Qwen Technical Report" (Alibaba, 2023)
- Yang et al., "Qwen2 Technical Report" (2024)
- Qwen Team, "Qwen2.5 Technical Report" (2024)
Current:
- 01.AI, "Yi: Open Foundation Models by 01.AI" (2024)
- Zhipu AI, "ChatGLM: A Family of Large Language Models from GLM-130B to GLM-4" (2024)
- Zhipu AI, "GLM-5 Technical Report" (arXiv:2602.15763, February 2026). Native multimodal, 128K context, agentic tool use.
- Kimi Team, "Kimi k1.5: Scaling Reinforcement Learning with LLMs" (2025)
Next Topics
- Model comparison table: structured comparison across frontier model families
- DeepSeek models: the MoE architecture and R1 reasoning approach
Last reviewed: April 2026
Prerequisites
Foundations this topic depends on.
- Transformer Architecture (Layer 4)
- Attention Mechanism Theory (Layer 4)
- Matrix Operations and Properties (Layer 0A)
- Sets, Functions, and Relations (Layer 0A)
- Basic Logic and Proof Techniques (Layer 0A)
- Softmax and Numerical Stability (Layer 1)
- Feedforward Networks and Backpropagation (Layer 2)
- Differentiation in R^n (Layer 0A)
- Matrix Calculus (Layer 1)
- The Jacobian Matrix (Layer 0A)
- The Hessian Matrix (Layer 0A)
- Activation Functions (Layer 1)
- Convex Optimization Basics (Layer 1)