Infrastructure
WebGPU for Machine Learning
WebGPU gives the browser an explicit GPU compute model: devices, queues, buffers, and WGSL kernels. That is the missing substrate for serious in-browser inference, custom kernels, and eventually browser-native training systems.
Why This Matters
If we want real browser-side ML artifacts, WebGPU is the substrate. It is the difference between a browser that can display model outputs and a browser that can actually run kernels for matmuls, attention, sampling, and rendering.
Older browser ML stacks had to tunnel through APIs that were designed for graphics rather than compute. WebGPU changes that. You get explicit buffers, command queues, compute pipelines, workgroups, and a shading language called WGSL designed for secure GPU execution on the web. That is why modern browser runtimes such as WebLLM and ONNX Runtime Web can do more than toy demos.
This matters directly for the roadmap we have been discussing:
- browser-side diffusion models,
- browser inference over open-weight models,
- kernel playgrounds and roofline demonstrations,
- 3D Gaussian Splatting,
- and eventually more ambitious artifacts such as activation engineering or training-scale simulations.
WebGPU is the browser layer where ML stops pretending to be graphics
The point is explicit compute: buffers, workgroups, command queues, and kernels written in WGSL. That is the substrate behind serious browser inference, custom kernels, and eventually training.
- TypeScript / JS app: tokenizer, scheduler, UI, model loader
- WebGPU API: device, queue, buffers, bind groups
- WGSL kernels: matmul, softmax, layernorm, sampling
- Metal / Vulkan / D3D12: native backend chosen by the browser
The web stack becomes viable only when the kernels are large enough to amortize upload cost, dispatch overhead, and intermediate materialization.
What WebGPU adds
Compute shaders, explicit buffers, command encoding, and predictable memory movement. That is why browser ML no longer has to pretend graphics APIs are tensor runtimes.
Where the cost still lives
Compilation latency, upload bandwidth, dispatch overhead, and VRAM pressure still dominate small or badly fused workloads. Browser GPU compute is real; it is not magic.
The systems lens
If a kernel chain makes $k$ full passes over a tensor of size $S$ bytes on a device with bandwidth $B$ bytes per second, then runtime is lower-bounded by roughly $kS/B$ before arithmetic even enters the conversation.
Mental Model
Think of WebGPU as the browser's safe wrapper around modern native GPU APIs. Your JavaScript or TypeScript code does not directly "run on the GPU." It allocates buffers, compiles WGSL shader modules, describes a compute pipeline, records commands, and submits those commands to a device queue. The GPU then executes the kernels in parallel.
That means browser ML performance is governed by the same systems questions that govern native GPU work:
- how many bytes cross memory,
- how many times intermediate tensors are materialized,
- whether kernels are fused or fragmented,
- whether the workload is large enough to amortize launch overhead,
- and whether the implementation respects the device's workgroup and memory structure.
Formal Setup
WebGPU
WebGPU is the browser API for modern GPU graphics and compute. The core objects for ML are:
- a device, which owns GPU resources and pipelines,
- a queue, which submits encoded work,
- buffers and textures, which hold data,
- bind groups, which describe which buffers a kernel may access,
- and compute pipelines, which pair a WGSL entry point with its layout.
The browser maps this model to native backends such as Metal, Vulkan, or Direct3D 12.
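A hedged sketch of how these objects fit together for a single dispatch. The calls follow the standard WebGPU API (`createBuffer`, `createComputePipeline`, `queue.submit`, and so on), but the `device` parameter is left untyped because the flow only actually runs in a WebGPU environment; the numeric usage-flag values mirror `GPUBufferUsage` constants from the spec.

```typescript
// Bytes needed for a buffer of n 32-bit floats.
function bufferByteSize(n: number): number {
  return n * 4;
}

// Illustrative host-side setup: buffers -> pipeline -> bind group -> queue.
// `device` is a GPUDevice; `wgsl` is a compute shader with entry point "main"
// that binds input at @binding(0) and output at @binding(1).
async function runKernel(device: any, wgsl: string, input: Float32Array) {
  const STORAGE = 0x0080;  // GPUBufferUsage.STORAGE
  const COPY_DST = 0x0008; // GPUBufferUsage.COPY_DST

  const inBuf = device.createBuffer({
    size: bufferByteSize(input.length),
    usage: STORAGE | COPY_DST,
  });
  device.queue.writeBuffer(inBuf, 0, input);

  const outBuf = device.createBuffer({
    size: bufferByteSize(input.length),
    usage: STORAGE,
  });

  const pipeline = device.createComputePipeline({
    layout: "auto",
    compute: {
      module: device.createShaderModule({ code: wgsl }),
      entryPoint: "main",
    },
  });

  const bindGroup = device.createBindGroup({
    layout: pipeline.getBindGroupLayout(0),
    entries: [
      { binding: 0, resource: { buffer: inBuf } },
      { binding: 1, resource: { buffer: outBuf } },
    ],
  });

  // Record commands, then submit the encoded work to the device queue.
  const encoder = device.createCommandEncoder();
  const pass = encoder.beginComputePass();
  pass.setPipeline(pipeline);
  pass.setBindGroup(0, bindGroup);
  pass.dispatchWorkgroups(Math.ceil(input.length / 64));
  pass.end();
  device.queue.submit([encoder.finish()]);
}
```

Note that nothing here touches the GPU directly: the host code only describes resources and records commands; execution happens after `queue.submit`.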
WGSL compute kernel
A WGSL compute kernel is a shader entry point that runs over a grid of invocations grouped into workgroups. A typical elementwise ML kernel works like this: each invocation receives an element index, reads one or more locations from the input buffer, applies a transform, and writes the result back to an output buffer. Matrix multiplication, normalization, sampling, and attention all reduce to larger versions of this pattern.
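Concretely, a sketch of that elementwise pattern in WGSL, embedded as a TypeScript string the way browser runtimes typically ship kernels. The transform here is a ReLU, chosen only as a stand-in; a CPU reference implementation of the same transform is included, since validating GPU kernels against a CPU reference is standard practice.

```typescript
// Elementwise WGSL kernel: each invocation handles one index i.
const reluKernel = /* wgsl */ `
@group(0) @binding(0) var<storage, read> input : array<f32>;
@group(0) @binding(1) var<storage, read_write> output : array<f32>;

@compute @workgroup_size(256)
fn main(@builtin(global_invocation_id) gid : vec3<u32>) {
  let i = gid.x;
  if (i < arrayLength(&input)) {
    // Read one slice, apply the transform, write one slice.
    output[i] = max(input[i], 0.0);
  }
}`;

// CPU reference for the same transform, for checking GPU output.
function reluRef(x: Float32Array): Float32Array {
  return x.map((v) => Math.max(v, 0));
}
```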
Bandwidth-bound kernel
A kernel is bandwidth-bound when runtime is dominated by moving bytes between GPU memory and the compute units rather than by arithmetic throughput. Elementwise chains and many badly structured tensor programs are bandwidth-bound; large tiled GEMMs are often closer to compute-bound.
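One way to make this definition operational is the standard roofline-style check: compare a kernel's arithmetic intensity in FLOPs per byte against the device's compute-to-bandwidth ratio. The device numbers below are illustrative, not measurements of any particular GPU.

```typescript
// A kernel is bandwidth-bound when its FLOPs-per-byte ratio falls below the
// device's peak-FLOPs-to-bandwidth ratio (the "machine balance").
function isBandwidthBound(
  flops: number,
  bytesMoved: number,
  peakFlopsPerSec: number,
  bandwidthBytesPerSec: number,
): boolean {
  return flops / bytesMoved < peakFlopsPerSec / bandwidthBytesPerSec;
}

// Illustrative device: 10 TFLOP/s at 320 GB/s -> balance of ~31 FLOPs/byte.
const PEAK = 10e12;
const BW = 320e9;

// Elementwise f32 add: 1 FLOP per 12 bytes (two reads, one write).
const addBound = isBandwidthBound(1, 12, PEAK, BW); // true: bandwidth-bound

// A well-tiled GEMM can reach hundreds of FLOPs per byte of global traffic.
const gemmBound = isBandwidthBound(200, 1, PEAK, BW); // false: compute-bound
```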
Main Propositions
Memory Passes Give a Hard Lower Bound
Statement
If a kernel chain performs $k$ full passes over a tensor of size $S$ bytes on a device with achievable memory bandwidth $B$, then its runtime is lower-bounded by

$$T \ge \frac{kS}{B}.$$
No instruction scheduling trick can beat this lower bound unless the program reduces either the number of passes or the number of bytes moved.
Intuition
Before you count FLOPs, you must physically move the tensor through memory. Every extra read or write is rent paid to bandwidth. If your ML operator chain touches the same tensor five times, you have already lost a large constant factor before math becomes the bottleneck.
Proof Sketch
Each full pass moves at least $S$ bytes. With $k$ passes the total transferred volume is at least $kS$. A device sustaining bandwidth $B$ bytes per second cannot move that data in less than $kS/B$ seconds.
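The bound is a one-line computation. A sketch with illustrative numbers (the tensor size, pass count, and bandwidth are made up for the example):

```typescript
// Lower bound in seconds for k full passes over S bytes at B bytes/sec.
function memoryLowerBoundSec(
  k: number,
  sBytes: number,
  bBytesPerSec: number,
): number {
  return (k * sBytes) / bBytesPerSec;
}

// Example: 5 passes over a 256 MB tensor at 400 GB/s.
const t = memoryLowerBoundSec(5, 256e6, 400e9); // 0.0032 s, i.e. 3.2 ms
```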
Why It Matters
This is the systems reason kernel fusion matters so much in browser ML. The WebGPU API can be perfectly correct and still feel slow if the program keeps materializing intermediate tensors.
Fusion Wins by Avoiding Intermediate Materialization
Statement
Suppose an operator chain applies $n$ elementwise transforms to a tensor of size $S$ bytes. If implemented as $n$ separate kernels, the chain performs roughly $2nS$ bytes of global traffic: one read and one write per pass. If the chain is fused into one kernel, the traffic drops to roughly $2S$ bytes: one read of the input and one write of the output.
Hence, in the bandwidth-bound regime, fusion improves runtime by about a factor of $n$, up to occupancy and register-pressure limits.
Intuition
The browser does not care that your unfused sequence is mathematically simple. It only sees repeated global loads and stores. Fusion helps because the intermediate values stay in registers or local workgroup memory instead of going all the way back to global memory.
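The traffic accounting behind this proposition can be written out directly. A sketch, using a four-op chain over a 64 MB tensor as the worked example:

```typescript
// Global memory traffic in bytes for a chain of n elementwise ops over S bytes.
function unfusedTrafficBytes(n: number, sBytes: number): number {
  return 2 * n * sBytes; // each of the n kernels reads S and writes S
}

function fusedTrafficBytes(sBytes: number): number {
  return 2 * sBytes; // one read of the input, one write of the output
}

// Four elementwise ops over a 64 MB activation tensor.
const nOps = 4;
const S = 64 * 2 ** 20;

// In the bandwidth-bound regime, the traffic ratio is the speedup: factor n.
const speedup = unfusedTrafficBytes(nOps, S) / fusedTrafficBytes(S); // 4
```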
Why It Matters
This is why serious browser runtimes do not stop at "WebGPU support." They need compiler or hand-written kernel logic. Otherwise, the API exists but the throughput never becomes good enough for useful model sizes.
Failure Mode
Fusion is not free. Very aggressive fusion can increase register pressure, reduce occupancy, and create kernels that are harder to compile or cache. The real systems problem is balancing memory traffic against resource pressure.
Where Browser ML Sits Today
Three layers matter in practice:
- The standards layer. The GPU for the Web group defines WebGPU and WGSL. This layer fixes the portable execution model.
- The runtime layer. Libraries such as ONNX Runtime Web, WebLLM, and Transformers.js expose actual model-loading and inference interfaces.
- The kernel layer. Someone still has to write or generate the WGSL code for matmuls, normalization, attention, sampling, and caching.
That last point is the one people often miss. WebGPU is not "CUDA in a tab." It is a portable API. The quality of the runtime depends on whether the library has good kernels, good scheduling, and good memory planning.
Why This Matters For Future Labs
The moment we want any of the following, we are really asking for WebGPU work:
- a browser-side transformer that supports residual-stream interventions,
- a Triton-style kernel playground lowered to WGSL,
- an in-browser sparse autoencoder trainer,
- a differentiable renderer or Gaussian splatting artifact,
- or even a reliable browser benchmark for mixed precision.
So WebGPU is not a side topic. It is the infrastructure page that explains why some roadmap items are normal engineering and others are compiler projects.
Common Confusions
WebGPU is not a machine-learning framework
WebGPU is an API for GPU work. It does not give you model loading, tokenization, KV caches, automatic differentiation, or optimized kernels by itself. Those have to be built on top.
Browser GPU compute is real, but launch overhead still matters
For tiny tensors or fragmented operator chains, browser-side GPU dispatch can be slower than well-optimized CPU or WebAssembly code. WebGPU wins when the kernel is big enough and the memory traffic is structured well enough.
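A rough cost model makes this concrete: per-dispatch time is a fixed overhead plus traffic time, so tiny tensors are overhead-dominated while large ones amortize it. The overhead and bandwidth figures below are illustrative assumptions, not measured values for any browser or GPU.

```typescript
// Rough per-dispatch time model: fixed launch overhead plus bytes / bandwidth.
function dispatchTimeSec(
  bytes: number,
  bwBytesPerSec: number,
  overheadSec: number,
): number {
  return overheadSec + bytes / bwBytesPerSec;
}

const BW = 320e9;       // 320 GB/s, illustrative
const OVERHEAD = 20e-6; // 20 microseconds per dispatch, illustrative

// 4 KB tensor: the fixed overhead dwarfs the transfer time.
const tiny = dispatchTimeSec(4096, BW, OVERHEAD);

// 64 MB tensor: transfer time dominates and the overhead is amortized.
const big = dispatchTimeSec(64 * 2 ** 20, BW, OVERHEAD);
```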
WGSL is not just shader syntax trivia
WGSL is the language where the actual compute kernel semantics live: types, address spaces, workgroup behavior, synchronization, and memory layout. If the kernel is wrong in WGSL, the ML runtime above it is wrong no matter how elegant the JavaScript wrapper looks.
Exercises
Problem
A browser ML pipeline applies layer norm, bias add, GELU, and dropout as four separate elementwise WebGPU kernels over a 64 MB activation tensor. If the device sustains 320 GB/s of memory bandwidth, what is the bandwidth lower bound for the unfused chain and for a fully fused chain?
Problem
Why can a naive browser attention implementation be correct but still too slow for useful sequence lengths?
Problem
Suppose you want a browser demo that trains a 100M-parameter language model locally. Which missing pieces are directly about WebGPU, and which are about everything above WebGPU?
References
- GPU for the Web Working Group, WebGPU, Editor's Draft, accessed April 25, 2026. The core specification.
- GPU for the Web Working Group, WebGPU Shading Language (WGSL), Editor's Draft, accessed April 25, 2026. The kernel language itself.
- GPU for the Web Working Group, WebGPU Explainer, accessed April 25, 2026. Best short systems overview of the intended execution model.
- Charlie F. Ruan et al., WebLLM: A High-Performance In-Browser LLM Inference Engine, arXiv, revised April 13, 2026. Strongest current primary source on browser LLM inference.
- Microsoft, Using WebGPU in ONNX Runtime Web, accessed April 25, 2026. Practical runtime-layer reference.
- Hugging Face, Transformers.js Documentation, accessed April 25, 2026. Current high-level browser model API reference.
- Daniel Smilkov et al., TensorFlow.js: Machine Learning for the Web and Beyond, arXiv, 2019. Useful historical baseline for how browser ML evolved before WebGPU became central.
Next Topics
If WebGPU is the substrate, the next questions are about what we actually build on top of it:
- Fused Kernels for the memory-traffic side,
- Mixed Precision Training for numeric budget,
- and 3D Gaussian Splatting for a graphics-and-ML artifact where the pipeline matters as much as the math.
Last reviewed: April 25, 2026
Prerequisites
Foundations this topic depends on.
- Computer Architecture for ML (Layer 2)
- Floating-Point Arithmetic (Layer 0A)
- Automatic Differentiation (Layer 1)
- The Jacobian Matrix (Layer 0A)