

WebGPU for Machine Learning

WebGPU gives the browser an explicit GPU compute model: devices, queues, buffers, and WGSL kernels. That is the missing substrate for serious in-browser inference, custom kernels, and eventually browser-native training systems.

Advanced · Tier 2 · Current · ~55 min

Why This Matters

If we want real browser-side ML artifacts, WebGPU is the substrate. It is the difference between a browser that can display model outputs and a browser that can actually run kernels for matmuls, attention, sampling, and rendering.

Older browser ML stacks had to tunnel through APIs that were designed for graphics rather than compute. WebGPU changes that. You get explicit buffers, command queues, compute pipelines, workgroups, and a shading language called WGSL designed for secure GPU execution on the web. That is why modern browser runtimes such as WebLLM and ONNX Runtime Web can do more than toy demos.

This matters directly for the roadmap we have been discussing:

  • browser-side diffusion models,
  • browser inference over open-weight models,
  • kernel playgrounds and roofline demonstrations,
  • 3D Gaussian Splatting,
  • and eventually more ambitious artifacts such as activation engineering or training-scale simulations.

WebGPU is the browser layer where ML stops pretending to be graphics

The point is explicit compute: buffers, workgroups, command queues, and kernels written in WGSL. That is the substrate behind serious browser inference, custom kernels, and eventually training.

[Figure: the browser ML stack, with runtimes such as ONNX Runtime Web, WebLLM, Transformers.js, and custom kernels built on top of WebGPU.]

  • TypeScript / JS app: tokenizer, scheduler, UI, model loader
  • WebGPU API: device, queue, buffers, bind groups
  • WGSL kernels: matmul, softmax, layernorm, sampling
  • Metal / Vulkan / D3D12: native backend chosen by the browser

The web stack becomes viable only when the kernels are large enough to amortize upload cost and dispatch overhead, and fused enough to avoid intermediate materialization.

What WebGPU adds

Compute shaders, explicit buffers, command encoding, and predictable memory movement. That is why browser ML no longer has to pretend graphics APIs are tensor runtimes.

Where the cost still lives

Compilation latency, upload bandwidth, dispatch overhead, and VRAM pressure still dominate small or badly fused workloads. Browser GPU compute is real; it is not magic.

The systems lens

If a kernel chain makes $k$ full passes over a tensor of $M$ bytes on a device with bandwidth $B$, then runtime is lower-bounded by roughly $kM/B$ before arithmetic even enters the conversation.

Mental Model

Think of WebGPU as the browser's safe wrapper around modern native GPU APIs. Your JavaScript or TypeScript code does not directly "run on the GPU." It allocates buffers, compiles WGSL shader modules, describes a compute pipeline, records commands, and submits those commands to a device queue. The GPU then executes the kernels in parallel.
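To make that concrete, here is a minimal end-to-end sketch. It is illustrative rather than production code: the in-place doubling kernel, the buffer size, and the workgroup size are placeholder choices, and error handling is reduced to a single check.

```ts
// Minimal WebGPU compute flow: device, buffer, WGSL module, pipeline,
// command encoding, queue submission. All concrete choices are illustrative.
const shaderCode = /* wgsl */ `
@group(0) @binding(0) var<storage, read_write> data : array<f32>;

@compute @workgroup_size(64)
fn main(@builtin(global_invocation_id) gid : vec3<u32>) {
  if (gid.x < arrayLength(&data)) {
    data[gid.x] = data[gid.x] * 2.0; // placeholder kernel: double in place
  }
}
`;

const adapter = await navigator.gpu.requestAdapter();
if (!adapter) throw new Error("WebGPU not available");
const device = await adapter.requestDevice();

// Allocate a GPU buffer and upload input data through the queue.
const input = new Float32Array(1024);
const buffer = device.createBuffer({
  size: input.byteLength,
  usage: GPUBufferUsage.STORAGE | GPUBufferUsage.COPY_DST,
});
device.queue.writeBuffer(buffer, 0, input);

// Compile the WGSL module and describe a compute pipeline around it.
const pipeline = device.createComputePipeline({
  layout: "auto",
  compute: {
    module: device.createShaderModule({ code: shaderCode }),
    entryPoint: "main",
  },
});

// Bind the buffer, record one dispatch, and submit the commands.
const bindGroup = device.createBindGroup({
  layout: pipeline.getBindGroupLayout(0),
  entries: [{ binding: 0, resource: { buffer } }],
});
const encoder = device.createCommandEncoder();
const pass = encoder.beginComputePass();
pass.setPipeline(pipeline);
pass.setBindGroup(0, bindGroup);
pass.dispatchWorkgroups(Math.ceil(input.length / 64));
pass.end();
device.queue.submit([encoder.finish()]);
```

Note that nothing after submit blocks: the JavaScript thread only orchestrates, and the GPU executes the work asynchronously.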

That means browser ML performance is governed by the same systems questions that govern native GPU work:

  • how many bytes cross memory,
  • how many times intermediate tensors are materialized,
  • whether kernels are fused or fragmented,
  • whether the workload is large enough to amortize launch overhead,
  • and whether the implementation respects the device's workgroup and memory structure.

Formal Setup

Definition

WebGPU

WebGPU is the browser API for modern GPU graphics and compute. The core objects for ML are:

  • a device, which owns GPU resources and pipelines,
  • a queue, which submits encoded work,
  • buffers and textures, which hold data,
  • bind groups, which describe which buffers a kernel may access,
  • and compute pipelines, which pair a WGSL entry point with its layout.

The browser maps this model to native backends such as Metal, Vulkan, or Direct3D 12.

Definition

WGSL compute kernel

A WGSL compute kernel is a shader entry point that runs over a grid of invocations grouped into workgroups. In a typical elementwise ML kernel, each invocation receives an element index, reads one slice of the input buffer, applies a transform, and writes one slice of the output buffer.
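A minimal sketch of that shape, held in a TypeScript string so host code can pass it to createShaderModule (the bindings, workgroup size, and doubling transform are illustrative):

```ts
// Hypothetical elementwise kernel: one global read, one transform, one
// global write per invocation.
const kernelSource = /* wgsl */ `
@group(0) @binding(0) var<storage, read> input : array<f32>;
@group(0) @binding(1) var<storage, read_write> output : array<f32>;

@compute @workgroup_size(64)
fn main(@builtin(global_invocation_id) gid : vec3<u32>) {
  let i = gid.x;
  if (i < arrayLength(&input)) {
    output[i] = input[i] * 2.0; // illustrative transform
  }
}
`;
```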

where each invocation reads one or more buffer locations and writes results back to another buffer. Matrix multiplication, normalization, sampling, and attention all reduce to larger versions of this pattern.

Definition

Bandwidth-bound kernel

A kernel is bandwidth-bound when runtime is dominated by moving bytes between GPU memory and the compute units rather than by arithmetic throughput. Elementwise chains and many badly structured tensor programs are bandwidth bound; large tiled GEMMs are often closer to compute bound.
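A rough roofline-style consequence, with illustrative numbers: an elementwise kernel on f32 data performs about one flop for every eight bytes it moves (a 4-byte read plus a 4-byte write), so sustained bandwidth alone caps its throughput regardless of the device's compute peak.

```ts
// Bandwidth cap on an elementwise f32 kernel (all numbers illustrative).
const flopsPerByte = 1 / 8;                     // ~1 flop per 8 bytes moved
const bandwidth = 320e9;                        // assumed sustained bytes/second
const throughputCap = bandwidth * flopsPerByte; // 40e9 flops/s = 40 GFLOP/s
```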

Main Propositions

Proposition

Memory Passes Give a Hard Lower Bound

Statement

If a kernel chain performs $k$ full passes over a tensor of size $M$ bytes on a device with achievable memory bandwidth $B$, then its runtime is lower-bounded by

$$T \geq \frac{kM}{B}.$$

No instruction scheduling trick can beat this lower bound unless the program reduces either the number of passes or the number of bytes moved.

Intuition

Before you count FLOPs, you must physically move the tensor through memory. Every extra read or write is rent paid to bandwidth. If your ML operator chain touches the same tensor five times, you have already lost a large constant factor before math becomes the bottleneck.

Proof Sketch

Each full pass moves at least $M$ bytes. With $k$ passes the total transferred volume is at least $kM$ bytes. A device sustaining bandwidth $B$ bytes per second cannot move that data in less than $kM/B$ seconds.
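A numeric instance, with illustrative rather than measured numbers:

```ts
// Lower bound from the proposition: k passes over M bytes at B bytes/second
// cannot complete in less than k * M / B seconds.
function lowerBoundSeconds(k: number, bytes: number, bandwidth: number): number {
  return (k * bytes) / bandwidth;
}

// Five full passes over a 256 MiB tensor at an assumed 400 GB/s:
// 5 * 268435456 / 400e9 ≈ 3.4 ms before any arithmetic is counted.
console.log(lowerBoundSeconds(5, 256 * 2 ** 20, 400e9)); // ≈ 0.00336
```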

Why It Matters

This is the systems reason kernel fusion matters so much in browser ML. The WebGPU API can be perfectly correct and still feel slow if the program keeps materializing intermediate tensors.

Proposition

Fusion Wins by Avoiding Intermediate Materialization

Statement

Suppose an operator chain applies $k$ elementwise transforms to a tensor of size $M$ bytes. If implemented as $k$ separate kernels, the chain performs roughly $2kM$ bytes of global traffic: one read and one write per pass. If the chain is fused into one kernel, the traffic drops to roughly $2M$ bytes: one read of the input and one write of the output.

Hence, in the bandwidth-bound regime, fusion improves runtime by about a factor of $k$, up to occupancy and register-pressure limits.

Intuition

The browser does not care that your unfused sequence is mathematically simple. It only sees repeated global loads and stores. Fusion helps because the intermediate values stay in registers or local workgroup memory instead of going all the way back to global memory.
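As a sketch of what this looks like at the kernel level, the fused kernel below reads global memory once and writes once; splitting it into two kernels would round-trip the intermediate through a second storage buffer. The bias-add plus GELU pairing, the tanh-based approximation, and the element-aligned bias buffer are all illustrative choices, not a prescribed runtime design.

```ts
// Hypothetical fused bias-add + GELU kernel: the intermediate h lives in a
// register instead of going back to global memory between the two ops.
const fusedBiasGelu = /* wgsl */ `
@group(0) @binding(0) var<storage, read> x : array<f32>;
@group(0) @binding(1) var<storage, read> bias : array<f32>;
@group(0) @binding(2) var<storage, read_write> out : array<f32>;

@compute @workgroup_size(256)
fn main(@builtin(global_invocation_id) gid : vec3<u32>) {
  let i = gid.x;
  if (i >= arrayLength(&x)) { return; }
  let h = x[i] + bias[i];    // intermediate stays in a register
  let c = 0.7978845608;      // sqrt(2/pi), tanh-based GELU approximation
  out[i] = 0.5 * h * (1.0 + tanh(c * (h + 0.044715 * h * h * h)));
}
`;
```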

Why It Matters

This is why serious browser runtimes do not stop at "WebGPU support." They need compiler or hand-written kernel logic. Otherwise, the API exists but the throughput never becomes good enough for useful model sizes.

Failure Mode

Fusion is not free. Very aggressive fusion can increase register pressure, reduce occupancy, and create kernels that are harder to compile or cache. The real systems problem is balancing memory traffic against resource pressure.

Where Browser ML Sits Today

Three layers matter in practice:

  1. The standards layer. The GPU for the Web group defines WebGPU and WGSL. This layer fixes the portable execution model.
  2. The runtime layer. Libraries such as ONNX Runtime Web, WebLLM, and Transformers.js expose actual model-loading and inference interfaces.
  3. The kernel layer. Someone still has to write or generate the WGSL code for matmuls, normalization, attention, sampling, and caching.

That last point is the one people often miss. WebGPU is not "CUDA in a tab." It is a portable API. The quality of the runtime depends on whether the library has good kernels, good scheduling, and good memory planning.

Why This Matters For Future Labs

The moment we want any of the following, we are really asking for WebGPU work:

  • a browser-side transformer that supports residual-stream interventions,
  • a Triton-style kernel playground lowered to WGSL,
  • an in-browser sparse autoencoder trainer,
  • a differentiable renderer or Gaussian splatting artifact,
  • or even a reliable browser benchmark for mixed precision.

So WebGPU is not a side topic. It is the infrastructure page that explains why some roadmap items are normal engineering and others are compiler projects.

Common Confusions

Watch Out

WebGPU is not a machine-learning framework

WebGPU is an API for GPU work. It does not give you model loading, tokenization, KV caches, automatic differentiation, or optimized kernels by itself. Those have to be built on top.

Watch Out

Browser GPU compute is real, but launch overhead still matters

For tiny tensors or fragmented operator chains, browser-side GPU dispatch can be slower than well-optimized CPU or WebAssembly code. WebGPU wins when the kernel is big enough and the memory traffic is structured well enough.
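One way to see this empirically is a coarse wall-clock probe, sketched below under the assumption that the device and encoded work already exist. Timing through onSubmittedWorkDone includes browser queue and scheduling overhead, which for tiny dispatches is exactly the cost being measured.

```ts
// Hypothetical overhead probe: submit encoded GPU work and wait for it to
// complete. For small workloads the result is dominated by dispatch and
// queue overhead rather than by kernel execution time.
async function timeSubmission(
  device: GPUDevice,
  record: (encoder: GPUCommandEncoder) => void,
): Promise<number> {
  const encoder = device.createCommandEncoder();
  record(encoder);
  const start = performance.now();
  device.queue.submit([encoder.finish()]);
  await device.queue.onSubmittedWorkDone();
  return performance.now() - start;
}
```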

Watch Out

WGSL is not just shader syntax trivia

WGSL is the language where the actual compute kernel semantics live: types, address spaces, workgroup behavior, synchronization, and memory layout. If the kernel is wrong in WGSL, the ML runtime above it is wrong no matter how elegant the JavaScript wrapper looks.

Exercises

Exercise · Core

Problem

A browser ML pipeline applies layer norm, bias add, GELU, and dropout as four separate elementwise WebGPU kernels over a 64 MB activation tensor. If the device sustains 320 GB/s of memory bandwidth, what is the bandwidth lower bound for the unfused chain and for a fully fused chain?

Exercise · Advanced

Problem

Why can a naive browser attention implementation be correct but still too slow for useful sequence lengths?

Exercise · Research

Problem

Suppose you want a browser demo that trains a 100M-parameter language model locally. Which missing pieces are directly about WebGPU, and which are about everything above WebGPU?


Next Topics

If WebGPU is the substrate, the next questions are about what we actually build on top of it.

Last reviewed: April 25, 2026
