Compute Shaders: Massively Parallel Simulation
WebGPU's compute pipeline lets you run arbitrary parallel programs directly on the GPU, outside the graphics pipeline entirely. No vertices, no rasterization, no framebuffers — just a grid of threads reading and writing storage buffers. This is the API that finally makes GPGPU (general-purpose GPU computing) a first-class citizen on the web, and it changes how you should think about browser-based simulation.
1. Why Compute Shaders, Not Fragment Shaders
Before WebGPU, browser GPGPU meant abusing the fragment shader: encode simulation state as pixels in a floating-point texture, render a full-screen quad, and read the output texture back as your next state (render-to-texture GPGPU; WebGL2 transform feedback offered a similar workaround through the vertex stage). It works, but it forces every computation through the rasterizer, spends bandwidth on texture-sampling machinery you don't need, and caps you at 2D grids of texels.
A compute shader is a general-purpose kernel that
reads and writes raw storage buffers: flat arrays
of structured data, much like an SSBO in Vulkan/OpenGL or a
global-memory array in CUDA. You dispatch a 1D, 2D, or 3D grid of threads
directly, with no rasterization step in between. Each thread gets
a global_invocation_id it uses to index into the
buffer, computes something, and writes the result back. This maps
far more naturally onto N-body physics, cloth solvers, fluid
simulation, sorting, and neural network inference than the
texture-quad trick ever did.
Check for 'gpu' in navigator and fall back to a WebGL2
transform-feedback path for older browsers (see
§10).
2. Requesting a GPU Device
Everything in WebGPU starts from an asynchronous handshake: request
an adapter (a physical or virtual GPU), then request a
device (a logical connection you actually issue
commands through):
async function initGPU() {
  if (!('gpu' in navigator)) {
    throw new Error('WebGPU not supported in this browser');
  }
  const adapter = await navigator.gpu.requestAdapter({
    powerPreference: 'high-performance',
  });
  if (!adapter) throw new Error('No suitable GPU adapter found');
  // Asking for more than the adapter supports makes requestDevice() reject,
  // so clamp the request to what this adapter actually offers.
  const wantedBytes = 512 * 1024 * 1024; // 512 MiB
  const device = await adapter.requestDevice({
    requiredLimits: {
      maxStorageBufferBindingSize: Math.min(
        wantedBytes,
        adapter.limits.maxStorageBufferBindingSize,
      ),
    },
  });
  device.lost.then(info => {
    console.error(`GPU device lost: ${info.message}`);
  });
  return device;
}
Unlike WebGL's synchronous getContext('webgl2'), this
is a promise-based negotiation — the browser may need to spin up a
separate GPU process and validate limits before handing you a
device. Request the device once and reuse it for the app's
lifetime.
3. Storage Buffers and Bind Groups
A storage buffer is a raw block of GPU memory your shader can read and, critically, write — unlike a uniform buffer, which is read-only and capped much smaller (typically 64 KB vs. hundreds of MB for storage). Buffers are created with explicit usage flags:
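const PARTICLE_COUNT = 500_000;
const FLOATS_PER_PARTICLE = 8; // pos.xyz, pad, vel.xyz, life (vec3 is 16-byte aligned)
const particleBuffer = device.createBuffer({
  size: PARTICLE_COUNT * FLOATS_PER_PARTICLE * 4, // 4 bytes per f32
  usage: GPUBufferUsage.STORAGE | GPUBufferUsage.COPY_DST,
  mappedAtCreation: true, // write initial data before first use
});
// initialData: a Float32Array of PARTICLE_COUNT * FLOATS_PER_PARTICLE values
new Float32Array(particleBuffer.getMappedRange()).set(initialData);
particleBuffer.unmap();
// Small uniform buffer for per-frame parameters (dt, time, noiseScale)
const uniformBuffer = device.createBuffer({
  size: 16, // three f32 values, rounded up to 16 bytes
  usage: GPUBufferUsage.UNIFORM | GPUBufferUsage.COPY_DST,
});
// A bind group layout declares what a shader stage can access
const bindGroupLayout = device.createBindGroupLayout({
  entries: [
    { binding: 0, visibility: GPUShaderStage.COMPUTE, buffer: { type: 'storage' } },
    { binding: 1, visibility: GPUShaderStage.COMPUTE, buffer: { type: 'uniform' } },
  ],
});
// A bind group is the concrete set of resources bound to that layout
const bindGroup = device.createBindGroup({
  layout: bindGroupLayout,
  entries: [
    { binding: 0, resource: { buffer: particleBuffer } },
    { binding: 1, resource: { buffer: uniformBuffer } },
  ],
});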
4. WGSL: The Shading Language
WGSL (WebGPU Shading Language) is statically typed, has explicit
read/read_write access modes on storage
bindings, and — unlike GLSL — requires you to declare struct
layouts explicitly matching your buffer's memory layout:
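struct Particle {
  pos  : vec3<f32>, // byte offset 0
  _pad : f32,       // byte offset 12: fills the gap left by vec3's 16-byte alignment
  vel  : vec3<f32>, // byte offset 16
  life : f32,       // byte offset 28; total stride is 32 bytes
}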
@group(0) @binding(0) var<storage, read_write> particles : array<Particle>;
struct Uniforms {
  dt : f32,
  time : f32,
  noiseScale : f32,
}
@group(0) @binding(1) var<uniform> u : Uniforms;
@compute @workgroup_size(64)
fn main(@builtin(global_invocation_id) gid : vec3<u32>) {
  let i = gid.x;
  if (i >= arrayLength(&particles)) { return; } // guard against overhang
  var p = particles[i];
  // curlNoise() is assumed to be defined elsewhere in this module
  let curl = curlNoise(p.pos * u.noiseScale + u.time * 0.1);
  p.vel += curl * u.dt;
  p.pos += p.vel * u.dt;
  p.life -= u.dt;
  if (p.life <= 0.0) {
    // Respawn expired particles at the origin
    p.pos = vec3<f32>(0.0);
    p.vel = vec3<f32>(0.0);
    p.life = 5.0;
  }
  particles[i] = p;
}
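WGSL aligns vec3<f32>
to a 16-byte boundary (same as std140) even though it holds only 12
bytes of data. A struct of vec3, vec3, f32 therefore carries 7 floats
but occupies 8×4 = 32 bytes: the second vec3 starts at byte 16, and
the f32 lands in the gap after it at byte 28. Adding a trailing f32
pad after that pushes the struct to 48 bytes, so put any explicit
padding in the gap after the first vec3 instead. Whether the padding
is explicit or implicit, your JS-side ArrayBuffer layout must mirror
it exactly. Mismatches here are the most common source of silently
garbled compute output.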
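One way to keep the JS side honest is to write particles through a small packing helper that records the offsets in one place. The sketch below assumes a 32-byte stride with pos at byte 0, vel at byte 16, and life at byte 28 (float indices 0, 4, and 7), which is where WGSL's alignment rules place vec3, vec3, f32. The names packParticles and PARTICLE_STRIDE_FLOATS are illustrative, not part of any API.

```javascript
// Assumed layout per particle (8 floats = 32 bytes):
//   [0..2] pos.xyz   [3] padding   [4..6] vel.xyz   [7] life
const PARTICLE_STRIDE_FLOATS = 8;

// Packs an array of { pos: [x, y, z], vel: [x, y, z], life } objects into a
// Float32Array ready to copy into a mapped storage buffer.
function packParticles(particles) {
  const data = new Float32Array(particles.length * PARTICLE_STRIDE_FLOATS);
  particles.forEach((p, i) => {
    const o = i * PARTICLE_STRIDE_FLOATS;
    data.set(p.pos, o);     // bytes 0..11
    data.set(p.vel, o + 4); // bytes 16..27 (vec3 starts on a 16-byte boundary)
    data[o + 7] = p.life;   // bytes 28..31
  });
  return data;
}
```

The result can be passed straight to new Float32Array(buffer.getMappedRange()).set(...) for a buffer created with mappedAtCreation: true.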
5. Workgroups and Dispatch Dimensions
A compute kernel is invoked as a grid of workgroups,
and each workgroup is itself a grid of invocations
(threads) whose size you declare in the shader via
@workgroup_size(x, y, z). The total invocation count
is workgroup_size × dispatch_count:
const WORKGROUP_SIZE = 64;
const dispatchX = Math.ceil(PARTICLE_COUNT / WORKGROUP_SIZE);
const pass = encoder.beginComputePass();
pass.setPipeline(pipeline);
pass.setBindGroup(0, bindGroup);
pass.dispatchWorkgroups(dispatchX); // 1D dispatch
pass.end();
Choosing workgroup_size matters: 64 is a safe,
portable default because it is a whole multiple of both the 32-wide
warps (NVIDIA) and the 64-wide wavefronts (AMD) that actually execute
in lockstep on real hardware. Sizes below 32 waste lanes, and larger
workgroups can hurt occupancy by exhausting per-workgroup shared
memory and register budgets. WebGPU's default limits also cap a
workgroup at 256 invocations. Benchmark 64, 128, and 256 for your
specific kernel: the optimum is workload- and GPU-dependent.
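The dispatch arithmetic above can be wrapped in a helper that also copes with element counts too large for one dimension. WebGPU's default maxComputeWorkgroupsPerDimension is 65,535, so a 1D dispatch at workgroup size 64 tops out at roughly 4.19 million elements. The sketch below spills the excess into a second dimension; dispatchSize is a hypothetical name, and the defaults are the spec's default limits rather than values queried from a device.

```javascript
// Returns [x, y] workgroup counts that together cover `count` elements.
// maxPerDim defaults to WebGPU's default maxComputeWorkgroupsPerDimension.
function dispatchSize(count, workgroupSize = 64, maxPerDim = 65535) {
  const groups = Math.ceil(count / workgroupSize);
  if (groups <= maxPerDim) return [groups, 1];
  const y = Math.ceil(groups / maxPerDim);
  if (y > maxPerDim) throw new RangeError('count too large for a 2D dispatch');
  const x = Math.ceil(groups / y); // x * y >= groups, and x <= maxPerDim
  return [x, y];
}
```

Usage would be pass.dispatchWorkgroups(...dispatchSize(PARTICLE_COUNT)). With a 2D dispatch the shader has to rebuild its linear index as gid.y × (num_workgroups.x × workgroup_size) + gid.x, and the bounds guard then discards the overhang exactly as in the 1D case.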
6. Building the Compute Pipeline
A GPUComputePipeline pairs a shader module's entry
point with a fixed pipeline layout (the list of bind group layouts
it expects). Once built, dispatching it each frame is cheap, because
the expensive shader compilation happens once, at pipeline creation:
const shaderModule = device.createShaderModule({ code: wgslSource });
const pipeline = device.createComputePipeline({
  layout: device.createPipelineLayout({
    bindGroupLayouts: [bindGroupLayout],
  }),
  compute: { module: shaderModule, entryPoint: 'main' },
});
function frame() {
  const encoder = device.createCommandEncoder();
  const pass = encoder.beginComputePass();
  pass.setPipeline(pipeline);
  pass.setBindGroup(0, bindGroup);
  pass.dispatchWorkgroups(dispatchX);
  pass.end();
  device.queue.submit([encoder.finish()]);
  requestAnimationFrame(frame);
}
Note there's no await in the frame loop:
queue.submit() is fire-and-forget — the GPU executes
asynchronously and you never block the main thread waiting for
compute to finish, unlike a naive CPU simulation loop.
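The loop above never refreshes the dt and time values the shader reads from its uniform binding. A minimal sketch of that step follows, assuming a 16-byte uniform buffer holding dt, time, and noiseScale in that order. buildUniforms is a hypothetical helper, and the 1/30 s clamp on dt is a design choice to keep the integration stable after the tab has been in the background, not something the API requires.

```javascript
// Builds the CPU-side copy of the uniform data: [dt, time, noiseScale, pad].
// Times are in milliseconds, as passed to requestAnimationFrame callbacks.
function buildUniforms(nowMs, lastMs, noiseScale, maxDt = 1 / 30) {
  const dt = Math.min((nowMs - lastMs) / 1000, maxDt);
  return new Float32Array([dt, nowMs / 1000, noiseScale, 0]);
}

// Inside the frame callback, before encoding the pass (uniformBuffer is
// assumed to have been created with UNIFORM | COPY_DST usage):
//   device.queue.writeBuffer(uniformBuffer, 0, buildUniforms(now, last, 0.5));
```

Like queue.submit(), queue.writeBuffer() returns immediately, so the frame loop still contains no await.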
7. Case Study: 500K-Particle Flow Field
Combining everything above: a curl-noise flow field advecting half
a million particles, rendered directly from the same storage
buffer using WebGPU's render pipeline (no CPU readback needed —
the vertex shader reads particles[vertexIndex]
directly):
async function setupFlowField(device) {
  const N = 500_000;
  // 8 floats (32 bytes) per particle: pos.xyz, pad, vel.xyz, life.
  // vel starts at float 4 because vec3<f32> is 16-byte aligned in WGSL.
  const initial = new Float32Array(N * 8);
  for (let i = 0; i < N; i++) {
    const o = i * 8;
    initial[o + 0] = (Math.random() - 0.5) * 10; // pos.x
    initial[o + 1] = (Math.random() - 0.5) * 10; // pos.y
    initial[o + 2] = (Math.random() - 0.5) * 10; // pos.z
    // vel (floats 4..6) starts at zero; the flow field accelerates it
    initial[o + 7] = Math.random() * 5; // life
  }
  const buffer = device.createBuffer({
    size: initial.byteLength,
    usage: GPUBufferUsage.STORAGE | GPUBufferUsage.VERTEX,
    mappedAtCreation: true,
  });
  new Float32Array(buffer.getMappedRange()).set(initial);
  buffer.unmap();
  return buffer;
}
On a mid-range discrete GPU this kernel updates and renders 500,000 particles in well under 2 ms per frame — versus tens of milliseconds for the equivalent CPU loop in JavaScript, and it scales near-linearly with GPU core count rather than being capped by single-thread JS performance. The same architecture generalizes directly to boids flocking, SPH fluid neighbor sums, and cloth constraint solving — anything expressible as "for each element, read some state, write new state."
8. Barriers and Read-After-Write Hazards
Threads within a workgroup can share fast on-chip
workgroup memory (var<workgroup>),
but reads and writes to it must be explicitly synchronized with
workgroupBarrier() — without it, one thread may read
stale data another thread hasn't written yet, since invocations in
a workgroup do not execute in lockstep instruction-for-instruction
across all vendors' hardware:
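var<workgroup> tile : array<f32, 64>;
@compute @workgroup_size(64)
fn main(
  @builtin(local_invocation_id) lid : vec3<u32>,
  @builtin(global_invocation_id) gid : vec3<u32>
) {
  // Guard the load, not the whole function: every thread in the workgroup
  // must still reach the barrier below, so an early return is not an option.
  var x = 0.0;
  if (gid.x < arrayLength(&particles)) { x = particles[gid.x].pos.x; }
  tile[lid.x] = x;
  workgroupBarrier(); // wait: every thread must finish the write above
                      // before any thread reads tile[]
  var sum = 0.0;
  for (var j = 0u; j < 64u; j += 1u) {
    sum += tile[j]; // safe: the barrier guarantees all 64 writes have landed
  }
}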
Across different dispatches (e.g. two consecutive
dispatchWorkgroups() calls in the same pass, or
separate compute passes), WebGPU guarantees ordering automatically:
a later dispatch that reads a storage buffer will always see the
writes from an earlier dispatch that wrote it. You never need
manual buffer barriers between passes the way you would in raw
Vulkan; the browser's WebGPU implementation tracks resource usage
and inserts the necessary synchronization for you.
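In practice this means a multi-stage update can be recorded as back-to-back dispatches with nothing in between. The sketch below assumes the pipelines were all built from the same explicit pipeline layout, so one bind group is valid for every stage; encodeStages and the shape of its stages parameter are hypothetical.

```javascript
// Records each stage as its own dispatch inside a single compute pass.
// stages: [{ pipeline, workgroups }]. Later stages see earlier stages' writes.
function encodeStages(encoder, bindGroup, stages) {
  const pass = encoder.beginComputePass();
  pass.setBindGroup(0, bindGroup);
  for (const { pipeline, workgroups } of stages) {
    pass.setPipeline(pipeline);
    pass.dispatchWorkgroups(workgroups);
  }
  pass.end();
}
```

A two-stage solver would call it as encodeStages(encoder, bindGroup, [{ pipeline: forcesPipeline, workgroups: n }, { pipeline: integratePipeline, workgroups: n }]), where both pipeline names are placeholders.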
9. Performance: Occupancy and Memory Coalescing
Three things dominate compute shader throughput in practice:
- Occupancy: how many workgroups can run concurrently on a compute unit, limited by shared memory and register usage per thread. Keep var<workgroup> arrays small and avoid excessive local variables in hot loops.
- Memory coalescing: adjacent threads (gid.x, gid.x+1, ...) should read/write adjacent memory addresses. A struct-of-arrays layout (separate positions[], velocities[] buffers) often coalesces better than array-of-structs for wide, uniform-access kernels, though AoS is usually easier to reason about and is fine when each thread's fields are read together.
- Divergent branching: an if that takes different paths across threads in the same warp or wavefront forces the hardware to execute both paths serially for that group of threads. Keep branches uniform where possible, or restructure with select() for cheap two-way choices.
Use a GPUQuerySet with type: 'timestamp' (this requires the
optional 'timestamp-query' device feature) to wrap a compute pass
and read back precise GPU-side timing.
performance.now() around queue.submit() only measures CPU-side
encoding time, not actual GPU execution time.
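A sketch of the full round trip, assuming the device was requested with requiredFeatures: ['timestamp-query'] (the feature is optional, so check adapter.features first). timeComputePass and its record callback are hypothetical names; the callback is expected to set the pipeline and bind group and issue the dispatches. Timestamps come back as 64-bit nanosecond values.

```javascript
// Converts two GPU timestamps (nanoseconds, as BigInt) to elapsed milliseconds.
function elapsedMs(beginNs, endNs) {
  return Number(endNs - beginNs) / 1e6;
}

// Times one compute pass and resolves to its GPU-side duration in ms.
async function timeComputePass(device, record) {
  const querySet = device.createQuerySet({ type: 'timestamp', count: 2 });
  const resolveBuffer = device.createBuffer({
    size: 16, // two 64-bit timestamps
    usage: GPUBufferUsage.QUERY_RESOLVE | GPUBufferUsage.COPY_SRC,
  });
  const readBuffer = device.createBuffer({
    size: 16,
    usage: GPUBufferUsage.COPY_DST | GPUBufferUsage.MAP_READ,
  });

  const encoder = device.createCommandEncoder();
  const pass = encoder.beginComputePass({
    timestampWrites: {
      querySet,
      beginningOfPassWriteIndex: 0,
      endOfPassWriteIndex: 1,
    },
  });
  record(pass);
  pass.end();
  encoder.resolveQuerySet(querySet, 0, 2, resolveBuffer, 0);
  encoder.copyBufferToBuffer(resolveBuffer, 0, readBuffer, 0, 16);
  device.queue.submit([encoder.finish()]);

  // Mapping waits for the GPU, so do this for profiling, not every frame.
  await readBuffer.mapAsync(GPUMapMode.READ);
  const [begin, end] = new BigUint64Array(readBuffer.getMappedRange());
  readBuffer.unmap();
  return elapsedMs(begin, end);
}
```

Averaging over several dozen passes is usually more informative than a single reading.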
10. Feature Detection and WebGL Fallback
Roughly a fifth of browser traffic still lacks WebGPU support as of mid-2026 (older Safari/iOS versions, some enterprise Chrome builds with the feature disabled by policy). A production simulation should detect capability and gracefully degrade to the WebGL2 transform-feedback technique:
async function createSimBackend(canvas) {
  if ('gpu' in navigator) {
    const adapter = await navigator.gpu.requestAdapter();
    if (adapter) {
      return new WebGPUBackend(await adapter.requestDevice());
    }
  }
  console.warn('WebGPU unavailable: falling back to WebGL2 transform feedback');
  const gl = canvas.getContext('webgl2'); // null if WebGL2 is unsupported too
  if (!gl) throw new Error('Neither WebGPU nor WebGL2 is available');
  return new WebGL2Backend(gl);
}
Structure the simulation core behind a small interface —
step(dt), getPositionBuffer(),
dispose() — so the rendering and UI layers never need
to know which backend is active. This is the same pattern used
throughout the WebGL compute-particles tutorial on this site, and
it lets you ship one simulation that runs everywhere while getting
the full performance benefit of WebGPU wherever it's available.
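As a sketch of what that interface might look like, here is a deliberately trivial CPU implementation. CpuBackend is hypothetical (it is not one of the backends named above) and uses plain Euler integration with no flow field. Its only purpose is to show the three-method shape that a WebGPU or WebGL2 backend would also expose.

```javascript
// Hypothetical reference backend: same interface, no GPU required.
class CpuBackend {
  constructor(count) {
    this.positions = new Float32Array(count * 3); // xyz per particle
    this.velocities = new Float32Array(count * 3);
  }
  // Advance the simulation by dt seconds (explicit Euler).
  step(dt) {
    for (let i = 0; i < this.positions.length; i++) {
      this.positions[i] += this.velocities[i] * dt;
    }
  }
  // Renderers read positions through this accessor only.
  getPositionBuffer() {
    return this.positions;
  }
  // Release resources; a GPU backend would destroy its buffers here.
  dispose() {
    this.positions = null;
    this.velocities = null;
  }
}
```

Because callers only touch step(), getPositionBuffer(), and dispose(), swapping this class for a GPU-backed one requires no changes in the rendering or UI code.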