Tutorial · WebGPU · WGSL · GPGPU
📅 July 2026 ⏱ ≈ 22 min 🎯 Advanced

Compute Shaders: Massively Parallel Simulation

WebGPU's compute pipeline lets you run arbitrary parallel programs directly on the GPU, outside the graphics pipeline entirely. No vertices, no rasterization, no framebuffers — just a grid of threads reading and writing storage buffers. This is the API that finally makes GPGPU (general-purpose GPU computing) a first-class citizen on the web, and it changes how you should think about browser-based simulation.

1. Why Compute Shaders, Not Fragment Shaders

Before WebGPU, browser GPGPU meant abusing the fragment shader: encode simulation state as pixels in a floating-point texture, render a full-screen quad, and decode the output texture as your next state (the render-to-texture technique; WebGL2 transform feedback is a related workaround that runs the update in the vertex stage). It works, but it forces every computation through the rasterizer, wastes bandwidth on texture sampling logic you don't need, and caps you at 2D grids of texels.

A compute shader is a general-purpose kernel, much like a CUDA kernel, that reads and writes raw storage buffers: flat arrays of structured data, the counterpart of an SSBO in Vulkan/OpenGL. You dispatch a 1D, 2D, or 3D grid of threads directly, with no rasterization step in between. Each thread gets a global_invocation_id it uses to index into the buffer, computes something, and writes the result back. This maps far more naturally onto N-body physics, cloth solvers, fluid simulation, sorting, and neural network inference than the texture-quad trick ever did.

Availability: WebGPU ships by default in Chrome and Edge (113+), and Firefox and Safari have shipped or are shipping support as of 2025–2026. Always feature-detect with 'gpu' in navigator and fall back to a WebGL2 transform-feedback path for older browsers (see §10).

2. Requesting a GPU Device

Everything in WebGPU starts from an asynchronous handshake: request an adapter (a physical or virtual GPU), then request a device (a logical connection you actually issue commands through):

async function initGPU() {
  if (!('gpu' in navigator)) {
    throw new Error('WebGPU not supported in this browser');
  }

  const adapter = await navigator.gpu.requestAdapter({
    powerPreference: 'high-performance',
  });
  if (!adapter) throw new Error('No suitable GPU adapter found');

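  // Requesting more than the adapter supports makes requestDevice() reject,
  // so clamp the wanted size to the adapter's actual limit.
  const device = await adapter.requestDevice({
    requiredLimits: {
      maxStorageBufferBindingSize: Math.min(
        512 * 1024 * 1024, // 512 MiB
        adapter.limits.maxStorageBufferBindingSize,
      ),
    },
  });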

  device.lost.then(info => {
    console.error(`GPU device lost: ${info.message}`);
  });

  return device;
}

Unlike WebGL's synchronous getContext('webgl2'), this is a promise-based negotiation — the browser may need to spin up a separate GPU process and validate limits before handing you a device. Request the device once and reuse it for the app's lifetime.

3. Storage Buffers and Bind Groups

A storage buffer is a raw block of GPU memory your shader can read and, critically, write — unlike a uniform buffer, which is read-only and capped much smaller (typically 64 KB vs. hundreds of MB for storage). Buffers are created with explicit usage flags:

const PARTICLE_COUNT = 500_000;
const FLOATS_PER_PARTICLE = 8; // pos.xyz, pad, vel.xyz, life (vec3 is 16-byte aligned)

const particleBuffer = device.createBuffer({
  size: PARTICLE_COUNT * FLOATS_PER_PARTICLE * 4, // 4 bytes per f32
  usage: GPUBufferUsage.STORAGE | GPUBufferUsage.COPY_DST,
  mappedAtCreation: true, // write initial data before first use
});
new Float32Array(particleBuffer.getMappedRange()).set(initialData);
particleBuffer.unmap();

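// Small uniform buffer for the per-frame parameters (dt, time, noiseScale)
const uniformBuffer = device.createBuffer({
  size: 16, // three f32 values, rounded up to 16 bytes
  usage: GPUBufferUsage.UNIFORM | GPUBufferUsage.COPY_DST,
});

// A bind group layout declares what a shader stage can access.
// Bindings 0 and 1 match the @binding indices used by the shader in §4.
const bindGroupLayout = device.createBindGroupLayout({
  entries: [
    { binding: 0, visibility: GPUShaderStage.COMPUTE, buffer: { type: 'storage' } },
    { binding: 1, visibility: GPUShaderStage.COMPUTE, buffer: { type: 'uniform' } },
  ],
});

// A bind group is the concrete set of resources bound to that layout
const bindGroup = device.createBindGroup({
  layout: bindGroupLayout,
  entries: [
    { binding: 0, resource: { buffer: particleBuffer } },
    { binding: 1, resource: { buffer: uniformBuffer } },
  ],
});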

Ping-pong buffers: a compute shader can read and write the same buffer in place if the access pattern is per-element (each thread only touches its own particle), which is simpler than the double-buffer ping-pong required by fragment-shader GPGPU. You only need two buffers when a thread must read neighbors that other threads may concurrently overwrite (e.g. spatial-hash neighbor queries).
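
For the neighbor-reading case, the two-buffer swap could be sketched roughly as follows. The helper name makePingPong is hypothetical, and the strings below are placeholders; in a real app they would be two GPUBindGroup objects, one binding buffer A as input and buffer B as output, the other reversed.

```javascript
// Hypothetical helper: alternates between two bind groups each step.
// groupAB reads buffer A and writes buffer B; groupBA does the reverse.
function makePingPong(groupAB, groupBA) {
  let flipped = false;
  return {
    // Bind group to use for the current step.
    current() {
      return flipped ? groupBA : groupAB;
    },
    // Call once after each dispatch so the next step reads what was just written.
    swap() {
      flipped = !flipped;
    },
  };
}

// Usage with placeholder values standing in for GPUBindGroup objects:
const pingPong = makePingPong('A->B', 'B->A');
const order = [];
for (let step = 0; step < 3; step++) {
  order.push(pingPong.current()); // in a real loop: pass.setBindGroup(0, pingPong.current())
  pingPong.swap();
}
console.log(order.join(', ')); // A->B, B->A, A->B
```

Keeping the swap in one place means the frame loop never has to track which buffer currently holds the latest state.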

4. WGSL: The Shading Language

WGSL (WebGPU Shading Language) is statically typed, has explicit read/read_write access modes on storage bindings, and — unlike GLSL — requires you to declare struct layouts explicitly matching your buffer's memory layout:

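struct Particle {
  pos  : vec3<f32>, // offset 0  (12 bytes, then 4 bytes of implicit padding)
  vel  : vec3<f32>, // offset 16 (vec3 is 16-byte aligned)
  life : f32,       // offset 28 (fills the slot after vel); struct size is 32
}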

@group(0) @binding(0) var<storage, read_write> particles : array<Particle>;

struct Uniforms {
  dt         : f32,
  time       : f32,
  noiseScale : f32,
}
@group(0) @binding(1) var<uniform> u : Uniforms;

@compute @workgroup_size(64)
fn main(@builtin(global_invocation_id) gid : vec3<u32>) {
  let i = gid.x;
  if (i >= arrayLength(&particles)) { return; } // guard against overhang

  var p = particles[i];
  // curlNoise() is assumed to be defined elsewhere in this shader module
  let curl = curlNoise(p.pos * u.noiseScale + u.time * 0.1);
  p.vel += curl * u.dt;
  p.pos += p.vel * u.dt;
  p.life -= u.dt;

  if (p.life <= 0.0) {
    p.pos  = vec3<f32>(0.0);
    p.vel  = vec3<f32>(0.0);
    p.life = 5.0;
  }
  particles[i] = p;
}
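
Memory layout gotcha: WGSL aligns vec3<f32> to 16 bytes (as std140 does) but gives it a size of only 12, so a following f32 can occupy the spare 4 bytes. A struct of vec3, vec3, f32 therefore takes exactly 32 bytes: pos at offset 0, 4 bytes of implicit padding, vel at offset 16, and life at offset 28. Appending an explicit trailing f32 pad does not keep it at 32; that field lands at offset 32 and the struct size, rounded up to the 16-byte struct alignment, grows to 48. Whatever layout the compiler derives, your JS-side ArrayBuffer must mirror it exactly. Mismatches here are the most common source of silently garbled compute output.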
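
The layout rules can be checked mechanically before any data is uploaded. The following is a minimal sketch that assumes only f32 scalar and vector members in a storage buffer. The helper name layoutStruct and the WGSL_TYPES table are hypothetical, and the extra constraints of the uniform address space are not modeled.

```javascript
// Alignment and size in bytes for a few WGSL types (values from the WGSL spec).
const WGSL_TYPES = {
  f32:   { align: 4,  size: 4 },
  vec2f: { align: 8,  size: 8 },
  vec3f: { align: 16, size: 12 },
  vec4f: { align: 16, size: 16 },
};

const roundUp = (multiple, n) => Math.ceil(n / multiple) * multiple;

// Hypothetical helper: computes member byte offsets and total struct size.
// `members` is an array of [name, typeKey] pairs in declaration order.
function layoutStruct(members) {
  let cursor = 0;      // first free byte after the previous member
  let structAlign = 1; // a struct's alignment is its largest member alignment
  const offsets = {};
  for (const [name, type] of members) {
    const { align, size } = WGSL_TYPES[type];
    cursor = roundUp(align, cursor);
    offsets[name] = cursor;
    cursor += size;
    structAlign = Math.max(structAlign, align);
  }
  return { offsets, size: roundUp(structAlign, cursor) };
}

const packed = layoutStruct([['pos', 'vec3f'], ['vel', 'vec3f'], ['life', 'f32']]);
console.log(packed.offsets, packed.size); // { pos: 0, vel: 16, life: 28 } 32

const padded = layoutStruct([
  ['pos', 'vec3f'], ['vel', 'vec3f'], ['life', 'f32'], ['_pad', 'f32'],
]);
console.log(padded.size); // 48
```

Dividing each byte offset by 4 gives the index to use in a Float32Array view: floats 0 to 2 for pos, 4 to 6 for vel, and 7 for life.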

5. Workgroups and Dispatch Dimensions

A compute kernel is invoked as a grid of workgroups, and each workgroup is itself a grid of invocations (threads) whose size you declare in the shader via @workgroup_size(x, y, z). The total invocation count is workgroup_size × dispatch_count:

total_threads = (workgroup_size.x · workgroup_size.y · workgroup_size.z) × (dispatchX · dispatchY · dispatchZ)

For 500,000 particles with workgroup_size(64):
dispatchX = ceil(500000 / 64) = 7813
7813 × 64 = 500,032 threads launched (32 idle, guarded by the arrayLength() check in the shader)

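const WORKGROUP_SIZE = 64; // must match @workgroup_size in the shader
const dispatchX = Math.ceil(PARTICLE_COUNT / WORKGROUP_SIZE);

// `pipeline` is the GPUComputePipeline built in §6
const encoder = device.createCommandEncoder();
const pass = encoder.beginComputePass();
pass.setPipeline(pipeline);
pass.setBindGroup(0, bindGroup);
pass.dispatchWorkgroups(dispatchX); // 1D dispatch
pass.end();
device.queue.submit([encoder.finish()]);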

Choosing workgroup_size matters: 64 is a safe, portable default because it is a whole multiple of both the 32-wide warps (NVIDIA) and the 64-wide wavefronts (AMD) that actually execute in lockstep on real hardware. Sizes below 32 waste lanes. At the other end, WebGPU's default maxComputeInvocationsPerWorkgroup limit is 256, so anything larger needs an explicitly raised limit, and large workgroups can hurt occupancy by exhausting per-workgroup shared memory and register budgets. Benchmark 64, 128, and 256 for your specific kernel; the optimum is workload- and GPU-dependent.
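
The dispatch arithmetic above stops working once the workgroup count exceeds the per-dimension limit, which is 65535 by default (maxComputeWorkgroupsPerDimension). A minimal sketch of folding a large 1D problem into a 2D dispatch follows; the helper name dispatchDims is hypothetical, and it does not handle the rarer case where a third dimension would be needed.

```javascript
// Hypothetical helper: number of workgroups needed to cover `count` elements,
// folded into two dimensions when one dimension would exceed the device limit.
// 65535 is WebGPU's default maxComputeWorkgroupsPerDimension.
function dispatchDims(count, workgroupSize, maxPerDim = 65535) {
  const groups = Math.ceil(count / workgroupSize);
  if (groups <= maxPerDim) return [groups, 1];
  const y = Math.ceil(groups / maxPerDim);
  const x = Math.ceil(groups / y); // spread evenly so fewer threads sit idle
  return [x, y];
}

console.log(dispatchDims(500_000, 64));    // [ 7813, 1 ]
console.log(dispatchDims(10_000_000, 64)); // [ 52084, 3 ]
```

The result can be passed as pass.dispatchWorkgroups(x, y). With a 2D dispatch the shader has to rebuild the flat index itself, for example as gid.x + gid.y * (num_workgroups.x * 64u) using the num_workgroups builtin, and the arrayLength() guard still covers the overhang.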

6. Building the Compute Pipeline

A GPUComputePipeline binds a shader module's entry point to a fixed pipeline layout (the ordered list of bind group layouts the shader expects). Once built, dispatching it each frame is cheap, because the expensive shader compilation happens once, at pipeline creation:

const shaderModule = device.createShaderModule({ code: wgslSource });

const pipeline = device.createComputePipeline({
  layout: device.createPipelineLayout({
    bindGroupLayouts: [bindGroupLayout],
  }),
  compute: { module: shaderModule, entryPoint: 'main' },
});

function frame() {
  const encoder = device.createCommandEncoder();
  const pass = encoder.beginComputePass();
  pass.setPipeline(pipeline);
  pass.setBindGroup(0, bindGroup);
  pass.dispatchWorkgroups(dispatchX);
  pass.end();
  device.queue.submit([encoder.finish()]);
  requestAnimationFrame(frame);
}

Note there's no await in the frame loop: queue.submit() is fire-and-forget — the GPU executes asynchronously and you never block the main thread waiting for compute to finish, unlike a naive CPU simulation loop.
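
The frame loop above never refreshes the shader's dt and time values. One way to do that each frame is sketched below. It assumes a uniformBuffer created with GPUBufferUsage.UNIFORM | GPUBufferUsage.COPY_DST and bound at binding 1; the helper names packUniforms and updateUniforms and the state object are hypothetical.

```javascript
// Packs the shader's Uniforms struct (dt, time, noiseScale: three f32 values).
// A fourth float is appended so the upload is a round 16 bytes; the struct
// itself is 12 bytes, so the extra float is just unused space in the buffer.
function packUniforms(dt, time, noiseScale) {
  return new Float32Array([dt, time, noiseScale, 0]);
}

// Hypothetical per-frame update, called before encoding the compute pass.
// `nowMs` is the timestamp that requestAnimationFrame passes to its callback.
function updateUniforms(device, uniformBuffer, state, nowMs) {
  // Clamp dt so a long stall (e.g. a background tab) does not explode the simulation.
  const dt = Math.min((nowMs - state.lastMs) / 1000, 1 / 30);
  state.lastMs = nowMs;
  const data = packUniforms(dt, nowMs / 1000, state.noiseScale);
  device.queue.writeBuffer(uniformBuffer, 0, data);
  return dt;
}
```

queue.writeBuffer() is itself asynchronous from the GPU's point of view, so this keeps the loop free of awaits while still ordering the upload before the dispatch submitted after it.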

7. Case Study: 500K-Particle Flow Field

Combining everything above: a curl-noise flow field advecting half a million particles, rendered directly from the same storage buffer using WebGPU's render pipeline (no CPU readback needed — the vertex shader reads particles[vertexIndex] directly):

async function setupFlowField(device) {
  const N = 500_000;
  const initial = new Float32Array(N * 8);
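  for (let i = 0; i < N; i++) {
    const o = i * 8; // layout per particle: pos.xyz, pad, vel.xyz, life
    initial[o + 0] = (Math.random() - 0.5) * 10; // pos.x
    initial[o + 1] = (Math.random() - 0.5) * 10; // pos.y
    initial[o + 2] = (Math.random() - 0.5) * 10; // pos.z
    // o + 3 is alignment padding; o + 4..6 is velocity, left at zero
    initial[o + 7] = Math.random() * 5; // life (byte offset 28 in the WGSL struct)
  }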

  const buffer = device.createBuffer({
    size: initial.byteLength,
    usage: GPUBufferUsage.STORAGE | GPUBufferUsage.VERTEX,
    mappedAtCreation: true,
  });
  new Float32Array(buffer.getMappedRange()).set(initial);
  buffer.unmap();
  return buffer;
}

On a mid-range discrete GPU this kernel updates and renders 500,000 particles in well under 2 ms per frame — versus tens of milliseconds for the equivalent CPU loop in JavaScript, and it scales near-linearly with GPU core count rather than being capped by single-thread JS performance. The same architecture generalizes directly to boids flocking, SPH fluid neighbor sums, and cloth constraint solving — anything expressible as "for each element, read some state, write new state."
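
When debugging a kernel like this, it can help to have a CPU reference for a single particle to compare against data copied back from the GPU. The sketch below mirrors the update order of the WGSL kernel in §4. The function name stepParticleCPU is hypothetical, the 8-float stride assumes the layout pos.xyz, pad, vel.xyz, life, and the curl vector is passed in because the shader's curlNoise() is not reproduced here.

```javascript
const STRIDE = 8; // floats per particle: pos.xyz, pad, vel.xyz, life

// Hypothetical CPU reference for one particle update.
// `data` is a Float32Array of particles; `curl` is an [x, y, z] array.
function stepParticleCPU(data, i, dt, curl) {
  const o = i * STRIDE;
  for (let k = 0; k < 3; k++) {
    data[o + 4 + k] += curl[k] * dt;     // vel += curl * dt
    data[o + k] += data[o + 4 + k] * dt; // pos += vel * dt (uses the updated vel)
  }
  data[o + 7] -= dt;                     // life -= dt
  if (data[o + 7] <= 0) {                // respawn at the origin with zero velocity
    data.fill(0, o, o + STRIDE);
    data[o + 7] = 5;
  }
}

// Usage: one particle at the origin with 1 second of life left.
const one = new Float32Array(STRIDE);
one[7] = 1;
stepParticleCPU(one, 0, 0.5, [2, 0, 0]);
console.log(one[0], one[4], one[7]); // 0.5 1 0.5
```

Running this over a handful of particles and comparing against a readback is usually enough to catch layout or indexing mistakes; it is not meant as a performance path.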

8. Barriers and Read-After-Write Hazards

Threads within a workgroup can share fast on-chip workgroup memory (var<workgroup>), but reads and writes to it must be explicitly synchronized with workgroupBarrier() — without it, one thread may read stale data another thread hasn't written yet, since invocations in a workgroup do not execute in lockstep instruction-for-instruction across all vendors' hardware:

var<workgroup> tile : array<f32, 64>;

@compute @workgroup_size(64)
fn main(
  @builtin(local_invocation_id) lid : vec3<u32>,
  @builtin(global_invocation_id) gid : vec3<u32>
) {
  tile[lid.x] = particles[gid.x].pos.x;
  workgroupBarrier(); // wait: every thread must finish the write above
                       // before any thread reads tile[]

  var sum = 0.0;
  for (var j = 0u; j < 64u; j += 1u) {
    sum += tile[j]; // safe: barrier guaranteed all 64 writes landed
  }
}

Across different dispatches (e.g. two consecutive dispatchWorkgroups() calls in the same pass, or separate compute passes), WebGPU guarantees ordering automatically: a later dispatch that reads a storage buffer will always see the writes from an earlier dispatch that wrote it. You never need manual buffer barriers between dispatches the way you would in raw Vulkan; the WebGPU implementation tracks resource usage and inserts the necessary synchronization for you.
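
That cross-dispatch guarantee is what makes multi-stage kernels straightforward to encode. A minimal sketch of two dependent stages in one pass follows. The helper name encodeTwoStage is hypothetical, and it assumes both pipelines were created with the same pipeline layout so the bind group set for the first stage stays valid for the second.

```javascript
// Hypothetical helper: records two dependent dispatches in one compute pass.
// WebGPU orders them, so stage B observes every storage write made by stage A.
function encodeTwoStage(encoder, stageA, stageB, bindGroup, workgroups) {
  const pass = encoder.beginComputePass();
  pass.setPipeline(stageA);
  pass.setBindGroup(0, bindGroup);
  pass.dispatchWorkgroups(workgroups); // e.g. accumulate forces
  pass.setPipeline(stageB);            // bind group 0 carries over (same layout assumed)
  pass.dispatchWorkgroups(workgroups); // e.g. integrate positions
  pass.end();
}
```

Within a single dispatch the situation is different: invocations in other workgroups run concurrently, so the in-shader workgroupBarrier() only helps threads of the same workgroup.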

9. Performance: Occupancy and Memory Coalescing

Three things dominate compute shader throughput in practice:

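Measuring: use a GPUQuerySet with type: 'timestamp' to wrap a compute pass and read back GPU-side timing. This needs the optional 'timestamp-query' feature, which must be listed in requiredFeatures when requesting the device. Wrapping queue.submit() in performance.now() only measures CPU-side encoding and submission time, not actual GPU execution time.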
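
A rough sketch of that measurement is shown below. It assumes the device was requested with requiredFeatures: ['timestamp-query']; the function names elapsedMs and timeComputePass and the encodePass callback are hypothetical. Timestamps are reported in nanoseconds and browsers may quantize them, so treat single readings as approximate.

```javascript
// Converts two GPU timestamps (nanoseconds, as BigInt) to elapsed milliseconds.
function elapsedMs(timestamps) {
  return Number(timestamps[1] - timestamps[0]) / 1e6;
}

// Times one compute pass. `encodePass(pass)` is a hypothetical callback that
// sets the pipeline and bind group and issues the dispatch.
async function timeComputePass(device, encodePass) {
  const querySet = device.createQuerySet({ type: 'timestamp', count: 2 });
  const resolveBuffer = device.createBuffer({
    size: 16, // two 64-bit timestamps
    usage: GPUBufferUsage.QUERY_RESOLVE | GPUBufferUsage.COPY_SRC,
  });
  const readBuffer = device.createBuffer({
    size: 16,
    usage: GPUBufferUsage.COPY_DST | GPUBufferUsage.MAP_READ,
  });

  const encoder = device.createCommandEncoder();
  const pass = encoder.beginComputePass({
    timestampWrites: { querySet, beginningOfPassWriteIndex: 0, endOfPassWriteIndex: 1 },
  });
  encodePass(pass);
  pass.end();
  encoder.resolveQuerySet(querySet, 0, 2, resolveBuffer, 0);
  encoder.copyBufferToBuffer(resolveBuffer, 0, readBuffer, 0, 16);
  device.queue.submit([encoder.finish()]);

  // Mapping waits for the GPU work above to finish.
  await readBuffer.mapAsync(GPUMapMode.READ);
  const ms = elapsedMs(new BigUint64Array(readBuffer.getMappedRange()));
  readBuffer.unmap();
  querySet.destroy();
  resolveBuffer.destroy();
  readBuffer.destroy();
  return ms;
}
```

Because this awaits a readback, it belongs in a profiling path rather than the regular frame loop; averaging over many frames gives a steadier number than any single sample.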

10. Feature Detection and WebGL Fallback

Roughly a fifth of browser traffic still lacks WebGPU support as of mid-2026 (older Safari/iOS versions, some enterprise Chrome builds with the feature disabled by policy). A production simulation should detect capability and gracefully degrade to the WebGL2 transform-feedback technique:

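async function createSimBackend(canvas) {
  if ('gpu' in navigator) {
    try {
      const adapter = await navigator.gpu.requestAdapter();
      if (adapter) {
        return new WebGPUBackend(await adapter.requestDevice());
      }
    } catch (err) {
      // requestDevice() can reject even when an adapter exists
      console.warn('WebGPU initialization failed:', err);
    }
  }
  console.warn('WebGPU unavailable; falling back to WebGL2 transform feedback');
  const gl = canvas.getContext('webgl2');
  if (!gl) throw new Error('Neither WebGPU nor WebGL2 is available');
  return new WebGL2Backend(gl);
}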

Structure the simulation core behind a small interface — step(dt), getPositionBuffer(), dispose() — so the rendering and UI layers never need to know which backend is active. This is the same pattern used throughout the WebGL compute-particles tutorial on this site, and it lets you ship one simulation that runs everywhere while getting the full performance benefit of WebGPU wherever it's available.
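
To make the interface concrete, here is a minimal sketch of a caller that depends only on the three methods named above. The function name runSteps and the stand-in fakeBackend are hypothetical; a real backend would wrap a WebGPU device or a WebGL2 context behind the same methods.

```javascript
// Hypothetical driver that depends only on the three-method backend interface.
function runSteps(backend, frames, dt) {
  for (const method of ['step', 'getPositionBuffer', 'dispose']) {
    if (typeof backend[method] !== 'function') {
      throw new TypeError(`backend is missing ${method}()`);
    }
  }
  for (let i = 0; i < frames; i++) backend.step(dt);
  return backend.getPositionBuffer();
}

// Stand-in backend for illustration only.
const fakeBackend = {
  time: 0,
  step(dt) { this.time += dt; },
  getPositionBuffer() { return { label: 'positions', time: this.time }; },
  dispose() {},
};

console.log(runSteps(fakeBackend, 4, 0.25).time); // 1
```

Checking the method names up front turns a half-implemented backend into an immediate, readable error instead of a failure several frames into the simulation.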