1. Why Compute Shaders, Not Fragment Shaders

Before WebGPU, browser GPGPU meant abusing the fragment shader: encode simulation state as pixels in a floating-point texture, render a full-screen quad, and read the output texture back as your next state (render-to-texture GPGPU; WebGL2 transform feedback is the vertex-stage cousin of the same trick). It works, but it forces every computation through the rasterizer, wastes bandwidth on texture sampling logic you don't need, and caps you at 2D grids of texels.

A compute shader is a general-purpose kernel, comparable to a CUDA kernel, that reads and writes raw storage buffers: flat arrays of structured data, the WebGPU counterpart of an SSBO in Vulkan/OpenGL. You dispatch a 1D, 2D, or 3D grid of threads directly, with no rasterization step in between. Each thread gets a global_invocation_id that it uses to index into the buffer, computes something, and writes the result back. This maps far more naturally onto N-body physics, cloth solvers, fluid simulation, sorting, and neural network inference than the texture-quad trick ever did.

Availability: WebGPU ships by default in Chrome and Edge (113+), and Firefox and Safari have shipped or are shipping support as of 2025–2026. Always feature-detect with 'gpu' in navigator and fall back to a WebGL2 transform-feedback path for older browsers (see §10).

2. Requesting a GPU Device

Everything in WebGPU starts from an asynchronous handshake: request an adapter (a physical or virtual GPU), then request a device (a logical connection you actually issue commands through):

async function initGPU() {
  if (!('gpu' in navigator)) {
    throw new Error('WebGPU not supported in this browser');
  }

  const adapter = await navigator.gpu.requestAdapter({
    powerPreference: 'high-performance',
  });
  if (!adapter) throw new Error('No suitable GPU adapter found');

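  // requestDevice() rejects if a required limit exceeds adapter.limits, so clamp.
  const device = await adapter.requestDevice({
    requiredLimits: {
      maxStorageBufferBindingSize: Math.min(
        512 * 1024 * 1024, // ask for up to 512 MiB
        adapter.limits.maxStorageBufferBindingSize,
      ),
    },
  });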

  device.lost.then(info => {
    console.error(`GPU device lost: ${info.message}`);
  });

  return device;
}

Unlike WebGL's synchronous getContext('webgl2'), this is a promise-based negotiation — the browser may need to spin up a separate GPU process and validate limits before handing you a device. Request the device once and reuse it for the app's lifetime.
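Because requestDevice() rejects when any required limit exceeds what the adapter reports, it can help to clamp a whole set of requested limits up front. The sketch below is one way to do that; clampRequiredLimits is a hypothetical helper (not part of WebGPU), it only handles "maximum"-style limits where a smaller value is always acceptable, and a plain object stands in for adapter.limits so the snippet runs without a GPU.

```javascript
// Hypothetical helper: clamp each requested maximum-style limit to what the
// adapter supports. Limit names the adapter does not report are skipped.
function clampRequiredLimits(requested, adapterLimits) {
  const clamped = {};
  for (const [name, want] of Object.entries(requested)) {
    const supported = adapterLimits[name];
    if (typeof supported !== 'number') continue; // unknown limit: leave it out
    clamped[name] = Math.min(want, supported);
  }
  return clamped;
}

// Plain object standing in for adapter.limits (assumed values):
const fakeAdapterLimits = { maxStorageBufferBindingSize: 128 * 1024 * 1024 };
const limits = clampRequiredLimits(
  { maxStorageBufferBindingSize: 512 * 1024 * 1024 },
  fakeAdapterLimits,
);
console.log(limits.maxStorageBufferBindingSize); // 134217728
```

In a real page the second argument would be adapter.limits, and the result would be passed as requiredLimits. After creation, device.limits reports what was actually granted.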

3. Storage Buffers and Bind Groups

A storage buffer is a raw block of GPU memory your shader can read and, critically, write — unlike a uniform buffer, which is read-only and capped much smaller (typically 64 KB vs. hundreds of MB for storage). Buffers are created with explicit usage flags:

const PARTICLE_COUNT = 500_000;
const FLOATS_PER_PARTICLE = 8; // pos.xyz, pad, vel.xyz, life (WGSL aligns vec3 to 16 bytes)

const particleBuffer = device.createBuffer({
  size: PARTICLE_COUNT * FLOATS_PER_PARTICLE * 4, // 4 bytes per f32
  usage: GPUBufferUsage.STORAGE | GPUBufferUsage.COPY_DST,
  mappedAtCreation: true, // write initial data before first use
});
new Float32Array(particleBuffer.getMappedRange()).set(initialData);
particleBuffer.unmap();

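// Small uniform buffer for the per-frame parameters the shader in §4 reads
// at binding 1 (dt, time, noiseScale)
const uniformBuffer = device.createBuffer({
  size: 16, // 3 × f32 = 12 bytes, rounded up to 16
  usage: GPUBufferUsage.UNIFORM | GPUBufferUsage.COPY_DST,
});

// A bind group layout declares what a shader stage can access
const bindGroupLayout = device.createBindGroupLayout({
  entries: [
    { binding: 0, visibility: GPUShaderStage.COMPUTE, buffer: { type: 'storage' } },
    { binding: 1, visibility: GPUShaderStage.COMPUTE, buffer: { type: 'uniform' } },
  ],
});

// A bind group is the concrete set of resources bound to that layout
const bindGroup = device.createBindGroup({
  layout: bindGroupLayout,
  entries: [
    { binding: 0, resource: { buffer: particleBuffer } },
    { binding: 1, resource: { buffer: uniformBuffer } },
  ],
});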

Ping-pong buffers: a compute shader can read and write the same buffer in place if the access pattern is per-element (each thread only touches its own particle), which is simpler than the double-buffer ping-pong required by fragment-shader GPGPU. You only need two buffers when a thread must read neighbors that other threads may concurrently overwrite (e.g. spatial-hash neighbor queries).
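For the neighbor-reading case, the bookkeeping usually amounts to tracking which of two buffers is the read side on the current step. A minimal sketch follows; createPingPong is a hypothetical helper, and strings stand in for the two GPUBuffer objects so it can run anywhere.

```javascript
// Hypothetical helper: tracks which of two resources is read from and which
// is written to on the current step.
function createPingPong(a, b) {
  let flipped = false;
  return {
    get read() { return flipped ? b : a; },
    get write() { return flipped ? a : b; },
    swap() { flipped = !flipped; }, // call once after each simulation step
  };
}

// Strings stand in for two GPUBuffer objects.
const pp = createPingPong('bufferA', 'bufferB');
console.log(pp.read, pp.write); // bufferA bufferB
pp.swap();
console.log(pp.read, pp.write); // bufferB bufferA
```

In practice you would typically create two bind groups up front (one reading A and writing B, one the other way round) and use the same flip to pick which one to set each frame, rather than rebuilding a bind group per step.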

4. WGSL: The Shading Language

WGSL (WebGPU Shading Language) is statically typed, has explicit read/read_write access modes on storage bindings, and lays out structs by fixed alignment rules, so the struct you declare must match your buffer's memory layout byte for byte:

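struct Particle {
  pos  : vec3<f32>, // offset 0  (12 bytes, then 4 bytes of implicit padding)
  vel  : vec3<f32>, // offset 16 (vec3 aligns to 16 bytes)
  life : f32,       // offset 28 (fills the slot after vel); struct size = 32
}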

@group(0) @binding(0) var<storage, read_write> particles : array<Particle>;

struct Uniforms {
  dt         : f32,
  time       : f32,
  noiseScale : f32,
}
@group(0) @binding(1) var<uniform> u : Uniforms;

@compute @workgroup_size(64)
fn main(@builtin(global_invocation_id) gid : vec3<u32>) {
  let i = gid.x;
  if (i >= arrayLength(&particles)) { return; } // guard against overhang

  var p = particles[i];
  // curlNoise(p : vec3<f32>) -> vec3<f32> is assumed to be defined elsewhere in this module
  let curl = curlNoise(p.pos * u.noiseScale + u.time * 0.1);
  p.vel += curl * u.dt;
  p.pos += p.vel * u.dt;
  p.life -= u.dt;

  if (p.life <= 0.0) {
    p.pos  = vec3<f32>(0.0);
    p.vel  = vec3<f32>(0.0);
    p.life = 5.0;
  }
  particles[i] = p;
}

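Memory layout gotcha: WGSL aligns vec3<f32> to a 16-byte boundary (as std140 does) even though the vector itself occupies only 12 bytes. A struct of vec3, vec3, f32 therefore takes 32 bytes: pos at offset 0, four bytes of implicit padding, vel at offset 16, and life at offset 28, in the slot left over after vel. Appending an explicit f32 pad after life does not round things off; it lands at offset 32 and the struct size grows to 48 bytes, because struct size is rounded up to the 16-byte alignment. Whatever layout the rules produce, your JS-side ArrayBuffer must mirror it offset for offset (here: pos.xyz, one unused float, vel.xyz, life). Mismatches here are the most common source of silently garbled compute output.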
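The offset arithmetic is mechanical enough to check in a few lines of JavaScript. The sketch below applies the WGSL alignment and size rules for a handful of f32-based types; structLayout and the WGSL_TYPES table are hypothetical names introduced for illustration, and nested structs, arrays, and matrices are deliberately left out.

```javascript
// Alignment and size in bytes for a few WGSL types.
const WGSL_TYPES = {
  'f32':       { align: 4,  size: 4 },
  'vec2<f32>': { align: 8,  size: 8 },
  'vec3<f32>': { align: 16, size: 12 },
  'vec4<f32>': { align: 16, size: 16 },
};

const roundUp = (multiple, n) => Math.ceil(n / multiple) * multiple;

// Hypothetical helper: computes each member's byte offset and the struct size.
function structLayout(members) {
  let cursor = 0;      // first free byte after the previous member
  let structAlign = 1; // struct alignment = largest member alignment
  const offsets = {};
  for (const [name, type] of members) {
    const info = WGSL_TYPES[type];
    if (!info) throw new Error(`Unsupported type: ${type}`);
    cursor = roundUp(info.align, cursor); // member starts at its own alignment
    offsets[name] = cursor;
    cursor += info.size;
    structAlign = Math.max(structAlign, info.align);
  }
  // Struct size is rounded up to the struct's alignment.
  return { offsets, size: roundUp(structAlign, cursor), align: structAlign };
}

const packed = structLayout([
  ['pos', 'vec3<f32>'], ['vel', 'vec3<f32>'], ['life', 'f32'],
]);
console.log(packed.offsets, packed.size); // { pos: 0, vel: 16, life: 28 } 32

const padded = structLayout([
  ['pos', 'vec3<f32>'], ['vel', 'vec3<f32>'], ['life', 'f32'], ['_pad', 'f32'],
]);
console.log(padded.offsets._pad, padded.size); // 32 48
```

Dividing each byte offset by 4 gives the Float32Array index to write on the JS side: pos at 0, vel at 4, life at 7 for the 32-byte variant.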

5. Workgroups and Dispatch Dimensions

A compute kernel is invoked as a grid of workgroups, and each workgroup is itself a grid of invocations (threads) whose size you declare in the shader via @workgroup_size(x, y, z). The total invocation count is workgroup_size × dispatch_count:

total_threads = (workgroup_size.x × workgroup_size.y × workgroup_size.z) × (dispatchX × dispatchY × dispatchZ)

For 500,000 particles with workgroup_size(64):

dispatchX = ceil(500000 / 64) = 7813, so 7813 × 64 = 500,032 threads are launched (32 idle, guarded by the arrayLength() check in the shader).

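const WORKGROUP_SIZE = 64; // must match @workgroup_size in the shader
const dispatchX = Math.ceil(PARTICLE_COUNT / WORKGROUP_SIZE);

// pipeline is built in §6; the encoder is created fresh for each submission
const encoder = device.createCommandEncoder();
const pass = encoder.beginComputePass();
pass.setPipeline(pipeline);
pass.setBindGroup(0, bindGroup);
pass.dispatchWorkgroups(dispatchX); // 1D dispatch
pass.end();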

Choosing workgroup_size matters: 64 is a safe, portable default because it is a whole multiple of both the 32-wide warps (NVIDIA) and the 64-wide wavefronts (AMD) that actually execute in lockstep on real hardware. Sizes below 32 waste lanes; larger sizes can hurt occupancy by exhausting per-workgroup shared memory and register budgets, and anything above 256 invocations exceeds WebGPU's default maxComputeInvocationsPerWorkgroup limit unless you request a higher one. Benchmark 64, 128, and 256 for your specific kernel; the optimum is workload- and GPU-dependent.
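The dispatch arithmetic above is easy to wrap in a small function that also reports the overhang and catches oversized 1D dispatches. This is a sketch: planDispatch is a hypothetical helper, and the 65535 default mirrors WebGPU's default maxComputeWorkgroupsPerDimension, which a real application would read from device.limits instead.

```javascript
// Hypothetical helper: computes a 1D dispatch for elementCount items.
function planDispatch(elementCount, workgroupSize, maxWorkgroupsPerDimension = 65535) {
  if (!Number.isInteger(elementCount) || elementCount < 0) {
    throw new RangeError('elementCount must be a non-negative integer');
  }
  if (!Number.isInteger(workgroupSize) || workgroupSize < 1) {
    throw new RangeError('workgroupSize must be a positive integer');
  }
  const workgroups = Math.ceil(elementCount / workgroupSize);
  if (workgroups > maxWorkgroupsPerDimension) {
    throw new RangeError(
      `${workgroups} workgroups exceeds the per-dimension limit of ${maxWorkgroupsPerDimension}`,
    );
  }
  const totalThreads = workgroups * workgroupSize;
  // idleThreads are the invocations the shader's bounds check must reject.
  return { workgroups, totalThreads, idleThreads: totalThreads - elementCount };
}

console.log(planDispatch(500_000, 64));
// { workgroups: 7813, totalThreads: 500032, idleThreads: 32 }
```

With a workgroup size of 64 and the default limit, a 1D dispatch tops out at 65535 × 64 = 4,194,240 elements; beyond that the work has to be spread over a second dispatch dimension or split into several dispatches.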

6. Building the Compute Pipeline

A GPUComputePipeline binds a shader module's entry point to a fixed bind group layout. Once built, dispatching it each frame is cheap — the expensive shader compilation happens once, at pipeline creation:

const shaderModule = device.createShaderModule({ code: wgslSource });

const pipeline = device.createComputePipeline({
  layout: device.createPipelineLayout({
    bindGroupLayouts: [bindGroupLayout],
  }),
  compute: { module: shaderModule, entryPoint: 'main' },
});

function frame() {
  const encoder = device.createCommandEncoder();
  const pass = encoder.beginComputePass();
  pass.setPipeline(pipeline);
  pass.setBindGroup(0, bindGroup);
  pass.dispatchWorkgroups(dispatchX);
  pass.end();
  device.queue.submit([encoder.finish()]);
  requestAnimationFrame(frame);
}

Note there's no await in the frame loop: queue.submit() is fire-and-forget — the GPU executes asynchronously and you never block the main thread waiting for compute to finish, unlike a naive CPU simulation loop.
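The shader's Uniforms block (dt, time, noiseScale) still has to be refreshed every frame before the dispatch. One way to build that payload is sketched below; packUniforms is a hypothetical helper, the clamp on dt is an added design choice so a backgrounded tab does not produce one enormous step, and a 16-byte uniform buffer created with UNIFORM | COPY_DST usage is assumed.

```javascript
// Hypothetical helper: builds the payload for the Uniforms struct
// (dt, time, noiseScale) plus one float of padding to reach 16 bytes.
function packUniforms(nowMs, lastMs, noiseScale, maxDt = 1 / 30) {
  // Clamp dt so a stalled or backgrounded tab does not cause a huge step.
  const dt = Math.min((nowMs - lastMs) / 1000, maxDt);
  return new Float32Array([dt, nowMs / 1000, noiseScale, 0]);
}

const data = packUniforms(1016, 1000, 0.25);
console.log(data.byteLength); // 16

// Inside frame(), before encoding the pass (uniformBuffer is an assumed GPUBuffer):
// device.queue.writeBuffer(uniformBuffer, 0, data);
```

queue.writeBuffer() is ordered with respect to later submits on the same queue, so the dispatch encoded afterwards sees the new values without any extra synchronization.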

7. Case Study: 500K-Particle Flow Field

Combining everything above: a curl-noise flow field advecting half a million particles, rendered directly from the same storage buffer using WebGPU's render pipeline (no CPU readback needed — the vertex shader reads particles[vertexIndex] directly):

async function setupFlowField(device) {
  const N = 500_000;
  const initial = new Float32Array(N * 8);
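  for (let i = 0; i < N; i++) {
    const o = i * 8; // 8 floats per particle: pos.xyz, pad, vel.xyz, life
    initial[o + 0] = (Math.random() - 0.5) * 10; // pos.x
    initial[o + 1] = (Math.random() - 0.5) * 10; // pos.y
    initial[o + 2] = (Math.random() - 0.5) * 10; // pos.z
    // o + 3 is alignment padding; o + 4..6 is vel, left at zero
    initial[o + 7] = Math.random() * 5; // life (byte offset 28 in the WGSL struct)
  }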

  const buffer = device.createBuffer({
    size: initial.byteLength,
    usage: GPUBufferUsage.STORAGE | GPUBufferUsage.VERTEX,
    mappedAtCreation: true,
  });
  new Float32Array(buffer.getMappedRange()).set(initial);
  buffer.unmap();
  return buffer;
}

On a mid-range discrete GPU this kernel updates and renders 500,000 particles in well under 2 ms per frame — versus tens of milliseconds for the equivalent CPU loop in JavaScript, and it scales near-linearly with GPU core count rather than being capped by single-thread JS performance. The same architecture generalizes directly to boids flocking, SPH fluid neighbor sums, and cloth constraint solving — anything expressible as "for each element, read some state, write new state."
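A CPU reference of the same per-particle update is useful for checking GPU output on a handful of particles and for making the CPU-versus-GPU comparison yourself. The sketch below assumes the 32-byte layout that WGSL produces for vec3, vec3, f32 (pos.xyz, one padding float, vel.xyz, life); stepParticlesCPU is a hypothetical helper, and the flow field is passed in as a callback standing in for the shader's curlNoise, which is not reproduced here.

```javascript
const FLOATS_PER_PARTICLE = 8; // pos.xyz, pad, vel.xyz, life
const POS = 0, VEL = 4, LIFE = 7;

// Hypothetical CPU reference. field(x, y, z) returns [ax, ay, az] and is a
// caller-supplied stand-in for the shader's curlNoise.
function stepParticlesCPU(data, dt, field) {
  const count = data.length / FLOATS_PER_PARTICLE;
  for (let i = 0; i < count; i++) {
    const o = i * FLOATS_PER_PARTICLE;
    const a = field(data[o + POS], data[o + POS + 1], data[o + POS + 2]);
    for (let c = 0; c < 3; c++) {
      data[o + VEL + c] += a[c] * dt;                 // p.vel += curl * dt
      data[o + POS + c] += data[o + VEL + c] * dt;    // p.pos += p.vel * dt
    }
    data[o + LIFE] -= dt;
    if (data[o + LIFE] <= 0) {
      // Respawn at the origin, as the shader does.
      data.fill(0, o, o + FLOATS_PER_PARTICLE);
      data[o + LIFE] = 5;
    }
  }
  return data;
}

// One particle at x = 1 with 1 second of life, pushed along +x.
const particles = new Float32Array([1, 0, 0, 0, 0, 0, 0, 1]);
stepParticlesCPU(particles, 0.5, () => [2, 0, 0]);
console.log(particles[POS], particles[VEL], particles[LIFE]); // 1.5 1 0.5
```

Running this over a few hundred particles and comparing against a readback of the GPU buffer is a quick way to catch layout or indexing mistakes before scaling up.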

8. Barriers and Read-After-Write Hazards

Threads within a workgroup can share fast on-chip workgroup memory (var<workgroup>), but reads and writes to it must be explicitly synchronized with workgroupBarrier() — without it, one thread may read stale data another thread hasn't written yet, since invocations in a workgroup do not execute in lockstep instruction-for-instruction across all vendors' hardware:

var<workgroup> tile : array<f32, 64>;

@compute @workgroup_size(64)
fn main(
  @builtin(local_invocation_id) lid : vec3<u32>,
  @builtin(global_invocation_id) gid : vec3<u32>
) {
  tile[lid.x] = particles[gid.x].pos.x;
  workgroupBarrier(); // wait: every thread must finish the write above
                       // before any thread reads tile[]

  var sum = 0.0;
  for (var j = 0u; j < 64u; j += 1u) {
    sum += tile[j]; // safe: barrier guaranteed all 64 writes landed
  }
}
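To see why the barrier matters, it can help to model one workgroup on the CPU. The sketch below is a plain JavaScript model, not GPU code: with the barrier, every invocation finishes its write before any invocation reads; without it, one legal schedule is that each invocation writes and then immediately reads, so early invocations see slots that still hold their zero-initialized value. workgroupSum is a hypothetical name.

```javascript
// CPU model of one workgroup filling a shared tile and summing it.
function workgroupSum(values, useBarrier) {
  const n = values.length;
  const tile = new Array(n).fill(0); // workgroup memory starts zeroed
  const sums = new Array(n);
  const readAll = () => tile.reduce((acc, v) => acc + v, 0);

  if (useBarrier) {
    // Phase 1: all invocations write. Phase 2: all invocations read.
    for (let lid = 0; lid < n; lid++) tile[lid] = values[lid];
    for (let lid = 0; lid < n; lid++) sums[lid] = readAll();
  } else {
    // One possible schedule with no barrier: write, then read straight away.
    for (let lid = 0; lid < n; lid++) {
      tile[lid] = values[lid];
      sums[lid] = readAll();
    }
  }
  return sums;
}

console.log(workgroupSum([1, 2, 3, 4], true));  // [ 10, 10, 10, 10 ]
console.log(workgroupSum([1, 2, 3, 4], false)); // [ 1, 3, 6, 10 ]
```

On real hardware the unsynchronized result is not even this predictable, since the interleaving depends on how the scheduler runs the workgroup's warps or wavefronts.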

Across different dispatches (e.g. two consecutive dispatchWorkgroups() calls in the same pass, or separate compute passes), WebGPU guarantees ordering automatically: a later dispatch that reads a storage buffer will always see the writes from an earlier dispatch that wrote it. You never need manual buffer barriers between dispatches the way you would in raw Vulkan; the browser's WebGPU implementation inserts the necessary synchronization for you.

9. Performance: Occupancy and Memory Coalescing

Three things dominate compute shader throughput in practice:

Occupancy: how many workgroups can run concurrently on a compute unit, limited by shared memory and register usage per thread. Keep var<workgroup> arrays small and avoid excessive local variables in hot loops.
Memory coalescing: adjacent threads (gid.x, gid.x+1, ...) should read/write adjacent memory addresses. A struct-of-arrays layout (separate positions[], velocities[] buffers) often coalesces better than array-of-structs for wide, uniform-access kernels, though AoS is usually easier to reason about and is fine when each thread's fields are read together.
Divergent branching: an if that takes different paths across threads in the same warp or wavefront forces the hardware to execute both paths serially for that whole group of lanes. Keep branches uniform where possible, or restructure with select() for cheap two-way choices.
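If profiling suggests trying the struct-of-arrays layout, the conversion from an interleaved particle array is straightforward on the CPU side. The sketch below assumes the interleaved layout of 8 floats per particle with pos at float 0 and vel at float 4; splitToSoA is a hypothetical helper, and each output element uses four floats because array<vec3<f32>> in WGSL has a 16-byte stride.

```javascript
// Hypothetical helper: splits interleaved particles into separate position
// and velocity arrays, 4 floats per element (xyz plus one unused float).
function splitToSoA(aos, floatsPerParticle = 8, posOffset = 0, velOffset = 4) {
  const count = Math.floor(aos.length / floatsPerParticle);
  const positions = new Float32Array(count * 4);
  const velocities = new Float32Array(count * 4);
  for (let i = 0; i < count; i++) {
    const src = i * floatsPerParticle;
    const dst = i * 4;
    for (let c = 0; c < 3; c++) {
      positions[dst + c] = aos[src + posOffset + c];
      velocities[dst + c] = aos[src + velOffset + c];
    }
  }
  return { positions, velocities };
}

const aos = new Float32Array([
  1, 2, 3, 0, 4, 5, 6, 9, // particle 0: pos.xyz, pad, vel.xyz, life
  7, 8, 9, 0, 1, 1, 1, 2, // particle 1
]);
const soa = splitToSoA(aos);
console.log(Array.from(soa.positions));  // [ 1, 2, 3, 0, 7, 8, 9, 0 ]
console.log(Array.from(soa.velocities)); // [ 4, 5, 6, 0, 1, 1, 1, 0 ]
```

Each output array would then back its own storage buffer and binding. Whether this beats the interleaved layout depends on the kernel, so measure both.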

Measuring: use a GPUQuerySet with type: 'timestamp' (available only when the device was requested with the optional 'timestamp-query' feature) to bracket a compute pass and read back GPU-side timing. Calling performance.now() around queue.submit() only measures CPU-side encoding time, not actual GPU execution time.
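Resolved timestamp queries arrive as 64-bit unsigned integers in nanoseconds, so the last step is a small conversion. The sketch below covers only that step; gpuDurationMs is a hypothetical helper, the sample values are made up, and the query set, resolveQuerySet() call, and mapped readback buffer that would produce the BigUint64Array are assumed to exist elsewhere.

```javascript
// Hypothetical helper: converts a begin/end pair of resolved timestamps
// (nanoseconds, as BigInt) into milliseconds.
function gpuDurationMs(timestamps, beginIndex = 0, endIndex = 1) {
  const begin = timestamps[beginIndex];
  const end = timestamps[endIndex];
  // Guard against a negative delta (e.g. a counter reset); treat it as "no sample".
  if (end < begin) return null;
  return Number(end - begin) / 1e6;
}

// Made-up values standing in for a mapped readback buffer:
const resolved = new BigUint64Array([1000000n, 2500000n]);
console.log(gpuDurationMs(resolved)); // 1.5
```

Averaging over many frames and discarding null samples gives a steadier number than any single measurement.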

10. Feature Detection and WebGL Fallback

Roughly a fifth of browser traffic still lacks WebGPU support as of mid-2026 (older Safari/iOS versions, some enterprise Chrome builds with the feature disabled by policy). A production simulation should detect capability and gracefully degrade to the WebGL2 transform-feedback technique:

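async function createSimBackend(canvas) {
  if ('gpu' in navigator) {
    try {
      const adapter = await navigator.gpu.requestAdapter();
      if (adapter) {
        return new WebGPUBackend(await adapter.requestDevice());
      }
    } catch (err) {
      // requestDevice() can reject; treat that like missing support
      console.warn('WebGPU initialization failed:', err);
    }
  }
  console.warn('WebGPU unavailable, falling back to WebGL2 transform feedback');
  const gl = canvas.getContext('webgl2'); // null if WebGL2 is unavailable too
  if (!gl) throw new Error('Neither WebGPU nor WebGL2 is available');
  return new WebGL2Backend(gl);
}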

Structure the simulation core behind a small interface — step(dt), getPositionBuffer(), dispose() — so the rendering and UI layers never need to know which backend is active. This is the same pattern used throughout the WebGL compute-particles tutorial on this site, and it lets you ship one simulation that runs everywhere while getting the full performance benefit of WebGPU wherever it's available.
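As a rough illustration of that interface, the sketch below checks that a backend exposes the three methods and provides a do-nothing stand-in for unit tests where no GPU is present. assertSimBackend and NullBackend are hypothetical names, not classes from the tutorial's codebase, and the four-floats-per-particle position buffer is an assumption.

```javascript
const BACKEND_METHODS = ['step', 'getPositionBuffer', 'dispose'];

// Hypothetical check: throws if an object lacks any method of the interface.
function assertSimBackend(backend) {
  for (const name of BACKEND_METHODS) {
    if (typeof backend?.[name] !== 'function') {
      throw new TypeError(`Backend is missing ${name}()`);
    }
  }
  return backend;
}

// Hypothetical stand-in backend: tracks elapsed time and owns a position
// array, but performs no simulation.
class NullBackend {
  constructor(particleCount) {
    this.positions = new Float32Array(particleCount * 4);
    this.elapsed = 0;
  }
  step(dt) { this.elapsed += dt; }
  getPositionBuffer() { return this.positions; }
  dispose() { this.positions = null; }
}

const backend = assertSimBackend(new NullBackend(2));
backend.step(0.5);
console.log(backend.elapsed, backend.getPositionBuffer().length); // 0.5 8
```

Running the same check against the WebGPU and WebGL2 backends at startup surfaces a missing method immediately instead of deep inside the render loop.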