WebGPU Compute Shaders in Production — 50× Faster Physics

We migrated the Lattice-Boltzmann and SPH fluid simulations from JavaScript to WebGPU compute shaders. The result: 50× faster on desktop, 3D simulation grids that were completely impossible before, and a clean progressive fallback for browsers without WebGPU. Here's everything we learned.

Why WebGPU Compute?

WebGL has compute shaders — sort of. Transform feedback and render-to-texture can be abused as GPGPU, but only for vector operations that fit into a texture coordinate. WebGPU's compute pipelines are first-class: arbitrary data access, read-write storage buffers, atomic operations, and workgroup shared memory. This is what CUDA and Metal Compute gave native apps; now it's in the browser.

Our two target simulations — the Lattice-Boltzmann D2Q9 flow and the SPH fluid — are both embarrassingly parallel at the cell/particle level. The collision step visits every cell independently. The density estimation step visits every particle pair within a radius. These are textbook GPGPU workloads.

Benchmark: JavaScript vs WebGPU

Simulation Grid / Count JS (fps) WebGPU (fps) Speedup
Lattice-Boltzmann 2D 512 × 512 4 60 15×
LBM 2D high-res 1024 × 1024 <1 48 ~50×
Lattice-Boltzmann 3D 128³ (new!) impossible 24
SPH Fluid 4 000 particles 18 60 3.3×
SPH Fluid 16 000 particles 1 38 38×

The WGSL Compute Shader

WGSL (WebGPU Shading Language) is the language of WebGPU shaders. It's closer to Rust than GLSL — explicit types, no implicit casts, explicit binding groups. Here's the core LBM collision step:

WGSL — LBM BGK Collision (simplified)
// binding layout: group 0, bindings 0-2
@group(0) @binding(0) var<storage, read>       f_in  : array<f32>;
@group(0) @binding(1) var<storage, read_write> f_out : array<f32>;
@group(0) @binding(2) var<uniform>             params: Params;

struct Params { nx: u32, ny: u32, tau: f32 }

@compute @workgroup_size(16, 16)
fn lbm_collision(@builtin(global_invocation_id) id: vec3<u32>) {
  let x = id.x; let y = id.y;
  if (x >= params.nx || y >= params.ny) { return; }

  let base = (y * params.nx + x) * 9u;
  var rho = 0.0; var ux = 0.0; var uy = 0.0;

  // Compute density and velocity from distribution functions
  for (var i = 0u; i < 9u; i++) {
    let fi = f_in[base + i];
    rho += fi;
    ux  += fi * EX[i];
    uy  += fi * EY[i];
  }
  ux /= rho; uy /= rho;

  // BGK collision: f_out = f_in - (f_in - f_eq) / tau
  let usq = ux*ux + uy*uy;
  for (var i = 0u; i < 9u; i++) {
    let eu = EX[i]*ux + EY[i]*uy;
    let feq = rho * W[i] * (1.0 + 3.0*eu + 4.5*eu*eu - 1.5*usq);
    f_out[base + i] = f_in[base + i] - (f_in[base + i] - feq) / params.tau;
  }
}

Fallback Strategy

WebGPU has ~65% browser support as of late 2026 (Chrome 113+, Edge, Safari Technology Preview, Firefox behind a flag). For the other 35% we fall back to the JavaScript version. The detection is a single feature check:

JavaScript — WebGPU detection + fallback
async function initSimulation() {
  if ('gpu' in navigator) {
    const adapter = await navigator.gpu.requestAdapter();
    if (adapter) {
      const device = await adapter.requestDevice();
      return new LBMSimulatorWebGPU(device);   // fast path
    }
  }
  return new LBMSimulatorJS();  // fallback — same API, JS implementation
}

The key insight: both the WebGPU and JavaScript implementations expose exactly the same step() and getVelocityField() interface. The rendering code doesn't know which backend it's talking to. This means the fallback is completely transparent to the user — they just see a slower simulation on unsupported browsers.

WebGPU Gotchas We Hit

⚠️ Buffer alignment

WGSL storage buffers must be 256-byte aligned. Our 9-float per-cell LBM buffer needed explicit padding to avoid silent data corruption on some GPUs.

⚠️ No dynamic indexing in uniforms

The D2Q9 weight and direction constants can't be in a uniform buffer with dynamic index. We embedded them as WGSL constants (compile-time arrays).

✅ Workgroup size tuning

16×16 workgroups outperformed 8×8 and 32×32 on most tested hardware. AMD and Apple Silicon preferred 8×8. We expose this as a quality preset.

✅ Streaming from compute to render

The velocity field buffer is bound as both a compute storage buffer and a WebGPU vertex buffer. No CPU readback needed — the data lives on GPU the entire frame.