April 2026 · 16 min read · WebGL · WebGPU · GPU Performance

Instanced Rendering & LOD: Draw Millions of Objects at 60 fps

A modern game or simulation should render hundreds of thousands of trees, grass blades, particles, or simulation objects without dropping below 60 fps. Two GPU techniques make this possible: instanced rendering collapses many draw calls into one, and level-of-detail (LOD) substitutes distant objects with cheaper approximations. This article explains both at the GPU level and shows how to implement them in WebGL 2 and WebGPU.

1. The Draw-Call Bottleneck

Every call to gl.drawElements() (or its equivalents) incurs CPU overhead: the driver validates state, encodes the command, and later submits the accumulated command buffer to the GPU queue. On a modern desktop GPU, the practical budget is typically 5,000–50,000 draw calls per frame before the CPU saturates, regardless of how fast the GPU itself is.

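For a forest scene with 200,000 trees, issuing one draw call per tree goes far beyond that budget. The solutions, covered in the sections that follow, are instanced rendering, culling, level of detail, and impostors.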

Key metric: On a high-end desktop (R9 7900X + RTX 4080), uninstanced draw calls saturate the CPU at ~50,000 calls/frame. Instanced rendering achieves the same pixel output with 1–10 calls, freeing the CPU for physics, AI, and simulation work.
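
As a back-of-envelope check on these figures, the sketch below converts a per-call CPU cost into frame time. The per-call cost is an assumption derived from the ~50,000 calls/frame figure at 60 fps, not a measurement, and drawCallCpuMs is a hypothetical helper.

```javascript
const FRAME_BUDGET_MS = 1000 / 60; // about 16.67 ms per frame at 60 fps

// CPU time spent submitting draw calls, given an assumed cost per call.
function drawCallCpuMs(drawCalls, microsecondsPerCall) {
  return (drawCalls * microsecondsPerCall) / 1000;
}

// Assumption: 50,000 calls exactly fill one frame, so one call costs about 0.33 µs.
const usPerCall = (FRAME_BUDGET_MS * 1000) / 50_000;

const uninstancedMs = drawCallCpuMs(200_000, usPerCall); // one call per tree
const instancedMs = drawCallCpuMs(10, usPerCall);        // a handful of instanced calls

console.log(uninstancedMs.toFixed(1), instancedMs.toFixed(4));
```

Under that assumption, 200,000 individual calls cost roughly four frame budgets of CPU time, while ten instanced calls are negligible.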

2. Instanced Rendering: gl_InstanceID

In WebGL 2 (which exposes OpenGL ES 3.0), instanced rendering is invoked with:

// WebGL 2 — draw 100,000 trees with one call
gl.drawElementsInstanced(
  gl.TRIANGLES,         // primitive type
  indexCount,           // indices in base mesh
  gl.UNSIGNED_INT,      // index type
  0,                    // byte offset
  100_000               // instanceCount
);
    

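Inside the vertex shader, the built-in variable gl_InstanceID (GLSL ES 3.0) contains the index of the current instance (0 … instanceCount−1) and can be used to fetch per-instance data from a uniform array or a texture. The shader below takes the other common route and reads the per-instance transform from an instanced vertex attribute: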

#version 300 es
// Vertex shader (GLSL ES 3.0); the #version directive must stay on the first line
in vec3 a_position;   // base mesh vertex
in vec3 a_normal;

// Per-instance data (divisor = 1)
in mat4 a_instanceMatrix;  // 4 consecutive vec4 attributes

uniform mat4 u_viewProj;

out vec3 v_normal;

void main() {
  mat3 normalMat = transpose(inverse(mat3(a_instanceMatrix)));
  v_normal = normalize(normalMat * a_normal);
  gl_Position = u_viewProj * a_instanceMatrix * vec4(a_position, 1.0);
}
    

The per-instance matrix occupies 4 attribute slots. Setting the attribute divisor to 1 tells the GPU to advance the instance data pointer once per instance rather than once per vertex:

// Bind per-instance transform buffer
gl.bindBuffer(gl.ARRAY_BUFFER, instanceMatrixBuffer);
const bytesPerMatrix = 16 * 4;
for (let i = 0; i < 4; i++) {
  const attribLoc = matrixAttribLocation + i;
  gl.enableVertexAttribArray(attribLoc);
  gl.vertexAttribPointer(attribLoc, 4, gl.FLOAT, false,
    bytesPerMatrix, i * 16); // each column = 4 floats × 4 bytes
  gl.vertexAttribDivisor(attribLoc, 1); // advance once per INSTANCE
}
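
The buffer bound above has to contain one column-major 4×4 matrix per instance. A minimal sketch of filling it on the CPU follows; buildInstanceMatrices and the { x, y, z, scale } instance shape are hypothetical, and only translation plus uniform scale are handled.

```javascript
// Build a tightly packed Float32Array of column-major 4×4 matrices,
// one per instance (translation + uniform scale only).
function buildInstanceMatrices(instances) {
  const data = new Float32Array(instances.length * 16); // zero-initialised
  instances.forEach(({ x, y, z, scale = 1 }, i) => {
    const o = i * 16;
    data[o + 0] = scale;  // column 0
    data[o + 5] = scale;  // column 1
    data[o + 10] = scale; // column 2
    data[o + 12] = x;     // column 3 holds the translation
    data[o + 13] = y;
    data[o + 14] = z;
    data[o + 15] = 1;
  });
  return data;
}

const matrices = buildInstanceMatrices([
  { x: 1, y: 2, z: 3 },
  { x: -4, y: 0, z: 9, scale: 2 },
]);
// With a WebGL 2 context and the buffer bound:
// gl.bufferData(gl.ARRAY_BUFFER, matrices, gl.DYNAMIC_DRAW);
```

Because each matrix is stored column by column, attribute slot i reads 16 bytes starting at byte offset i * 16 within each 64-byte stride.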
    

3. Per-Instance Data Layout

Each instance needs at least a 4×4 transform matrix (64 bytes). Commonly added per-instance data:

Total: ~96 bytes per instance. For 1,000,000 instances that is 96 MB — comfortably within modern GPU VRAM. Upload once; update only dirty instances per frame.
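
The arithmetic and the "update only dirty instances" idea can be sketched as below. This assumes the 96-byte figure above and a CPU-side Float32Array mirror of the buffer; dirtyRanges is a hypothetical helper that merges dirty indices into contiguous runs so that each run becomes one upload.

```javascript
const BYTES_PER_INSTANCE = 96; // figure quoted in the text
const FLOATS_PER_INSTANCE = BYTES_PER_INSTANCE / 4;

function instanceBufferBytes(instanceCount) {
  return instanceCount * BYTES_PER_INSTANCE;
}

// Merge dirty instance indices into contiguous { first, count } runs.
function dirtyRanges(dirtyIndices) {
  const sorted = [...new Set(dirtyIndices)].sort((a, b) => a - b);
  const ranges = [];
  for (const idx of sorted) {
    const last = ranges[ranges.length - 1];
    if (last && idx === last.first + last.count) last.count++;
    else ranges.push({ first: idx, count: 1 });
  }
  return ranges;
}

const ranges = dirtyRanges([7, 3, 4, 5, 42, 8]);
// Per range, in WebGL 2 (cpuCopy is the Float32Array mirror):
// gl.bufferSubData(gl.ARRAY_BUFFER, first * BYTES_PER_INSTANCE, cpuCopy,
//                  first * FLOATS_PER_INSTANCE, count * FLOATS_PER_INSTANCE);
console.log(instanceBufferBytes(1_000_000), ranges);
```

Fewer, larger uploads are usually cheaper than many tiny ones, so an implementation might also merge runs separated by small gaps.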

Packing into a Texture

An alternative to attribute buffers is a data texture: pack matrices as RGBA32F pixels, look them up in the vertex shader using texelFetch() with the instance ID:

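// Pack: N matrices → one RGBA32F texel per matrix column (4 texels per matrix),
// stored here in a single row: width = 4N, height = 1
// (wrap into several rows once 4N exceeds MAX_TEXTURE_SIZE)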
// Look up in vertex shader:
int baseTexel = gl_InstanceID * 4;
mat4 transform = mat4(
  texelFetch(u_instanceTex, ivec2(baseTexel+0, 0), 0),
  texelFetch(u_instanceTex, ivec2(baseTexel+1, 0), 0),
  texelFetch(u_instanceTex, ivec2(baseTexel+2, 0), 0),
  texelFetch(u_instanceTex, ivec2(baseTexel+3, 0), 0)
);
    

This is particularly efficient when the instance data is already produced by a compute shader (Texture Output or Storage Texture in WebGPU), avoiding the CPU→GPU transfer entirely.
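
On the CPU side, packing for the data-texture route might look like the sketch below. It wraps texels into rows so the texture stays within the size limit; packMatricesToTexture, texelCoord, and the default limit of 4096 are assumptions for illustration, and a shader using this layout would derive (x, y) from the texel index in the same way.

```javascript
// Pack column-major 4×4 matrices into an RGBA32F texel grid, 4 texels per matrix.
function packMatricesToTexture(matrices, maxTextureSize = 4096) {
  const texelsNeeded = (matrices.length / 16) * 4;
  const maxWidth = maxTextureSize - (maxTextureSize % 4); // keep rows a multiple of 4 texels
  const width = Math.min(maxWidth, Math.max(4, texelsNeeded));
  const height = Math.max(1, Math.ceil(texelsNeeded / width));
  const data = new Float32Array(width * height * 4); // 4 floats per RGBA texel
  data.set(matrices);
  return { width, height, data };
}

// Texel coordinate of one matrix column, mirroring what the shader would compute.
function texelCoord(instanceId, column, width) {
  const texel = instanceId * 4 + column;
  return { x: texel % width, y: Math.floor(texel / width) };
}

const packed = packMatricesToTexture(new Float32Array(3 * 16), 8);
// Upload in WebGL 2 (use NEAREST filtering for texelFetch-style access):
// gl.texImage2D(gl.TEXTURE_2D, 0, gl.RGBA32F, packed.width, packed.height, 0,
//               gl.RGBA, gl.FLOAT, packed.data);
```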

4. Frustum & Occlusion Culling

Frustum Culling

Even with instancing, drawing 1,000,000 invisible trees wastes vertex shader cycles. Frustum culling rejects instances whose bounding sphere lies entirely outside the view frustum's 6 half-spaces:

// Frustum plane equation: p = {normal: n, distance: d}
// Sphere (centre c, radius r) is outside if:
//   dot(p.n, c) + p.d + r < 0   for any of the 6 planes

For CPU-side culling, iterate all instances and write visible instance matrices to a compact buffer (prefix-sum compaction), then draw only N_visible instances. CPU culling is simple but limits scalability.
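
A minimal CPU-side sketch of the test and the compaction step follows. It assumes plane normals point into the frustum and that matrices are stored 16 floats per instance; sphereOutsideFrustum and cullAndCompact are hypothetical names.

```javascript
// planes: 6 entries of { n: [x, y, z], d }, normals pointing into the frustum.
function sphereOutsideFrustum(planes, c, r) {
  return planes.some(
    (p) => p.n[0] * c[0] + p.n[1] * c[1] + p.n[2] * c[2] + p.d + r < 0
  );
}

// Copy the matrices of visible instances to the front of `out`;
// returns the number of visible instances.
function cullAndCompact(planes, spheres, matrices, out) {
  let visible = 0;
  for (let i = 0; i < spheres.length; i++) {
    const s = spheres[i];
    if (sphereOutsideFrustum(planes, s.center, s.radius)) continue;
    out.set(matrices.subarray(i * 16, i * 16 + 16), visible * 16);
    visible++;
  }
  return visible;
}
```

The returned count is what would be passed as instanceCount to gl.drawElementsInstanced() after uploading the first visible * 16 floats of out.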

GPU-Driven Culling

The modern approach moves culling entirely to a compute shader:

  1. Compute shader reads all instance bounding spheres from a storage buffer.
  2. Per-instance thread: test against frustum planes; if visible, atomically increment a counter and write to a compact output buffer (stream compaction).
  3. Write the count to an indirect draw argument buffer.
  4. Call drawIndirect() — the GPU draws exactly the culled instance count without the CPU knowing the number.

This pipeline keeps data entirely on the GPU and eliminates the CPU readback bottleneck. In WebGPU this is straightforward. WebGL 2 has neither compute shaders nor indirect draws, so there the culling runs on the CPU (or through transform-feedback workarounds) and the visible instance count must be passed to drawElementsInstanced() from JavaScript.
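
The four steps above can be sketched as a single WGSL compute shader wrapped in JavaScript. The binding layout, struct names, and workgroup size of 64 are assumptions for illustration, and the instance count must be reset to zero before each dispatch.

```javascript
// WGSL source for a frustum-cull pass (binding layout is hypothetical).
const CULL_SHADER_WGSL = /* wgsl */ `
struct Sphere { center: vec3f, radius: f32 }
struct DrawArgs {
  vertexCount: u32,
  instanceCount: atomic<u32>,
  firstVertex: u32,
  firstInstance: u32,
}

@group(0) @binding(0) var<uniform> planes: array<vec4f, 6>; // xyz = normal, w = distance
@group(0) @binding(1) var<storage, read> spheres: array<Sphere>;
@group(0) @binding(2) var<storage, read_write> args: DrawArgs;
@group(0) @binding(3) var<storage, read_write> visible: array<u32>;

@compute @workgroup_size(64)
fn main(@builtin(global_invocation_id) gid: vec3u) {
  let i = gid.x;
  if (i >= arrayLength(&spheres)) { return; }
  let s = spheres[i];
  for (var p = 0u; p < 6u; p++) {
    if (dot(planes[p].xyz, s.center) + planes[p].w + s.radius < 0.0) { return; }
  }
  let slot = atomicAdd(&args.instanceCount, 1u); // reserve a slot in the compact list
  visible[slot] = i;                             // store the surviving instance index
}`;

// Create the compute pipeline from a GPUDevice.
function createCullPipeline(device) {
  const module = device.createShaderModule({ code: CULL_SHADER_WGSL });
  return device.createComputePipeline({
    layout: "auto",
    compute: { module, entryPoint: "main" },
  });
}
```

In this sketch the compact buffer holds instance indices rather than full matrices, so the vertex shader would read visible[instance_index] and then fetch the transform. The atomic counter does not preserve instance order, which is usually acceptable for opaque geometry.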

Occlusion Culling

Occlusion culling additionally rejects objects hidden behind nearer geometry. Hierarchical Z-buffer (Hi-Z) occlusion culling builds a mip pyramid of the depth buffer in which each texel stores the farthest depth of the texels beneath it; each instance's bounding box is tested at the mip level where its screen footprint covers only a few texels. If the nearest point of the bounding box is deeper than the stored max depth across that footprint, the object is occluded. This is the technique used by Frostbite, Killzone, and other AAA engines and titles for draw-call reduction in massive scenes.
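
A CPU-side sketch of the Hi-Z idea, useful for reasoning about the GPU version. It assumes a square power-of-two depth buffer with depths in [0, 1] where larger means farther; buildHiZ and isOccluded are hypothetical helpers.

```javascript
// Build a max-depth mip pyramid: each texel keeps the farthest depth below it.
function buildHiZ(depth, size) {
  const levels = [{ size, data: depth }];
  while (size > 1) {
    const prev = levels[levels.length - 1].data;
    const half = size >> 1;
    const data = new Float32Array(half * half);
    for (let y = 0; y < half; y++) {
      for (let x = 0; x < half; x++) {
        const i = 2 * y * size + 2 * x; // top-left texel of the 2×2 block in prev
        data[y * half + x] = Math.max(prev[i], prev[i + 1], prev[i + size], prev[i + size + 1]);
      }
    }
    levels.push({ size: half, data });
    size = half;
  }
  return levels;
}

// rect: level-0 pixel bounds (min inclusive, max exclusive);
// nearestDepth: depth of the closest point of the bounding box.
function isOccluded(levels, rect, nearestDepth) {
  const extent = Math.max(rect.maxX - rect.minX, rect.maxY - rect.minY, 1);
  // Pick the level where the footprint spans at most 2 texels per axis.
  const lod = Math.min(levels.length - 1, Math.ceil(Math.log2(extent)));
  const { size, data } = levels[lod];
  const x0 = rect.minX >> lod, x1 = Math.min(size - 1, (rect.maxX - 1) >> lod);
  const y0 = rect.minY >> lod, y1 = Math.min(size - 1, (rect.maxY - 1) >> lod);
  let farthest = 0;
  for (let y = y0; y <= y1; y++)
    for (let x = x0; x <= x1; x++) farthest = Math.max(farthest, data[y * size + x]);
  return nearestDepth > farthest;
}
```

The test is conservative: an object is only rejected when it lies behind the farthest stored depth over its whole footprint, so visible objects are never culled by mistake, though some hidden ones survive.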

5. Discrete LOD & Screen-Space Error

A level-of-detail (LOD) system stores multiple pre-simplified versions of a mesh and selects among them based on the object's distance or projected screen size:

// Screen-space projected diameter (approximate):
screenSize = (objectRadius * projectionScaleFactor) / depth

// LOD thresholds (example for a tree):
if      (screenSize > 0.15) LOD = 0;  // full mesh (~8000 tris)
else if (screenSize > 0.05) LOD = 1;  // mid-detail (~1500 tris)
else if (screenSize > 0.01) LOD = 2;  // low-detail (~200 tris)
else                        LOD = 3;  // impostor sprite

Screen-space size is the correct metric: a small helicopter at 10 m and a large skyscraper at 1 km may project to the same screen diameter and deserve the same LOD. Thresholding on raw distance ignores object size and therefore switches LODs at visually wrong moments.
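
A small runnable version of this selection, assuming projectionScaleFactor is 1 / tan(fovY / 2) for a perspective camera, which expresses screenSize in units of half the viewport height. The function names and the helicopter and skyscraper radii are illustrative.

```javascript
// Approximate projected size of a bounding sphere.
function projectedScreenSize(radius, depth, fovYRadians) {
  const projectionScale = 1 / Math.tan(fovYRadians / 2);
  return (radius * projectionScale) / depth;
}

// Thresholds sorted from largest to smallest; returns the LOD index,
// with thresholds.length meaning "smaller than every threshold" (impostor).
function selectLod(screenSize, thresholds = [0.15, 0.05, 0.01]) {
  const lod = thresholds.findIndex((t) => screenSize > t);
  return lod === -1 ? thresholds.length : lod;
}

const fov = Math.PI / 2;
const helicopter = projectedScreenSize(2, 10, fov);     // 2 m radius at 10 m
const skyscraper = projectedScreenSize(200, 1000, fov); // 200 m radius at 1 km
console.log(selectLod(helicopter), selectLod(skyscraper));
```

Both objects land in the same LOD because they project to the same size, even though their distances differ by a factor of 100.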

Nanite-Style Virtual Geometry

Unreal Engine 5's Nanite takes LOD to its limit: every triangle cluster in the scene has a precomputed screen-space error bound, and the runtime selects the coarsest level whose error stays below 1 pixel. No hand-authored LOD levels; the entire mesh hierarchy is built offline using a DAG (Directed Acyclic Graph) of progressively simplified cluster groups. The GPU culls and selects at cluster granularity in a compute shader, then rasterises with a custom software rasteriser for small (sub-pixel) triangles. This is far beyond what WebGL can do today, but WebGPU's compute pipelines are the necessary first step.

LOD with Instancing

Combining LOD and instancing requires sorting instances into per-LOD buckets and issuing one instanced draw call per LOD per material. With 4 LOD levels and 3 materials, that's 12 draw calls — still far fewer than one call per object. GPU-driven approaches can do this sort in a compute shader using parallel prefix-sum.
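
The bucketing step is a counting sort: count instances per LOD, prefix-sum the counts into offsets, then scatter. The CPU sketch below shows the same structure a compute-shader version would follow; bucketByLod is a hypothetical helper.

```javascript
// lods[i] is the LOD index chosen for instance i.
function bucketByLod(lods, lodCount) {
  const counts = new Uint32Array(lodCount);
  for (const l of lods) counts[l]++;

  // Exclusive prefix sum: offsets[l] is where bucket l starts.
  const offsets = new Uint32Array(lodCount);
  for (let l = 1; l < lodCount; l++) offsets[l] = offsets[l - 1] + counts[l - 1];

  // Scatter instance indices into their buckets, keeping relative order.
  const cursor = offsets.slice();
  const sorted = new Uint32Array(lods.length);
  lods.forEach((l, i) => { sorted[cursor[l]++] = i; });

  return { counts, offsets, sorted };
}

const buckets = bucketByLod([2, 0, 1, 0, 3, 2], 4);
console.log(Array.from(buckets.sorted)); // instance indices grouped by LOD
```

Each LOD then becomes one instanced draw that covers counts[l] instances starting at offsets[l] in the sorted list.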

6. CLOD & Geomorphing

Discrete LOD causes a visible popping artefact when an object switches between levels — a sudden change in vertex count and position. Two techniques eliminate this:

Alpha LOD (Dithered Crossfade)

Render both the current and next LOD level simultaneously, crossfading with a screen-space dithered alpha mask. This spreads the transition over a screenSize range [d₁, d₂]. Unity's LOD Group component offers this as its Cross Fade mode; Three.js's LOD object switches levels discretely, so there the crossfade has to be implemented in custom shaders. The technique doubles the draw calls in the crossfade band but avoids hard popping.
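
The per-pixel decision can be sketched with a 4×4 ordered-dither (Bayer) matrix. This is one common choice of mask, not necessarily what any particular engine uses; in a real renderer the logic runs in the fragment shader with discard, and the names below are hypothetical.

```javascript
const BAYER_4X4 = [
   0,  8,  2, 10,
  12,  4, 14,  6,
   3, 11,  1,  9,
  15,  7, 13,  5,
];
const clamp01 = (v) => Math.min(1, Math.max(0, v));

// 0 at the far end of the band (d1), 1 at the near end (d2).
function crossfadeFactor(screenSize, d1, d2) {
  return clamp01((screenSize - d1) / (d2 - d1));
}

// Threshold in (0, 1) that repeats every 4 pixels in x and y.
function ditherThreshold(px, py) {
  return (BAYER_4X4[(py & 3) * 4 + (px & 3)] + 0.5) / 16;
}

// The two LODs use complementary masks, so each pixel is drawn by exactly one.
function keepFragment(px, py, fade, isDetailedLod) {
  const visible = fade > ditherThreshold(px, py);
  return isDetailedLod ? visible : !visible;
}
```

At a fade of 0.5 the detailed LOD keeps half of the pixels in every 4×4 tile and the coarser LOD keeps the other half, so no blending or sorting is required.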

Geomorphing (CLOD)

Continuous LOD (CLOD) geomorphs vertex positions toward their target LOD configuration over time. Each LOD mesh records, for each vertex, where that vertex maps in the next simpler LOD (or that it simply disappears). The vertex shader linearly interpolates toward the target position:

// Per-vertex data: current position + morph target position + morph start/end screenSize
// morphFactor is 0 at full detail and reaches 1 as the object shrinks to the lower threshold
float morphFactor = 1.0 - smoothstep(screenSize_threshold_lo, screenSize_threshold_hi, screenSize);
vec3 morphedPos = mix(a_posLOD_current, a_posLOD_next, morphFactor);

This requires storing two position sets per vertex (doubling the position data) and careful mesh simplification that records the collapse map. Hoppe's Progressive Meshes (1996) is the canonical method for generating the per-vertex morph target data.
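
A CPU sketch of the interpolation, assuming the morph factor should be 0 at the upper screen-size threshold (full detail) and reach 1 at the lower threshold, where the simpler LOD takes over. The helper names are illustrative.

```javascript
function smoothstep(lo, hi, x) {
  const t = Math.min(1, Math.max(0, (x - lo) / (hi - lo)));
  return t * t * (3 - 2 * t);
}

const mix = (a, b, t) => a.map((v, i) => v + (b[i] - v) * t);

// posCurrent: vertex position in the current LOD.
// posNext: where that vertex ends up in the next simpler LOD.
function geomorph(posCurrent, posNext, screenSize, lo, hi) {
  const morphFactor = 1 - smoothstep(lo, hi, screenSize);
  return mix(posCurrent, posNext, morphFactor);
}

const cur = [0, 0, 0];
const next = [2, 4, 6];
console.log(geomorph(cur, next, 0.10, 0.05, 0.15)); // halfway through the band
```

Once the factor reaches 1 the mesh is geometrically identical to the simpler LOD, so swapping index buffers at that point causes no visible pop.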

Terrain CLOD: ROAM and Geoclipmap

Terrain LOD is a special case: the mesh is tessellated adaptively based on view distance and surface roughness. ROAM (Duchaineau et al., 1997) does this on the CPU by splitting and merging triangles in a binary triangle tree each frame. The Geoclipmap technique (Losasso & Hoppe, SIGGRAPH 2004) instead uses nested, axis-aligned square rings of geometry centred on the camera, with each outer ring at half the resolution of the ring inside it. Ring transitions are geomorphed to avoid seams. Clipmap-style nested grids remain a common basis for terrain rendering in open-world game engines.
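
The ring layout can be sketched as follows. Each coarser level doubles the grid spacing, and each level's origin is snapped to the grid of the next coarser level so that vertices line up across ring boundaries. clipmapLevels and its parameters are hypothetical, and the snapping rule is a simplification of the published scheme.

```javascript
// Spacing and snapped origin for each clipmap level around the camera.
function clipmapLevels(cameraX, cameraZ, levelCount, baseSpacing = 1) {
  const levels = [];
  for (let l = 0; l < levelCount; l++) {
    const spacing = baseSpacing * 2 ** l; // half the resolution of the previous level
    const snap = spacing * 2;             // align to the next coarser level's grid
    levels.push({
      level: l,
      spacing,
      originX: Math.floor(cameraX / snap) * snap,
      originZ: Math.floor(cameraZ / snap) * snap,
    });
  }
  return levels;
}

const clip = clipmapLevels(13.7, -3.2, 3);
console.log(clip);
```

Because the origins move in discrete steps, terrain vertices stay at fixed world positions while the camera moves, which avoids swimming artefacts.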

7. Impostors & Billboard Sprites

When objects become very small on screen (screenSize < 0.005), even a 100-triangle mesh wastes vertex-shader cycles. The cheapest representation is an impostor: a single quad (2 triangles) textured with a pre-rendered image of the object.

Camera-Facing Billboards

A billboard quad always faces the camera. There are three variants:

// Vertex shader: cylindrical billboard (expand quad around instance position)
vec3 up = vec3(0.0, 1.0, 0.0);
vec3 toCamera = normalize(u_cameraPos - instancePos);
vec3 right = normalize(cross(up, toCamera));
vec3 worldPos = instancePos
  + right * a_offsetXY.x   // horizontal expand
  + up * a_offsetXY.y;     // vertical expand (no z)
gl_Position = u_viewProj * vec4(worldPos, 1.0);
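
The same math on the CPU, handy for checking corner positions outside a shader. billboardCorner is a hypothetical helper; like the shader, it degenerates when the camera sits directly above or below the instance, because up and toCamera become parallel.

```javascript
const sub = (a, b) => [a[0] - b[0], a[1] - b[1], a[2] - b[2]];
const cross = (a, b) => [
  a[1] * b[2] - a[2] * b[1],
  a[2] * b[0] - a[0] * b[2],
  a[0] * b[1] - a[1] * b[0],
];
const normalize = (v) => {
  const len = Math.hypot(v[0], v[1], v[2]);
  return [v[0] / len, v[1] / len, v[2] / len];
};

// World-space position of one quad corner of a cylindrical billboard.
// offset = [x, y] in quad-local units (for example [-0.5, 0] to [0.5, 1]).
function billboardCorner(instancePos, cameraPos, offset) {
  const up = [0, 1, 0];
  const toCamera = normalize(sub(cameraPos, instancePos));
  const right = normalize(cross(up, toCamera));
  return [
    instancePos[0] + right[0] * offset[0] + up[0] * offset[1],
    instancePos[1] + right[1] * offset[0] + up[1] * offset[1],
    instancePos[2] + right[2] * offset[0] + up[2] * offset[1],
  ];
}

console.log(billboardCorner([0, 0, 0], [0, 0, 10], [0.5, 1]));
```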

Multi-View Impostors (Octahedral Impostors)

Pre-render the object from a hemisphere of viewpoints (typically 8×8 = 64 directions) and pack the results into a texture atlas. At runtime, pick the nearest pre-rendered views for the current camera direction (two or three, depending on the scheme) and blend between them. The result is convincing from all angles at the cost of a few texture samples per pixel. This technique (popularised by Amplify Impostors for Unity and by UE4 foliage workflows) renders complex tree canopies faithfully with only 2 triangles per instance.
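
Choosing the atlas frame for a view direction can be sketched with a hemi-octahedral mapping. This mapping is one common choice and an assumption here, since the text does not specify the layout; hemiOctFrame is a hypothetical helper that expects a direction with y ≥ 0 (y up).

```javascript
// Map a view direction on the upper hemisphere to atlas coordinates.
// Returns uv in [0, 1], the lower-left frame cell, and blend fractions.
function hemiOctFrame(dir, gridSize = 8) {
  const [x, y, z] = dir;
  const l1 = Math.abs(x) + Math.abs(y) + Math.abs(z);
  const ox = x / l1, oz = z / l1;   // project onto the octahedron
  const u = (ox + oz + 1) / 2;      // rotate the diamond into a unit square
  const v = (ox - oz + 1) / 2;
  const fx = u * (gridSize - 1), fy = v * (gridSize - 1);
  const col = Math.min(gridSize - 2, Math.floor(fx));
  const row = Math.min(gridSize - 2, Math.floor(fy));
  return { u, v, col, row, fracX: fx - col, fracY: fy - row };
}

console.log(hemiOctFrame([0, 1, 0])); // looking straight down at the object
```

The neighbouring frames at (col, row), (col + 1, row), and so on are the candidates to blend, weighted by fracX and fracY.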

Signed-Distance-Field (SDF) Impostors

Instead of storing colour, store a signed distance field in the impostor texture. In the fragment shader, use the SDF for pixel-accurate alpha blending and normal reconstruction — giving smooth silhouettes that do not pixelate when zoomed. Used for fonts, particles, and small foliage at intermediate distances.
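
The two uses of the distance field can be sketched on the CPU. Assumptions: distances are negative inside the shape, pixelWidth is the size of one screen pixel in distance units (a shader would typically derive it from screen-space derivatives), and both function names are hypothetical.

```javascript
// Anti-aliased coverage: 1 inside, 0 outside, linear ramp one pixel wide.
function sdfCoverage(signedDistance, pixelWidth) {
  return Math.min(1, Math.max(0, 0.5 - signedDistance / pixelWidth));
}

// 2D normal from the field's gradient, using central differences.
function sdfNormal(sample, x, y, h = 1) {
  const gx = (sample(x + h, y) - sample(x - h, y)) / (2 * h);
  const gy = (sample(x, y + h) - sample(x, y - h)) / (2 * h);
  const len = Math.hypot(gx, gy) || 1;
  return [gx / len, gy / len];
}

// Example field: a circle of radius 5 centred at the origin.
const circle = (x, y) => Math.hypot(x, y) - 5;
console.log(sdfCoverage(circle(5, 0), 1), sdfNormal(circle, 10, 0));
```

Because coverage is computed from distance rather than stored alpha, the edge stays one pixel wide at any magnification.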

8. WebGPU: Indirect & Multi-Indirect Draw

WebGPU (available in Chrome 113+) exposes indirect draws through drawIndirect() and drawIndexedIndirect(), enabling GPU-driven rendering pipelines that WebGL cannot express.

Indirect Draw in WebGPU

// CPU: create indirect draw argument buffer
// Layout: [ vertexCount, instanceCount, firstVertex, firstInstance ]
const indirectBuffer = device.createBuffer({
  size: 4 * 4,       // 4 uint32 values
  usage: GPUBufferUsage.INDIRECT | GPUBufferUsage.STORAGE,
});

// Compute shader (WGSL) populates instanceCount after culling:
// @group(0) @binding(2) var<storage, read_write> indirect: DrawIndirectArgs;
// atomicStore(&indirect.instanceCount, culledCount); // instanceCount must be atomic<u32>

// Render pass — no CPU knowledge of count needed:
passEncoder.drawIndirect(indirectBuffer, 0);
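
The argument layouts can be written out on the CPU as below, for example to seed the buffer before the first cull pass. The helper names are hypothetical. The indexed variant has five 32-bit values, with a signed baseVertex, and is consumed by drawIndexedIndirect().

```javascript
// Arguments for drawIndirect(): 4 × uint32.
function makeDrawIndirectArgs({ vertexCount, instanceCount = 0, firstVertex = 0, firstInstance = 0 }) {
  return new Uint32Array([vertexCount, instanceCount, firstVertex, firstInstance]);
}

// Arguments for drawIndexedIndirect(): 5 × 32-bit values, baseVertex is signed.
function makeDrawIndexedIndirectArgs({ indexCount, instanceCount = 0, firstIndex = 0, baseVertex = 0, firstInstance = 0 }) {
  const buffer = new ArrayBuffer(5 * 4);
  const u32 = new Uint32Array(buffer);
  const i32 = new Int32Array(buffer);
  u32[0] = indexCount;
  u32[1] = instanceCount;
  u32[2] = firstIndex;
  i32[3] = baseVertex;
  u32[4] = firstInstance;
  return u32;
}

const args = makeDrawIndirectArgs({ vertexCount: 36 });
const indexedArgs = makeDrawIndexedIndirectArgs({ indexCount: 1200, baseVertex: -8 });
// Upload (the buffer additionally needs GPUBufferUsage.COPY_DST):
// device.queue.writeBuffer(indirectBuffer, 0, args);
```

Starting with instanceCount at 0 means nothing is drawn until the culling pass has written a count.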
    

Multi-Draw Indirect

WebGPU's drawIndirect() issues one draw. For multiple LOD buckets or mesh clusters, multi-draw indirect issues an array of draw commands from a GPU buffer in a single API call. It is not part of the core WebGPU API at the time of writing; experimental support is available in Chrome behind flags. GPU-side draw submission of this kind underpins Nanite-style virtual geometry scheduling on modern hardware.

Complete GPU-Driven Pipeline

  1. Upload: all instance transforms, bounding spheres, LOD thresholds to storage buffers (once or on dirty-only update).
  2. Cull compute pass: frustum + Hi-Z occlusion cull; write visible instances per LOD to compact buffers; write per-LOD counts to indirect args buffer.
  3. Sort pass: optional GPU radix sort by material ID to improve cache coherence.
  4. Render pass: drawIndirect() per LOD chunk — total ~10–20 draw calls regardless of scene complexity.
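
Steps 2 and 4 above might be encoded per frame as in the sketch below (the optional sort pass is omitted). Every field on the resources object r is a hypothetical name, the workgroup size of 64 is an assumption that must match the cull shader, and the indirect buffer is assumed to have INDIRECT, STORAGE, and COPY_DST usage.

```javascript
// Encode one GPU-driven frame: reset counts, cull on the GPU, draw indirectly.
function encodeGpuDrivenFrame(device, r) {
  // Reset per-LOD instance counts to zero before culling.
  device.queue.writeBuffer(r.indirectArgs, 0, r.initialArgs);

  const encoder = device.createCommandEncoder();

  // Cull pass: one thread per instance.
  const cull = encoder.beginComputePass();
  cull.setPipeline(r.cullPipeline);
  cull.setBindGroup(0, r.cullBindGroup);
  cull.dispatchWorkgroups(Math.ceil(r.instanceCount / 64));
  cull.end();

  // Render pass: one indirect draw per LOD, 16 bytes of arguments each.
  const pass = encoder.beginRenderPass(r.renderPassDescriptor);
  pass.setPipeline(r.renderPipeline);
  pass.setBindGroup(0, r.renderBindGroup);
  r.lods.forEach((lod, i) => {
    pass.setVertexBuffer(0, lod.vertexBuffer);
    pass.drawIndirect(r.indirectArgs, i * 16);
  });
  pass.end();

  device.queue.submit([encoder.finish()]);
}
```

The CPU never reads the visible counts back: it issues the same small, fixed set of commands every frame regardless of how many instances survive culling.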

Real-world numbers: Epic's Nanite renders scenes with billions of polygons at 60+ fps using this approach. For WebGPU simulation scenes, GPU-driven instanced rendering with frustum culling typically achieves a 50–200× reduction in CPU time compared to individual draw calls, enabling real-time simulation of millions of particles or agents at interactive frame rates.