How Audio Fingerprinting Works: The Shazam Algorithm

🎵 Signal Processing · Algorithms

📅 March 2026 · ⏱ 9 min read · 🟡 Intermediate · Last updated: 22 June 2026

Shazam can identify a song from about 10 seconds of ambient audio, despite background noise and compression artifacts. The core idea, landmark-based spectrogram hashing, is elegant enough to understand completely. Here is how it works.

Written by MySimulator Team · Reviewed by MySimulator Editorial Review

1. Audio as a Spectrogram

A raw audio signal is a time-domain waveform — amplitude vs time. For music identification, time-domain comparison is impractical: two recordings of the same song differ slightly in timing, volume, background noise, and encoding.

The key insight is to work in the time-frequency domain. A spectrogram shows frequency content over time — a 2D image where the x-axis is time, y-axis is frequency, and brightness encodes amplitude. A song has a characteristic "fingerprint" in this space that is robust to many distortions.

Specifically, the positions of peaks (local maxima of energy) in the spectrogram are highly reproducible. The same drum hit at 200 Hz will produce a peak at approximately the same location regardless of recording setup or volume level.

2. The Short-Time Fourier Transform

The Short-Time Fourier Transform (STFT) computes the Fourier transform over a sliding window of time. For each window position t:

X[t, f] = Σₙ x[n] · w[n − t] · e^{−j2πfn/N}

Where x[n] is the audio signal, w[n] is a window function (Hann, Hamming), and N is the FFT size. Typical parameters for music:

Window size: 4096 samples at 44,100 Hz → ~93 ms per frame
Hop size: 1024 samples → 75% overlap between adjacent frames
Frequency resolution: 44100/4096 ≈ 10.77 Hz per bin

The magnitude spectrogram is |X[t, f]|. Spectrograms are often log-scaled and mapped to a log-spaced frequency axis (such as the mel scale) to match human auditory perception, but Shazam uses linear frequency bins partitioned into sub-bands.
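As a rough illustration of the STFT described above, the sketch below computes a magnitude spectrogram with a Hann window and a textbook recursive radix-2 FFT. The function names (hann, fft, stft) and the small test tone are assumptions made for this example; it is not Shazam's implementation, and a production version would use an optimized FFT.

```javascript
// Periodic Hann window of length n.
function hann(n) {
  return Float64Array.from({ length: n }, (_, i) => 0.5 - 0.5 * Math.cos((2 * Math.PI * i) / n));
}

// Recursive radix-2 FFT. re and im are plain arrays whose length is a power of two.
function fft(re, im) {
  const n = re.length;
  if (n === 1) return { re: [re[0]], im: [im[0]] };
  const even = fft(re.filter((_, i) => i % 2 === 0), im.filter((_, i) => i % 2 === 0));
  const odd = fft(re.filter((_, i) => i % 2 === 1), im.filter((_, i) => i % 2 === 1));
  const outRe = new Array(n);
  const outIm = new Array(n);
  for (let k = 0; k < n / 2; k++) {
    const ang = (-2 * Math.PI * k) / n; // twiddle factor e^{-j2πk/n}
    const tr = Math.cos(ang) * odd.re[k] - Math.sin(ang) * odd.im[k];
    const ti = Math.cos(ang) * odd.im[k] + Math.sin(ang) * odd.re[k];
    outRe[k] = even.re[k] + tr;
    outIm[k] = even.im[k] + ti;
    outRe[k + n / 2] = even.re[k] - tr;
    outIm[k + n / 2] = even.im[k] - ti;
  }
  return { re: outRe, im: outIm };
}

// Magnitude spectrogram: one Float64Array of fftSize / 2 bins per frame.
function stft(signal, fftSize, hop) {
  const win = hann(fftSize);
  const frames = [];
  for (let start = 0; start + fftSize <= signal.length; start += hop) {
    const re = Array.from({ length: fftSize }, (_, i) => signal[start + i] * win[i]);
    const im = new Array(fftSize).fill(0);
    const spec = fft(re, im);
    const mag = new Float64Array(fftSize / 2);
    for (let k = 0; k < fftSize / 2; k++) mag[k] = Math.hypot(spec.re[k], spec.im[k]);
    frames.push(mag);
  }
  return frames;
}

// Demo with small, hypothetical parameters: a 1024 Hz tone sampled at 8192 Hz.
// With fftSize = 256 each bin is 8192 / 256 = 32 Hz wide, so the tone sits in bin 32.
const demoRate = 8192;
const tone = Float64Array.from({ length: 1024 }, (_, n) => Math.sin((2 * Math.PI * 1024 * n) / demoRate));
const demoFrames = stft(tone, 256, 64); // hop of 64 gives the same 75% overlap as 4096/1024
const peakBin = demoFrames[0].indexOf(Math.max(...demoFrames[0]));
console.log(demoFrames.length, peakBin);
```

With the article's parameters the call would be stft(samples, 4096, 1024); the demo uses a smaller FFT only to keep the example quick to run.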

Shazam paper: Wang, A.L. (2003). "An Industrial-Strength Audio Search Algorithm." ISMIR 2003. This is the original published description of the algorithm.

3. Finding Landmarks (Peak Detection)

Landmarks are local maxima in the spectrogram — points (t, f) where the energy exceeds all neighbors within some time-frequency neighborhood. Algorithm:

Partition the frequency axis into logarithmically-spaced bands (e.g., 6 bands)
Within each band and each time frame, find the highest-energy bin
Apply a threshold: only retain peaks above the mean energy × some factor
Apply a minimum distance constraint: peaks must be separated by at least Δt in time and Δf in frequency
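The four steps above can be sketched as a single pass over the spectrogram. This is a simplified illustration: the function name findLandmarks, the band edges, and the default factor and distance values are assumptions chosen for the example, and the distance check is greedy (the first peak found wins), which a more careful implementation might replace with a strongest-peak rule.

```javascript
// spectrogram: array of frames, each an array of bin magnitudes.
// bandEdges: bin indices delimiting the bands, e.g. [0, 2, 4, 8, 16].
function findLandmarks(spectrogram, bandEdges, factor = 2, minDt = 2, minDf = 3) {
  const landmarks = [];
  spectrogram.forEach((frame, t) => {
    const mean = frame.reduce((a, b) => a + b, 0) / frame.length;
    for (let b = 0; b + 1 < bandEdges.length; b++) {
      // Step 2: highest-energy bin inside this band for this frame.
      let best = bandEdges[b];
      for (let f = bandEdges[b]; f < bandEdges[b + 1]; f++) {
        if (frame[f] > frame[best]) best = f;
      }
      // Step 3: relative threshold against the frame's mean energy.
      if (frame[best] <= mean * factor) continue;
      // Step 4: reject peaks too close to an already accepted landmark.
      const tooClose = landmarks.some(
        (p) => t - p.t < minDt && Math.abs(best - p.f) < minDf
      );
      if (!tooClose) landmarks.push({ t, f: best });
    }
  });
  return landmarks;
}

// Toy spectrogram: 3 frames of 16 bins with a flat floor of 1.
const flat = () => new Array(16).fill(1);
const frame0 = flat(); frame0[5] = 20;  // strong peak in band [4, 8)
const frame1 = flat(); frame1[5] = 20;  // same peak sustained one frame later
const frame2 = flat(); frame2[10] = 30; // new peak in band [8, 16)
const demoLandmarks = findLandmarks([frame0, frame1, frame2], [0, 2, 4, 8, 16]);
console.log(demoLandmarks); // the sustained peak in frame 1 is suppressed by the distance rule
```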

A 3-minute song at these settings yields roughly 1,000–2,000 landmarks — a sparse but consistent representation. A 10-second recording excerpt yields ~100 landmarks, enough for reliable matching.

The sparsity is crucial: we don't need to match every frequency, only the prominent peaks. This makes the algorithm robust to additive noise, which spreads energy uniformly, raising the floor but not significantly shifting the peaks.

4. Combinatorial Hashing

Individual peaks alone are insufficient for identification. The same peak could occur in many songs. The Shazam approach pairs each anchor point with multiple nearby target points in a conical zone ahead in time:

hash = (f_anchor, f_target, Δt) where Δt = t_target − t_anchor ∈ [1, FAN_OUT_WINDOW]

For each anchor, if we pair it with up to F = 15 target peaks in the zone, each anchor generates 15 hashes. With 1,500 landmarks in a song, that's 22,500 hashes per song stored in the database. Each hash also stores the absolute time offset of the anchor: (hash → song_id, t_anchor).

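The combinatorial space is large, since three integer fields are combined, so collisions across different songs are rare. The exact probability depends on the hash space size and on how evenly the values are distributed. For an idealized, uniformly distributed 32-bit hash, two unrelated hashes match with probability 2⁻³² ≈ 2.3 × 10⁻¹⁰; real spectra are far from uniform, so actual collisions are more frequent than that bound suggests.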

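function hashPair(f1, f2, dt) {
  // Pack three integers into a single unsigned 32-bit hash:
  //   bits 23-31: f1, anchor frequency bin index (0-511)
  //   bits 14-22: f2, target frequency bin index (0-511)
  //   bits 0-13:  dt, time delta in frames (1-50 fits easily in 14 bits)
  // `>>> 0` converts the signed result of the bitwise operators to unsigned,
  // so hashes with f1 >= 256 do not come out negative.
  return (((f1 & 0x1FF) << 23) | ((f2 & 0x1FF) << 14) | (dt & 0x3FFF)) >>> 0;
}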
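To show how anchors and targets are paired before packing, here is a minimal sketch of the fan-out loop. The names packHash and generateFingerprints, the record shape { hash, tAnchor }, and the small FAN_OUT of 3 are assumptions for this example (the article's figure is 15); packHash repeats the 9/9/14-bit layout so that the block runs on its own.

```javascript
const FAN_OUT = 3;         // max targets per anchor (kept small for the demo)
const FAN_OUT_WINDOW = 50; // max time delta in frames

// Same 9 + 9 + 14 bit layout as the packing function above, returned unsigned.
function packHash(f1, f2, dt) {
  return (((f1 & 0x1ff) << 23) | ((f2 & 0x1ff) << 14) | (dt & 0x3fff)) >>> 0;
}

// landmarks: array of { t, f } in frame and bin units.
function generateFingerprints(landmarks) {
  const sorted = [...landmarks].sort((a, b) => a.t - b.t || a.f - b.f);
  const out = [];
  for (let i = 0; i < sorted.length; i++) {
    let paired = 0;
    for (let j = i + 1; j < sorted.length && paired < FAN_OUT; j++) {
      const dt = sorted[j].t - sorted[i].t;
      if (dt < 1) continue;           // same frame: not ahead in time
      if (dt > FAN_OUT_WINDOW) break; // sorted by time, so later ones are further away
      out.push({ hash: packHash(sorted[i].f, sorted[j].f, dt), tAnchor: sorted[i].t });
      paired++;
    }
  }
  return out;
}

const demoFingerprints = generateFingerprints([
  { t: 0, f: 10 },
  { t: 2, f: 20 },
  { t: 5, f: 30 },
  { t: 100, f: 40 }, // too far ahead of every anchor to be paired
]);
console.log(demoFingerprints.length, demoFingerprints[0]);
```

The target zone here is a plain time window; restricting targets by frequency distance as well would be a natural refinement.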

5. Database Lookup and Time Coherence

At query time, compute the same ~100 landmarks from the 10-second recording, generate ~1,500 hashes, and look each up in the hash table. Each match returns (song_id, stored_t_anchor).

A naive count of matches per song would work poorly — hashes repeat across songs even if rarely. The key is time coherence: if the query excerpt starts at time t_q within the song, then for every matching hash the offset Δ = stored_t − query_t should be constant for all hashes belonging to that song.

Plot a histogram of Δ for each candidate song. The correct song produces a sharp spike — many hashes align at the same offset, while random collisions scatter uniformly. The song with the tallest, narrowest spike wins:

score(song) = max_Δ { count of hashes with stored_t − query_t = Δ }
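The scoring rule can be sketched with an in-memory Map standing in for the database. The names buildIndex and bestMatch and the toy hash values are assumptions for illustration; a real system would use a persistent index and apply an acceptance threshold to the score.

```javascript
// songs: { songId: [{ hash, tAnchor }, ...] } -> Map(hash -> [{ songId, tAnchor }])
function buildIndex(songs) {
  const index = new Map();
  for (const [songId, fingerprints] of Object.entries(songs)) {
    for (const { hash, tAnchor } of fingerprints) {
      if (!index.has(hash)) index.set(hash, []);
      index.get(hash).push({ songId, tAnchor });
    }
  }
  return index;
}

// Returns the song and offset with the most hashes agreeing on stored_t - query_t.
function bestMatch(index, queryFingerprints) {
  const histograms = new Map(); // songId -> Map(offset -> count)
  for (const { hash, tAnchor: queryT } of queryFingerprints) {
    for (const { songId, tAnchor: storedT } of index.get(hash) ?? []) {
      const delta = storedT - queryT;
      if (!histograms.has(songId)) histograms.set(songId, new Map());
      const hist = histograms.get(songId);
      hist.set(delta, (hist.get(delta) ?? 0) + 1);
    }
  }
  let best = { songId: null, offset: null, score: 0 };
  for (const [songId, hist] of histograms) {
    for (const [offset, count] of hist) {
      if (count > best.score) best = { songId, offset, score: count };
    }
  }
  return best;
}

// Toy data: the query is an excerpt of song A starting 100 frames in.
const demoIndex = buildIndex({
  A: [{ hash: 1, tAnchor: 100 }, { hash: 2, tAnchor: 110 }, { hash: 3, tAnchor: 120 }, { hash: 4, tAnchor: 130 }],
  B: [{ hash: 1, tAnchor: 5 }, { hash: 3, tAnchor: 70 }, { hash: 9, tAnchor: 80 }],
});
const demoQuery = [{ hash: 1, tAnchor: 0 }, { hash: 2, tAnchor: 10 }, { hash: 3, tAnchor: 20 }];
const demoResult = bestMatch(demoIndex, demoQuery);
console.log(demoResult);
```

Song B shares two hashes with the query, but their offsets (5 and 50) disagree, so each lands in its own histogram bin and B never scores above 1.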

A score above a threshold (say, 20+ coincident hashes) is accepted. In practice Shazam's database holds 11 million songs. Their 2023 engineering blog notes average identification time < 1 second including network round-trip.

6. Why It's Noise-Robust

Additive noise: Raises the spectral floor uniformly; peaks remain the local maxima so long as signal-to-noise ratio > ~−10 dB in relevant frequency bands.
Volume variation: Peak detection uses relative thresholds (no absolute amplitude scaling), so a quiet recording and a loud one produce the same peaks.
Bandpass distortion: Phone speakers/microphones cut frequencies <300 Hz and >8 kHz. Shazam uses fingerprints from 300–2000 Hz, well within phone bandwidth.
Tempo/pitch variation: The algorithm does NOT handle arbitrary pitch/tempo shifts. (For cover detection, more expensive algorithms like acoustic fingerprints via deep learning are needed.)
Partial recordings: The time-coherence check is self-aligning — you don't need to know where in the song the query falls.
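The volume and noise-floor points can be checked on a toy frame. In this sketch (the function name and the numbers are assumptions for illustration) peaks are kept when they are local maxima above a multiple of the frame mean, so scaling the whole frame, or adding a modest constant floor, leaves the peak positions unchanged.

```javascript
// Bins that are strict local maxima and exceed factor x the frame mean.
function peaksAboveRelativeThreshold(frame, factor = 2) {
  const mean = frame.reduce((a, b) => a + b, 0) / frame.length;
  const peaks = [];
  for (let k = 1; k + 1 < frame.length; k++) {
    const isLocalMax = frame[k] > frame[k - 1] && frame[k] > frame[k + 1];
    if (isLocalMax && frame[k] > factor * mean) peaks.push(k);
  }
  return peaks;
}

const loud = [1, 2, 12, 2, 1, 1, 9, 1];
const quiet = loud.map((v) => v * 0.125); // same frame recorded 8x quieter
const noisy = loud.map((v) => v + 1);     // same frame with a uniform floor added

const loudPeaks = peaksAboveRelativeThreshold(loud);
const quietPeaks = peaksAboveRelativeThreshold(quiet);
const noisyPeaks = peaksAboveRelativeThreshold(noisy);
console.log(loudPeaks, quietPeaks, noisyPeaks);
```

A much larger floor would eventually push the threshold above the peaks, which matches the signal-to-noise limit noted in the list above.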

7. Implementation with Web Audio API

The Web Audio API's AnalyserNode provides a real-time spectrogram directly in the browser:

const ctx = new AudioContext();
const analyser = ctx.createAnalyser();
analyser.fftSize = 4096;           // frequency resolution
analyser.smoothingTimeConstant = 0; // no temporal smoothing

const mic = await navigator.mediaDevices.getUserMedia({ audio: true });
const source = ctx.createMediaStreamSource(mic);
source.connect(analyser);

const bufLen = analyser.frequencyBinCount;  // 2048
const spectrum = new Float32Array(bufLen);

function captureFrame() {
  analyser.getFloatFrequencyData(spectrum);
  // spectrum[k] = magnitude in dB for bin k
  // Frequency of bin k = k * sampleRate / fftSize
  return spectrum.slice();
}

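// Poll roughly once per window: 4096 / 44100 ≈ 93 ms, about 10.8 frames per second with no overlap.
// setInterval timing is only approximate, and ctx.sampleRate may be 48000 rather than 44100.
// detectPeaks and generateHashes are placeholders for the steps in sections 3 and 4.
setInterval(() => {
  const frame = captureFrame();
  const peaks = detectPeaks(frame, /* threshold= */ -40 /* dB */);
  generateHashes(peaks);
}, 93 /* ms ≈ fftSize / sampleRate (the window length, not the 1024-sample hop) */);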
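The snippet calls detectPeaks without defining it. One possible shape for it, working on the dB values that getFloatFrequencyData writes, is sketched below. The options object, its defaults, and the returned { bin, freqHz, db } records are assumptions for this example, not part of the Web Audio API.

```javascript
// frameDb: Float32Array of per-bin levels in dB, as filled by getFloatFrequencyData.
// Returns local maxima louder than thresholdDb, with each bin converted to Hz.
function detectPeaks(frameDb, thresholdDb, { sampleRate = 44100, fftSize = 4096 } = {}) {
  const peaks = [];
  for (let k = 1; k + 1 < frameDb.length; k++) {
    const level = frameDb[k];
    if (level > thresholdDb && level > frameDb[k - 1] && level >= frameDb[k + 1]) {
      peaks.push({ bin: k, freqHz: (k * sampleRate) / fftSize, db: level });
    }
  }
  return peaks;
}

// Toy frame: two bins above -40 dB, one bump at -50 dB that should be ignored.
const demoFrameDb = Float32Array.from([-100, -30, -100, -100, -50, -100, -20, -100]);
const demoPeaks = detectPeaks(demoFrameDb, -40);
console.log(demoPeaks.map((p) => p.bin));
```

In the browser, passing { sampleRate: ctx.sampleRate, fftSize: analyser.fftSize } keeps the Hz conversion correct when the context does not run at 44,100 Hz.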

A full JS implementation (~300 lines) can fingerprint audio locally and compare against a pre-built hash map. Open-source reference implementations exist as npm packages (local-audio-fingerprint, fingerprint-js) for experimentation.